Economic and Environmental Analysis of Life Expectancy in China and India: A Data Driven Approach

Article Info
Article history: Received: 08 June, 2020; Accepted: 13 September, 2020; Online: 17 September, 2020

Abstract
The data analytic approach presented in this work covers both descriptive and predictive modeling with two main objectives: (1) discovering factors related to the longevity of populations in the two most populated nations, China and India, and (2) generating life expectancy predictive models for both countries. The descriptive modeling methods used to explore the major environmental and economic factors expected to affect longevity patterns are web graph analysis and chi-squared automatic interaction detection (CHAID). Web graph analysis is applied for ease of visualization, and CHAID is applied to discover the factors related to longevity. According to the analysis results, particulate emission, including ozone pollution and PM2.5 concentrations, is the most important factor threatening the lives of the populations in both China and India. To predict the number of years an individual is expected to live based on the available environmental and economic factors, several statistical and machine learning techniques are applied, and a linear regression model turns out to yield the most accurate predictions.


Introduction
Longevity, education and income are the three main indicators that the United Nations Development Program (UNDP) has adopted for computing the human development index (HDI) to assess the development level of each country [1]. The HDI captures three essential aspects: a long and healthy life, access to sufficient knowledge, and a decent standard of living. The health aspect is measured by life expectancy at birth, which is the number of years a new-born baby is expected to live, averaged over the cohort. The knowledge aspect is measured by years of schooling. The standard of living aspect is measured by the gross national income per capita. The HDI is the geometric mean of the three aspects after each has been normalized. This paper focuses on the longevity indicator through the life expectancy at birth measure, as it is considered [2], [3] a reflection of the good health of a population.
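The geometric-mean construction of the HDI described above can be sketched in a few lines. This is only an illustrative sketch: the function names, the goalposts of 20 and 85 years used to normalize life expectancy, and the education and income index values are assumptions chosen for the example and are not taken from this paper.

```python
def normalize(value, minimum, maximum):
    """Rescale a raw indicator to the 0-1 range using fixed goalposts."""
    return (value - minimum) / (maximum - minimum)


def hdi(health_index, education_index, income_index):
    """Geometric mean of the three normalized dimension indices."""
    return (health_index * education_index * income_index) ** (1.0 / 3.0)


# Hypothetical example: life expectancy of 72.5 years, with assumed
# goalposts of 20 and 85 years, gives a health index of about 0.81.
health = normalize(72.5, 20.0, 85.0)
print(round(health, 3))

# The education and income indices below are made-up illustrative values.
print(round(hdi(health, 0.65, 0.70), 3))
```

Because the index is a geometric mean, a low value in one aspect cannot be fully offset by a high value in another, which is one reason the longevity component deserves separate analysis.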
The analysis of longevity trends within and across nations is of interest to many groups of researchers [4]-[7]. The number of years an individual is expected to live is also important to actuaries for designing optimal and economical insurance and pension plans [8]-[10]. Both positive and negative factors affecting longevity have been investigated by several researchers. For instance, Chinese researchers have explored factors relating to energy consumption in the daily life of Chinese people through the use of coal and electricity [11], [12]. Their results show that coal usage is related to a shortened life, whereas domestic electricity usage is positively correlated with longevity. However, the choice of energy source depends on household income. Many researchers [13], [14], [15] have shown that such economic and socioeconomic factors can affect the lifespan of a population.
The literature review shows that most researchers study longevity by building a predictive model to forecast the number of years a population is expected to live, using various methods including regression [24], autoregressive integrated moving average [25]-[27], and neural network algorithms [28], [29]. Some researchers [30] apply an ensemble method that produces forecasts from a number of models and averages them to predict the years of living.
In this work, we contribute to demography as well as to environmental research by proposing a different longevity analysis approach in which both numeric and categorical modeling are applied, instead of numeric computation alone. Our data analytic method generates a descriptive model through categorical computation to reveal life-threatening factors, and also produces a predictive model that makes a numeric prediction of the number of years an individual is expected to live based on the available economic and environmental factors. The data source and our analysis methodology are explained in Section 2. Results from the data analytic approach are presented in Section 3. Performance evaluation of the predictive models is shown in Section 4. The conclusion is in Section 5.

Materials and Methods
The main purpose of our research is to present descriptive and predictive modeling methods to automatically discover major factors influencing the good health of people living in highly populated countries. We choose China and India as our case study because they are the two most populated countries in the world (China = 1.43 billion, India = 1.36 billion [31]) and the trends of the human development index (HDI) of the two countries are similar. The geographic location of China and India and the trend in HDI improvement from 1990 to 2018 are shown in Figures 1 and 2.

Forest_depletion: Net forest depletion; if growth exceeds harvest, this figure is zero (% of GNI)
Particulate_emis: Particulate emission damage, calculated as foregone labor income due to premature death due to exposure to ozone pollution and indoor concentrations of PM2.5 in households cooking with solid fuels (% of GNI)
Agri_met_emis: Agricultural methane emissions from animals, animal waste, rice production, and agricultural waste burning (% of total)

Among all 17 numeric data attributes, life expectancy at birth is the target of our analysis. The main steps in the data-driven analytical approach are illustrated in Figure 3.
The first step is data extraction, which includes selecting data attributes from the World Bank database and preparing the data in a format suitable for the subsequent analysis steps. The next step is data exploration, which is the analysis of correlation among data attributes. The third step is the discovery of factors threatening the lives of the populations in China and India. This step requires transforming numeric values into categorical ones through a binning approach. The descriptive model that reveals important factors affecting lifespan is derived by the chi-squared automatic interaction detection (CHAID) algorithm [33]. This algorithm has been adopted because its model, represented as a tree structure, has many advantages such as efficiency, interpretability, and successful adoption in solving a wide range of problems [34], [35].
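The binning step and the chi-squared test at the heart of CHAID can be illustrated with a small sketch. It is a simplification under stated assumptions and not the procedure used in this work: equal-width binning into three levels is assumed (the binning scheme is not specified here), the data values are invented, and only the selection of the single most associated factor is shown. A full CHAID implementation also merges similar categories, compares adjusted p-values rather than raw statistics, and recurses to grow the tree.

```python
from collections import Counter


def equal_width_bins(values, labels=("low", "medium", "high")):
    """Map numeric values to category labels using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels) or 1.0  # guard against constant columns
    binned = []
    for v in values:
        index = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned


def chi_squared(predictor, target):
    """Pearson chi-squared statistic of the predictor-by-target table."""
    n = len(predictor)
    cell = Counter(zip(predictor, target))
    row = Counter(predictor)
    col = Counter(target)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat


def best_split_factor(factors, target):
    """Name of the factor whose categories are most associated with the target."""
    return max(factors, key=lambda name: chi_squared(factors[name], target))


# Invented illustrative values; they are NOT World Bank figures.
life = equal_width_bins([60.0, 62.0, 65.0, 70.0, 74.0, 76.0])
factors = {
    "Particulate_emis": equal_width_bins([3.0, 2.9, 2.0, 1.5, 0.6, 0.5]),
    "Forest_depletion": equal_width_bins([0.1, 0.9, 0.2, 0.8, 0.1, 0.9]),
}
print(best_split_factor(factors, life))
```

In this toy data the particulate emission factor moves in step with the life expectancy categories, so it is chosen as the first split; the forest depletion values were made deliberately uninformative.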
To visualize factor association, we adopt the web graph method. The fourth and last step of our analytical method is predictive model generation. We adopt eight algorithms in this modeling step: regression, generalized linear model (GLM), k-nearest neighbors (kNN), support vector machine (SVM), classification and regression tree (CART), chi-squared automatic interaction detection (CHAID), artificial neural network (ANN), and a linear model.
The obtained models are evaluated on their performance observed during the training process. The best five models are then selected for further testing with the hold-out method, in which a separate test dataset is applied. For the training and testing steps, the total of 58 annual records covering the years 1960-2017 was split into two separate subsets: training and testing. The training dataset comprises 45 records, whereas the testing dataset contains 13 records.
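A hold-out split of this size can be sketched as follows. One record per year from 1960 to 2017 gives 58 records, which matches the 45 training and 13 testing records stated above. Whether the split was random or chronological is not stated, so the seeded random shuffle here is an assumption, as is the helper name `holdout_split`.

```python
import random


def holdout_split(records, n_test, seed=0):
    """Shuffle a copy of the records and set aside n_test of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


years = list(range(1960, 2018))  # one record per year, 1960-2017 inclusive
train, test = holdout_split(years, n_test=13)
print(len(years), len(train), len(test))
```

For time-ordered data, a chronological split (the last 13 years held out) would be an equally plausible alternative and avoids training on years later than those being predicted.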
After evaluating the prediction performance of the best five models on the same set of test data, the most accurate model is kept as the final model. This model is to be used for forecasting the number of years an individual is expected to live. It is worth noting that the forecasting model for one country may differ from that of another because the modeling is data driven: a different set of training data may yield a different model, even though the set of data attributes is the same.

Analytical Results
According to the design of our data analytical approach, there are three main steps of data analysis: data exploration, descriptive model creation, and predictive model generation. We thus present the results from these three main steps sequentially in the following three subsections.

Data Exploration Results
The results of the Pearson correlation analysis are shown in Table 2. A similarity between the two countries is that the source of energy and the amount of energy usage have a strong positive association with longevity. The values of import, export and industry are also among the top five positive factors associated with longevity.
Regarding the factors negatively correlated with lifespan, particulate emission is among the top five factors associated with life shortening in both China and India. However, other factors such as forest depletion, forest area, the export of high-technology products, the expense in education and the national income per capita also appear to have a negative association with longer life in the two countries. These results are from preliminary data exploration. The next step is an in-depth analysis that discovers only the prominent factors relating to the longevity of the populations.
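The ranking of factors by their Pearson correlation with life expectancy can be sketched as below. The helper names and the numeric values are hypothetical and serve only to show the computation; `Energy_use` in particular is an invented attribute name, and none of the numbers come from Table 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_factors(factors, target, top=5):
    """Return the strongest positive and negative correlates of the target."""
    scored = sorted(((pearson(v, target), k) for k, v in factors.items()),
                    reverse=True)
    positive = [(k, r) for r, k in scored if r > 0][:top]
    negative = [(k, r) for r, k in reversed(scored) if r < 0][:top]
    return positive, negative


# Invented illustrative series, not taken from the World Bank data.
life_expectancy = [60.0, 64.0, 68.0, 72.0, 76.0]
factors = {
    "Energy_use": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Particulate_emis": [5.0, 4.0, 3.0, 2.0, 1.0],
}
positive, negative = rank_factors(factors, life_expectancy)
print(positive)
print(negative)
```

Since all attributes here are yearly series that trend over time, a strong correlation may partly reflect shared trends rather than a direct effect, which is consistent with treating these results as preliminary.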

Descriptive Modeling Results
Descriptive analytics refers to the process of applying statistical and other intelligent techniques to historical data to gain insight into important factors, hidden patterns or concealed behavior. To understand the characteristics of the longevity pattern, we apply the CHAID algorithm to reveal prominent factors affecting lifespan and display the result as a tree structure, as shown in Figure 4 (the root of the tree is on the left-hand side). The patterns show that particulate emission damage is the most important environmental factor shortening the lifespan of the populations in both China and India.
A web graph displaying the association between economic and environmental factors and life expectancy at birth for each population group is shown in Figure 5. The thickness of the line linking two nodes in the graph represents the strength of association: the thicker the line, the stronger the association. The web graph also shows a strong association between a short lifespan and a high level of particulate emission damage for both the Chinese and Indian populations.
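One simple way to obtain the link strengths drawn in a web graph is to count how often each pair of category values occurs together. The sketch below assumes that co-occurrence counts are the strength measure; the exact measure is not stated here, and the category labels and data are invented for illustration.

```python
from collections import Counter


def link_strengths(factor_levels, target_levels):
    """Count co-occurrences of each (factor level, target level) pair."""
    return Counter(zip(factor_levels, target_levels))


# Invented binned values, one entry per yearly record.
particulate = ["high", "high", "high", "low", "low", "medium"]
lifespan = ["short", "short", "short", "long", "long", "long"]

strengths = link_strengths(particulate, lifespan)
for (factor_level, life_level), count in strengths.most_common():
    print(f"Particulate_emis={factor_level} -- lifespan={life_level}: {count}")
```

A plotting tool would then map each count to a line width, so the pair with the largest count is drawn with the thickest link.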

Predictive Modeling Results
Predictive modeling is the data analytical approach of generating a model with the main aim of using it to forecast future events. We apply both statistical and machine learning methods to build a predictive model from the training dataset. According to the model performance comparison, linear regression yields the most accurate result for both China and India. The linear regression models that predict the lifespan of the population in each country are presented in Figure 6. Education expense is among the first two factors appearing in the models, and it shows a positive association with long life in both countries.
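A linear regression model of the kind referred to above can be fitted by ordinary least squares. The sketch below solves the normal equations in plain Python; it is not the software or the model from Figure 6, and the two predictors and all numbers are invented so that the true coefficients (50, 2 and 3) are known in advance.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear(rows, y):
    """Least-squares fit of y = b0 + b1*x1 + ... via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefficients, row):
    """Apply fitted coefficients (intercept first) to one record."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Invented records: (education expense, energy use) -> life expectancy,
# generated exactly from y = 50 + 2*x1 + 3*x2.
rows = [(2.0, 1.0), (3.0, 1.5), (2.5, 2.0), (4.0, 2.5), (3.5, 3.0)]
y = [57.0, 60.5, 61.0, 65.5, 66.0]
coefficients = fit_linear(rows, y)
print([round(c, 3) for c in coefficients])
print(round(predict(coefficients, (3.0, 2.0)), 3))
```

With only a few dozen yearly records per country, a model with this few parameters is less prone to overfitting than more flexible learners, which may help explain why it performs well on the hold-out data.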

Model Evaluation
We test the performance of the descriptive and predictive models with a separate set of test data. The descriptive model derived by the CHAID algorithm reveals the longevity pattern of the population in China with 100% accuracy, whereas the accuracy of the CHAID model drops to 84.62% when the algorithm fits a model to the Indian training cases. The decrease in accuracy may result from the low level of homogeneity in the training data.
To build a predictive model, we apply eight algorithms and then select the best five models to test on a hold-out dataset. The evaluation results for the Chinese population are given in Table 3, whereas the results of the model assessment for the Indian population are shown in Table 4. For both population groups, the best predictive model, with the lowest mean absolute error, is the one derived from the linear regression algorithm.
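Classification accuracy is the share of records whose predicted category matches the actual one. As a worked illustration, 11 correct predictions out of 13 records give 84.62%. The number of correctly classified records is not reported here, so the 11-of-13 figure is an assumption that happens to be consistent with a 13-record test set, and the labels below are invented.

```python
def accuracy(predicted, actual):
    """Percentage of records whose predicted label equals the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Invented labels: 13 records, 2 of which are misclassified.
actual = ["short"] * 6 + ["long"] * 7
predicted = ["short"] * 5 + ["long"] + ["long"] * 6 + ["short"]
print(round(accuracy(predicted, actual), 2))
```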
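Selecting the model with the lowest mean absolute error (MAE) on the hold-out data can be sketched as below. The model names and the predicted values are invented for illustration and do not reproduce Tables 3 or 4.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_model(actual, predictions_by_model):
    """Return the name of the model with the lowest MAE, plus all MAE scores."""
    scores = {name: mean_absolute_error(actual, preds)
              for name, preds in predictions_by_model.items()}
    return min(scores, key=scores.get), scores


# Invented hold-out values (years of life expectancy).
actual = [70.0, 71.0, 72.0]
predictions_by_model = {
    "linear_regression": [70.1, 70.9, 72.2],
    "knn": [69.0, 72.0, 73.5],
}
winner, scores = best_model(actual, predictions_by_model)
print(winner, {name: round(score, 3) for name, score in scores.items()})
```

MAE is expressed in the same unit as the target, so a value of 0.5 means the predictions are off by half a year on average.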

Conclusion
We present a data analytical approach to create descriptive and predictive models with two main aims: to discover threatening factors that may shorten lifespan, and to predict the average number of years an individual is expected to live based on economic and environmental variables. The analytical approach is data driven in the sense that the model is generated from the training data. Therefore, using different datasets can result in different models, even when the same data variables and the same learning algorithm are applied.
To derive a descriptive model, we adopt the CHAID algorithm. The training data are World Development Indicators accessed from the World Bank database for two specific countries, China and India, and the target of analysis is life expectancy at birth, which is one aspect that the UNDP applies in computing the human development index of a country. The models of China and India show one common finding: particulate emission damage due to exposure to ambient ozone pollution and PM2.5 is the most important factor associated with shorter longevity. We also apply several learning algorithms to derive predictive models that forecast the number of years an individual is expected to live based on the available economic and environmental factors. The most accurate predictive model is the one built with the linear regression method.