A Typological Study of Portuguese Mortality from Non-communicable Diseases

Article history: Received: 10 July, 2020; Accepted: 19 September, 2020; Online: 05 October, 2020
The most common non-communicable diseases, such as cardiovascular diseases and cancer, are a growing problem both globally and nationally. The World Health Organization considers it a priority to study the specific causes of these diseases for trend monitoring. The aim of this paper is to identify a hierarchy of clusters of Portuguese mortality from non-communicable diseases using agglomerative hierarchical cluster analysis. The Euclidean distance is used with the complete linkage and average linkage criteria. Both criteria identify six clusters, and the way the clusters join together indicates some order of disease severity. Special attention should be given to the diseases in the last two clusters; the last one is formed by ischemic heart disease, cerebrovascular diseases and larynx / trachea / bronchi and lung malignant tumor, all for males. These clustering results suggest that male gender is a risk factor for at least two groups of non-communicable diseases. Other suggested risk factors and / or pathophysiological mechanisms that may, directly or indirectly, enhance the common development of the pathologies found in the clusters arising from this study should also be a priority for study.


Introduction
An important challenge for Public Health in Portugal is to decrease the number of deaths from non-communicable diseases.
In order to generate additional information in this domain, this paper extends work originally presented at the 14th Iberian Conference on Information Systems and Technologies [1].
According to the World Health Organization (WHO), the number of people suffering from the most common non-communicable diseases, including cardiovascular disease, cancer, and chronic respiratory diseases, will increase [2]. Non-communicable diseases are a worldwide and national problem and a Public Health concern [2], [3]. Their study is a matter of major importance in the world and in Portugal in particular [4], [5]. According to the Directorate General for Health (DGS), cardiovascular diseases and major cancers account for most of the years of life lost in the Portuguese population aged 35 and over [6]. The mortality rates of malignant lung tumor and malignant rectal tumor are increasing in Portugal [6]. Decreasing mortality from cancer is the biggest challenge for the next generations because cancer is the leading cause of early mortality [7]. The decrease in early mortality is relevant due to its individual and social impact [2]. Thus, according to WHO, it is of great importance to generate additional knowledge on the specific causes of these diseases in order to monitor their trends [4]. Studies that analyze the evolution of these diseases are therefore fundamental: they may help to formulate hypotheses about risk factors and / or common pathophysiological mechanisms that, directly or indirectly, may enhance the common development of the pathologies found in the clusters resulting from a hierarchy of partitions of non-communicable diseases, which may serve as a working tool in health [8], [9]. Unsupervised data analysis, namely cluster analysis, can thus be useful to raise hypotheses about clusters of non-communicable diseases to be studied.
The development of the clustering analysis methodology is truly interdisciplinary. Taxonomists, social scientists, health scientists, psychologists, biologists, statisticians, mathematicians, engineers, medical researchers, computer scientists and others all contribute to the development of this methodology [10], [11]. Clustering analysis is a set of exploratory multivariate data analysis methods for identifying natural clusters in the data, based on a coefficient of similarity or dissimilarity between individuals or between variables, or more generally between statistical units of data [12]-[14]. This methodology aims to find homogeneous groups of statistical units in the data, where similarities between statistical units belonging to the same group are high and similarities between statistical units belonging to distinct groups are low [10], [12]-[14]. Hierarchical cluster analysis algorithms, both agglomerative and divisive, provide a hierarchy of partitions. Agglomerative hierarchical cluster analysis methods are the most widely used [14], [15]. These methods usually start with all statistical units separated into single-element groups (unit sets, or singletons), forming a partition in which the number of clusters is equal to the number of statistical units. They then successively merge the most similar groups (according to the measure of similarity or dissimilarity between statistical units and the aggregation criterion between groups) into the same cluster, until a partition with a single cluster is formed. It is recognized that agglomerative hierarchical cluster analysis can give different results for the same data, depending on the choice of the measure of comparison between statistical units and the aggregation criterion between groups [13], [16], [17]. One of the most used measures between individuals is the Euclidean distance.
Moreover, the most commonly used aggregation algorithms are the farthest neighbor (or complete linkage) and the nearest neighbor (or single linkage) ones [10], [12]. The complete linkage algorithm produces compact clusters, but is sensitive to outliers [16]-[18]. The single linkage algorithm is criticized because it may disregard the structure of clusters, and it tends to construct hierarchical partitions with a chain effect [13], [16]-[18]. The average linkage method produces more balanced hierarchies, with no chain effect. This algorithm is intermediate between the two previously mentioned, and furthermore uses more of the information in the data [13], [16]-[18]. Therefore, the three aggregation criteria are often applied to the same data sets in order to assess the robustness of the clustering results. The interpretation of an agglomerative hierarchical cluster analysis is usually based on some "relative best partitions" of the hierarchy, namely those immediately preceding the levels at which the increase / decrease in cluster dissimilarities reaches a relative maximum [13], [19]. Moreover, choosing the cut-off where the largest increase / decrease in cluster dissimilarities occurs [13] allows us to find an "absolute best partition". The reasoning is that, at the stage where the greatest increase / decrease occurs, the clusters being joined are comparatively far apart. Therefore, the best number of groups in the data should be the number of groups present in the immediately preceding step [13], [16], [17].
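The agglomerative procedure and the three aggregation criteria described above can be illustrated with a minimal sketch. It is not the software used in this study: the function names (`euclidean`, `agglomerate`), the naive search over all pairs of clusters, and the four toy units are assumptions made only for illustration.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two statistical units (equal-length vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each criterion reduces the list of between-cluster pairwise distances to one value.
LINKAGE = {
    "single": min,                            # nearest neighbor
    "complete": max,                          # farthest neighbor
    "average": lambda d: sum(d) / len(d),     # mean of all pairwise distances
}

def agglomerate(units, criterion="average"):
    """Return the aggregation order as (members_a, members_b, coefficient) tuples."""
    combine = LINKAGE[criterion]
    clusters = [[i] for i in range(len(units))]   # start from singletons
    schedule = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dists = [euclidean(units[p], units[q])
                     for p in clusters[i] for q in clusters[j]]
            coef = combine(dists)
            if best is None or coef < best[0]:
                best = (coef, i, j)
        coef, i, j = best
        schedule.append((tuple(clusters[i]), tuple(clusters[j]), coef))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return schedule

# Hypothetical toy data: 4 units described by 3 variables (NOT the study data).
units = [[1.0, 1.0, 1.0], [1.5, 1.0, 1.0], [10.0, 10.0, 10.0], [10.0, 11.0, 10.0]]

schedule = agglomerate(units, "average")
for stage, (a, b, coef) in enumerate(schedule, start=1):
    print(stage, a, b, round(coef, 3))

# Coefficient of the final merge under each criterion, for comparison.
final_coef = {name: agglomerate(units, name)[-1][2] for name in LINKAGE}
```

With n units the schedule always has n - 1 stages, and on the same data the final coefficient under single linkage cannot exceed the one under average linkage, which in turn cannot exceed the one under complete linkage.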
Goal: The aim of this study is to obtain a hierarchy of cluster partitions of Portuguese mortality from non-communicable diseases, using agglomerative hierarchical cluster analysis, in order to identify clusters of diseases associated with different degrees of severity as well as clusters of diseases explicitly associated with gender. The clustering results might help either to identify or to search for common causes or risk factors, in order to improve preventive medicine.

Data
The classification of diseases in this paper follows the Tenth Revision of the International Classification of Diseases (ICD-10). This classification makes it possible for countries to compile comparable national mortality data [20]. The non-communicable diseases used here are listed in Table 1.
This paper extends previous work presented in [1]; here, the analysis is performed with the diseases separated by gender. Table 2 displays the short names of the diseases, where M represents male gender and F represents female gender. The annual age-standardized mortality rate is the ratio between the total number of deaths expected, for a given region, gender and disease over a year, and the size of the standard population, expressed per 100 000 inhabitants. In theory any standard population could be used, but the Segi world, the European and the WHO world standard populations are the most frequently used.
Here the European standard population is used. Age-standardized mortality rates enable comparisons to be made between populations that have different age structures. More details on methods of standardizing mortality rates can be found in [21], [22]. The mortality rates considered are based on individuals up to 65 years of age.
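As an illustration of why standardization makes populations with different age structures comparable, the sketch below computes a directly standardized rate. It assumes the direct method; the function name, the three age bands, the weights and the death counts are all hypothetical and are not the European standard population or the data of this study.

```python
def crude_rate(deaths, population, per=100_000):
    """Crude mortality rate: all deaths over the whole population."""
    return per * sum(deaths) / sum(population)

def age_standardized_rate(deaths, population, standard_weights, per=100_000):
    """Directly standardized rate.

    Age-specific rates (deaths / population in each age band) are applied to
    a standard population; the expected deaths are divided by its total size.
    """
    if not (len(deaths) == len(population) == len(standard_weights)):
        raise ValueError("one value per age band is required in every input")
    expected = sum(d / p * w
                   for d, p, w in zip(deaths, population, standard_weights))
    return per * expected / sum(standard_weights)

# Hypothetical standard population with three age bands (illustrative weights).
standard = [50_000, 30_000, 20_000]

# Two hypothetical regions with identical age-specific rates
# but different age structures (region B is older).
deaths_a, pop_a = [2, 10, 40], [100_000, 100_000, 100_000]
deaths_b, pop_b = [1, 10, 80], [50_000, 100_000, 200_000]

crude_a = crude_rate(deaths_a, pop_a)
crude_b = crude_rate(deaths_b, pop_b)
asr_a = age_standardized_rate(deaths_a, pop_a, standard)
asr_b = age_standardized_rate(deaths_b, pop_b, standard)

print(f"crude: A={crude_a:.2f} B={crude_b:.2f}")
print(f"standardized: A={asr_a:.2f} B={asr_b:.2f}")
```

The crude rates differ only because region B is older, whereas the standardized rates coincide because the age-specific rates are the same in both regions.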

Methodology
In this paper agglomerative hierarchical cluster analysis is applied to non-communicable diseases. The age-standardized mortality rates for each of the years from 1994 to 2012 are used as variables to perform the analysis, and the non-communicable diseases separated by gender are used as individuals. Since annual standardized mortality rates are quantitative variables, the Euclidean distance is used to measure the dissimilarity between diseases. The Euclidean distance is less sensitive to the magnitude of the values than the squared Euclidean distance used in [1]. The three aggregation criteria mentioned above were applied. Since the clustering results were very similar, only the average linkage and the complete linkage criteria are considered in this presentation.
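The remark about magnitude can be made concrete with a small sketch. The rate profiles below are hypothetical constants over the 19 years from 1994 to 2012 and are not taken from the study data; only the comparison between the two distances is of interest.

```python
import math

N_YEARS = 2012 - 1994 + 1  # 19 annual rates per disease, as in the text

def squared_euclidean(a, b):
    """Sum of squared differences between two rate profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))

# Hypothetical profiles: two low-mortality diseases differing by 1 per year,
# and two high-mortality diseases differing by 10 per year.
low_a, low_b = [2.0] * N_YEARS, [3.0] * N_YEARS
high_a, high_b = [50.0] * N_YEARS, [60.0] * N_YEARS

# How much farther apart is the high-mortality pair than the low-mortality pair?
ratio_euclidean = euclidean(high_a, high_b) / euclidean(low_a, low_b)
ratio_squared = squared_euclidean(high_a, high_b) / squared_euclidean(low_a, low_b)

print(ratio_euclidean, ratio_squared)
```

A tenfold larger yearly gap yields a distance about ten times larger with the Euclidean distance but about a hundred times larger with the squared version, so the latter lets the high-mortality diseases dominate the aggregation coefficients more strongly.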

Results
Table 3 presents, for males, the mean, standard deviation, minimum and maximum of the annual age-standardized mortality rates between 1994 and 2012 for all the diseases listed in Table 1. As shown in Table 3, between 1994 and 2012 males in Portugal died mostly from larynx / trachea / bronchi and lung malignant tumor, ischemic heart disease and cerebrovascular diseases. Note that the standard deviation of the annual age-specific standardized mortality rates for larynx / trachea / bronchi and lung malignant tumor is very low, so the mean is a good representative of the mortality rates over the years. The standard deviation of the annual age-specific standardized mortality rates for ischemic heart disease and cerebrovascular diseases is larger than in all other cases. In fact, the mortality rates for these two diseases decreased over the years, while for larynx / trachea / bronchi and lung malignant tumor they remained very similar. It can also be seen that for every other disease the standard deviation is low, so there is no great difference between the mortality rates over the years. Table 4 presents the summary statistics for females. As shown in Table 4, between 1994 and 2012 females in Portugal died mostly from breast malignant tumor, cerebrovascular diseases and ischemic heart disease. Note that the annual age-specific standardized mortality rates for breast malignant tumor decrease until 2004 and are very similar after that year. For cerebrovascular diseases the mortality rates decrease until 2006 and are very similar afterwards. For ischemic heart disease the mortality rates decrease over the years. Nevertheless, cerebrovascular diseases and ischemic heart disease seem to have a much higher mortality rate in males than in females.
In the other cases the standard deviation of the annual age-specific standardized mortality rates is very low, showing stability of the mortality rates. The application of the agglomerative hierarchical cluster algorithms gave the following results. Table 5 contains the aggregation order matrix obtained from the agglomerative hierarchical cluster analysis with the Euclidean distance and the average linkage aggregation criterion between groups. In Table 5 the columns Cluster 1 and Cluster 2 indicate the disease groups aggregating at each stage. The value of the aggregation coefficient at each step is denoted by "Aggr. Coef." and the increase in the aggregation coefficient between successive steps is denoted by "Dif.". Figure 1 presents the dendrogram associated with the Euclidean distance and the average linkage aggregation criterion.
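The way the "Dif." column is derived from "Aggr. Coef.", and how the largest increase points to a cut-off, can be sketched as follows. The function name and the coefficients are hypothetical (they are not the values of Table 5), and the sketch assumes that "Dif." is the plain difference between consecutive aggregation coefficients.

```python
def largest_jump(coefficients):
    """Locate the largest increase in a list of aggregation coefficients.

    coefficients[k] is the coefficient of stage k + 1 (stages are 1-based).
    Returns (diffs, stage_before_jump, n_clusters), where n_clusters is the
    number of clusters present just before the largest increase.
    """
    n_units = len(coefficients) + 1          # n units give n - 1 stages
    diffs = [later - earlier
             for earlier, later in zip(coefficients, coefficients[1:])]
    k = max(range(len(diffs)), key=diffs.__getitem__)
    stage_before_jump = k + 1                # last stage before the jump
    n_clusters = n_units - stage_before_jump
    return diffs, stage_before_jump, n_clusters

# Hypothetical aggregation coefficients for 6 units (5 stages).
coefs = [0.5, 0.8, 1.0, 1.3, 6.0]

diffs, stage, n_clusters = largest_jump(coefs)
print("Dif.:", [round(d, 2) for d in diffs])
print("largest increase follows stage", stage, "->", n_clusters, "clusters")
```

In this toy example the largest increase occurs between stages 4 and 5, so the partition present after stage 4 would be retained. Relative maxima of the same differences can be inspected in the same way to locate other candidate partitions.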

Agglomerative hierarchical cluster analysis with Euclidean distance and average linkage criterion
From Table 5 and Figure 1 it can be observed that the dendrogram has 39 levels. Reading the tree from top to bottom, we find six well-separated groups of diseases. The first group includes breast malignant tumor for males, other heart diseases (I30-I33) for females and males, kidney malignant tumor, esophageal malignant tumor, bladder malignant tumor and asthma for females, asthma for males, lip/mouth and pharynx malignant tumor, as well as skin malignant tumor for females, skin malignant tumor for males and liver / biliary / intrahepatic malignant tumor for females. Thus the first group includes a mixture of diseases, in which some diseases appear paired for females and males.
Figure 2 presents the dendrogram associated with the Euclidean distance and the complete linkage aggregation criterion. As pointed out above, the complete linkage and average linkage aggregation criteria gave very similar clustering results on our data. Comparing Figure 1 and Figure 2, it can be seen that both display dendrograms with 39 levels, showing six well-separated clusters of diseases, grouped in a similar order in the hierarchy. Moreover, reading the tree from top to bottom in Figure 2 and comparing each cluster with the corresponding cluster in Figure 1, we find that: the first group is the same as the one obtained with the average linkage aggregation criterion; the second group differs from the corresponding average linkage group only because it does not include pneumonia for females; the third group contains all the diseases included in the third cluster of the average linkage criterion plus pneumonia for females; the last three groups are the same as those obtained with the average linkage criterion.

Discussion
We point out again that, when analyzing and interpreting clustering results obtained from hierarchical agglomerative methods based on the Euclidean distance applied to diseases, it is expected that: 1-diseases in the same cluster will have similar mortality rates and similar behavior of the mortality rates over the years; 2-the order of magnitude of the mortality rates and their behavior over the years will differ more between diseases in different clusters than between diseases inside each cluster. Both aggregation criteria produce similar hierarchies with six well-separated clusters (differing only in the classification of pneumonia for females), which join in the same way. With the average linkage aggregation criterion, the first group is formed at level 15, the second at level 21, the third at level 30, the fourth at level 31, the fifth at level 34, and the sixth at level 37. This corresponds to some kind of chain effect in the dendrogram and indicates some order of disease severity in the way the clusters join together. The cut-off occurs at level 38, where the sixth cluster is separated from the cluster of all the other diseases. Reading the tree from top to bottom, the first three groups include the diseases with lower mortality rates and the last three bring together the diseases with the highest mortality rates. The diseases that seem to need more attention are therefore those in clusters four to six. Figures 1 and 2 show that the groups formed by diseases with lower mortality mainly occur for females, while the groups formed by diseases with higher mortality tend to be associated with male gender, so male gender appears to be a risk factor for many of the listed diseases.
The fourth cluster reveals that lip/mouth and pharynx malignant tumor, colon malignant tumor, lymphatic tissue malignant tumor and other heart diseases (I39-I52) in male gender, ischemic heart disease for females, and pneumonia for males need the same attention from health organizations in Portugal, since they are in the same cluster. This suggests that there may be common causes for these diseases that affect the male gender more than the female gender. Looking again at Figure 1 and Figure 2, it can be seen that the diseases that cause more deaths are found in the fifth and sixth clusters. In the case of breast malignant tumor and stomach malignant tumor, mortality rates have been decreasing [6], possibly due to preventive measures; however, the fifth cluster shows that stomach malignant tumor for males causes as many deaths as cerebrovascular diseases and breast malignant tumor for females, and that they cluster together with high mortality rates. Note that in the case of breast malignant tumor being a woman is a risk factor. Since diet and obesity are possible risk factors for the diseases in the fifth cluster, more preventive measures are necessary in that direction. The sixth cluster, very well separated from all the other clusters, includes ischemic heart disease, cerebrovascular diseases and larynx / trachea / bronchi and lung malignant tumor for male gender. These diseases present the largest mortality rates, showing that they deserve special attention. This cluster also appears very well separated from all the other clusters in previous work [1]. The results of the present work indicate that male gender may be a risk factor for some clusters of diseases. One explanation may be habits more associated with the male gender, such as smoking. Additionally, the role of the hormonal and metabolic characteristics of the male versus the female gender is yet to be understood.
The results confirm that the mortality rates of respiratory tract malignant tumors and cerebrovascular diseases are the highest, and that it is necessary to reinforce preventive measures for these diseases, even though Portugal is significantly better than the mean of the high-middle Socio-Demographic Index group for ischemic heart disease and lung cancer [6]. The fact that ischemic heart diseases and cerebrovascular diseases are in the same cluster is expected. Both pathologies share underlying causes, such as the formation of atheromatous plaques and their relationship with obesity and hypercholesterolemia. The similarity of these pathologies to malignant tumors of the respiratory tract may be related to the known increase of thromboembolic events secondary to tumors, but further studies are required to understand and elucidate this result. However, risk factors such as smoking are common to the three pathologies, which may further relate them [23].

Conclusion
Public Health, as a science for studying and preventing diseases, prolonging life and improving quality of life through organized efforts and informed choices, needs to analyze the health factors of a population [24]. Unsupervised data analysis can introduce new knowledge, allowing organizations to take measures that provide the best health guarantees for the general population.
The aim of this study was to obtain a hierarchy of cluster partitions of Portuguese mortality from non-communicable diseases by agglomerative hierarchical cluster analysis, in order to identify clusters of diseases associated with different degrees of severity as well as clusters of diseases explicitly correlated with gender.
The obtained hierarchy provides six main clusters of diseases. Moving up the hierarchy, these clusters are formed sequentially, and some increasing order of disease severity appears to correspond to the increasing order of levels. The first groups include diseases with lower mortality rates, while the last ones bring together diseases with higher mortality rates.
Special attention should be given to the last three clusters, since they contain diseases with high mortality rates compared with the other clusters. Moreover, the fourth cluster mainly includes diseases in male individuals, and the sixth one, which groups the most severe diseases, refers only to males. Thus, male gender seems to be a risk factor for these two clusters of diseases. Other suggested risk factors and / or pathophysiological mechanisms that may, directly or indirectly, enhance the common development of the pathologies found in the clusters arising from this study should also be a priority for study.
Note that taking gender into account in the present study clearly improves the results obtained in previous work [1]. The clustering results might help either to identify or to search for common causes or risk factors, in order to improve preventive medicine.
One limitation of this work is that the study period does not yet include data from recent years. The work is part of a larger project, for which it was important to obtain these results. In future work, recent data will be added, as well as comparisons between countries and other multivariate data analyses (e.g., factor analysis and other cluster analysis methods, such as fuzzy hierarchy techniques). Data collection is in progress.