Supervised Machine Learning Based Medical Diagnosis Support System for Prediction of Patients with Heart Disease

Article history: Received: 20 July, 2020. Accepted: 01 September, 2020. Online: 17 September, 2020.

Abstract
Medical applications have long been one of the most important research areas. One such application is the early prediction of heart disease, especially coronary artery disease (CAD), also called atherosclerosis. A medical diagnosis support system is needed to detect atherosclerosis at an early stage in order to optimize diagnosis, avoid advanced cases, and reduce treatment costs. Previously, datasets were collected from specific medical sources and evaluated with computer applications. In this paper, we present a supervised machine learning medical diagnosis support system (MDSS) for atherosclerosis prediction that is able to acquire and learn knowledge automatically from each patient's clinical data. We used three Machine Learning (ML) classifiers for the proposed MDSS. The work relies on databases collected from the UCI repository (Cleveland, Hungarian) and on the Z-Alizadeh Sani dataset. Performance was measured using Accuracy, Recall and Precision; the F1-score and Matthews's correlation coefficient were also calculated to assess the proposed system more thoroughly. Additionally, 10-fold cross-validation was used to evaluate the proposed model, which achieved 94% as the best average accuracy. Consequently, the proposed model can be used to support healthcare and facilitate large-scale clinical diagnosis of atherosclerosis.


Introduction
As stated by the World Health Organization (WHO), heart disease, in which the heart is unable to pump oxygenated blood through the body, is one of the leading causes of death [1]. Cardiovascular Disease (CVD) takes several forms, including coronary artery disease (CAD), also called atherosclerosis. In this disease, plaque made of cholesterol, calcium and other substances accumulates in the blood vessels and coronary arteries, narrowing or blocking them. As the buildup increases, the plaque reduces blood flow in the coronary arteries, and therefore the flow to the myocardium decreases. This can cause symptoms such as angina, a pain that can be felt in the chest, shoulder, abdomen, arms, and neck. During this pain, the supply of oxygenated blood decreases; this situation is called myocardial ischemia. When the coronary artery is almost completely narrowed, the myocardial tissue dies, leading to a heart attack (myocardial infarction) [2,3].
It is therefore important to develop a medical diagnostic support system (MDSS) that automates the classification and prediction of CVD. However, medical diagnostic research requires greater precision and efficiency to make the best clinical decisions. Although classical MDSS have proven their ability to solve most diagnostic problems, they offer lower accuracy and may fail to make a correct diagnosis [4][5][6][7].
In this context, we propose a new MDSS using selected ML algorithms. The main goal is to classify and predict the patient's health condition from the principal chosen features by analyzing the heart disease databases. Atherosclerosis risk factors have been identified from the knowledge and expertise of medical experts and doctors. These risk factors are divided into uncontrollable and controllable risk factors, and their identification is based on several features. Uncontrollable atherosclerosis risk factors include family history, age and gender [3].
The remainder of this paper is structured as follows. Section II reviews related work in the literature. Section III presents and explains the proposed system; in particular, it gives the global flowchart of the proposed MDSS, the selected machine learning algorithms, and the CAD datasets used. Section IV describes the evaluation parameters used to assess our MDSS and compare its performance with similar systems. Section V details the implementation and presents the results and discussion. Section VI concludes this work and outlines some perspectives.

Related work
In this part, we present several selected works from the literature on automatic heart disease diagnosis. These works used the same well-known databases that we consider later for the performance comparison.
In [17], the authors applied neural network ensemble methods to build new models by combining the predicted values of previous models; the reported accuracy was 89.01%. In [18], the authors suggested a clinical decision support system (CDSS) using Weighted Fuzzy Rules (WFR) for predicting heart disease. They used two evaluation scenarios: the first automates the WFR generation, while the second develops a fuzzy rule-based CDSS. They tested their CDSS on the Cleveland heart disease database. Compared to the system based on a neural network, the best precision value obtained by this method is 62.35%.
In [19], the authors applied Fast Decision Tree (FDT) and C4.5 tree pruning methods. This approach aims to integrate the machine learning analysis results in different CAD databases. The outcomes showed that the classification accuracy is 78.06% which is higher than the average classification accuracy of separate datasets of 75.48%. Recently in 2017, the authors in [20] proposed a Hybrid Neural Network-Genetic (HNNG) to improve the neural network by strengthening its initial weights based on a genetic algorithm. The highest accuracy rate is 93.85% using Z-Alizadeh Sani data set and the Cleveland's heart disease database.
Other approaches have addressed the medical diagnosis of heart diseases. In [21], the authors described the performance of a CDSS for heart failure risk prediction. This system is based on two methods, the Fuzzy Analytic Hierarchy Process (Fuzzy_AHP) and an artificial neural network (ANN). The results show that, compared to the traditional ANN method, the average prediction accuracy of this method reaches 91.10%. More recently, in 2018, the authors of [22] presented the design and implementation of an MDSS for heart diseases. This system was developed using the Fuzzy_AHP method and a Fuzzy Inference System (FIS), and its output indicates the possibility of having a heart disease. These experimental results suggest that AI and ML methods perform well in the medical field. In [26], the authors used ML methods, namely Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), ANN, and K-Nearest Neighbours (KNN), to improve CAD diagnosis. The average accuracy reached is higher than 80%, and the specificity and sensitivity results lie between about 70% and 90%.
More recently, in [24], the authors developed a new method called Hybrid Feature Selection (2HFS), applying Gaussian Naive Bayes (GNB), Random Forest (RF), Decision Tree (DT) and Gradient Boosting (XGBoost) classifiers. In that study, the authors used the Nasarian CAD database and also tested the approach on the Long Beach VA, Hungarian and Z-Alizadeh Sani databases, achieving accuracies of 83.94%, 81.58% and 92.58%, respectively.
This work aims to propose a new MDSS for the diagnosis of patients with atherosclerosis. The proposed approach is based on five selected ML algorithms: ANN, RF, Adaptive Boosting (AdaBoost), DT, and XGBoost. The study simulates different configurations of these algorithms in order to evaluate the performance of the resulting models and then select the best one using performance evaluation methods. The present work extends our earlier research [13][14][15][16].

Global overview of the proposed MDSS
In this work, we propose an MDSS using ML techniques. The system is based on three supervised ML algorithms. These classifiers are applied to find the best prediction based on the chosen features. Figure 1 shows the flowchart of the proposed work using ML algorithms.

Artificial Neural Network (ANN)
The artificial neural network (ANN) is inspired by the biological neural network and imitates human neurophysiology. Researchers have integrated statistical methods and numerical analysis into neural networks to give them a mathematical model [27].
Here {x1, x2, …, xn} are the n inputs, (Wi,n) are the weights, and (yi) are the outputs of the neural network, which uses the sigmoid function as the nonlinear activation function (f(.)) of each neuron. The activation function is given by equation (1):
$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
The ANN algorithm then relies on three further equations: the network equation (2), the predicted outputs equation (3), and the error equation (4), which computes the error (ei) from the actual output (ti) and the predicted output (oi). The last step of the ANN algorithm is to check whether the stopping criterion on the error is reached, that is, whether the current error (ei+1) is smaller than the previous error and the approximation of the total error function is valid.
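As an illustration of the forward computation described above, the following minimal Python sketch assumes a single layer of sigmoid neurons without bias terms and a simple difference error e_i = t_i - o_i. These forms, the function names and the toy values are illustrative assumptions, not the implementation used in this study.

```python
import math


def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights):
    """One output per neuron: o_i = f(sum_n W[i][n] * x[n]) (no bias, by assumption)."""
    return [sigmoid(sum(w * xn for w, xn in zip(row, x))) for row in weights]


def errors(targets, outputs):
    """Per-output error e_i = t_i - o_i (assumed form of the error equation)."""
    return [t - o for t, o in zip(targets, outputs)]


# Hypothetical toy values: three inputs, two output neurons.
x = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3],
     [0.0, 0.0, 0.0]]
o = forward(x, W)
e = errors([1.0, 0.0], o)
print("outputs:", o)
print("errors:", e)
```

In a training loop, the weights would be adjusted until the error stops decreasing, which corresponds to the stopping check described in the text.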

Adaptive Boosting (AdaBoost)
Adaptive Boosting [28,29], also known as AdaBoost, is an ML algorithm proposed by Yoav Freund and Robert Schapire. It can be used in combination with many other ML algorithms to improve performance in binary classification. The structure of AdaBoost can be briefly described as follows.
AdaBoost trains learners sequentially. For each learner (t), AdaBoost calculates the weighted classification error using equation (5):
$$\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} \, \mathbb{I}\big(y_n \neq h_t(x_n)\big) \tag{5}$$
where (yn) is the true class label, (xn) is the predictor vector of observation (n), (ht) is the hypothesis (learner prediction), (d n (t)) is the weight of observation (n) at step (t), and (II) is the indicator function.
AdaBoost then increases the weight of each misclassified observation and reduces the weight of each correctly classified observation.
After the training phase, AdaBoost computes the prediction using equation (6):
$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \tag{6}$$
with:
$$\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t} \tag{7}$$
where (αt) are the weights of the weak hypotheses in the ensemble.
The AdaBoost training step can be viewed as the minimization of the exponential loss given by equation (8):
$$\sum_{n=1}^{N} w_n \exp\big(-y_n f(x_n)\big) \tag{8}$$
where $y_n \in \{-1, +1\}$ and $w_n$ are the observation weights, normalized to sum to 1.
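The boosting procedure described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the weak learners are one-dimensional decision stumps found by exhaustive search, labels are coded as -1 and +1, and the toy data and function names are hypothetical rather than taken from this study.

```python
import math


def stump_predict(x, threshold, sign):
    """Weak learner: predict `sign` when x > threshold, otherwise -sign."""
    return sign if x > threshold else -sign


def best_stump(xs, ys, d):
    """Return (weighted error, threshold, sign) of the stump with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, d)
                      if stump_predict(x, thr, sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` stumps sequentially; returns a list of (alpha, threshold, sign)."""
    n = len(xs)
    d = [1.0 / n] * n  # observation weights, initially uniform
    ensemble = []
    for _ in range(rounds):
        eps, thr, sign = best_stump(xs, ys, d)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, thr, sign))
        # Misclassified observations gain weight, correct ones lose weight.
        d = [w * math.exp(-alpha * y * stump_predict(x, thr, sign))
             for x, y, w in zip(xs, ys, d)]
        total = sum(d)
        d = [w / total for w in d]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote f(x) = sum_t alpha_t * h_t(x)."""
    score = sum(a * stump_predict(x, thr, s) for a, thr, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this toy set each stump misclassifies at least two points, while the weighted vote of three stumps separates all six, which illustrates why reweighting the misclassified observations helps.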

Decision Tree (DT)
The Decision Tree (DT) is a supervised machine learning algorithm that is commonly used in binary classification problems. The objective is to construct a set of choices in the graphical form of a tree, consisting of nodes and branches, based on each collected attribute [30].
The decision tree algorithm relies on the following equations. The probability (P(T)) that an observation (j) is in node (T) is estimated by equation (9):
$$P(T) = \sum_{j \in T} w_j \tag{9}$$
where (wj) is the weight of observation (j). The information gain (G(T, X)) used at each node of the tree to classify the input data is defined by equation (10):
$$G(T, X) = E(T) - E(T, X) \tag{10}$$
where $E(T, X)$ is the entropy that remains after splitting node $T$ on attribute $X$.
The entropy (E(T)) is defined by equation (11):
$$E(T) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{11}$$
where (pi) is the probability of class i, with i = 1, …, c, and (c) is the total number of classes. In the case of binary classification, c = 2.
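To make the entropy and information gain computations concrete, the following short Python sketch evaluates both on a toy node. It assumes uniform observation weights and a categorical attribute; the function names and data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy E(T) = -sum_i p_i * log2(p_i) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """G(T, X) = E(T) - E(T, X), where E(T, X) is the size-weighted entropy of the branches."""
    n = len(labels)
    branches = {}
    for value, label in zip(feature_values, labels):
        branches.setdefault(value, []).append(label)
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - remainder


# Hypothetical node with two healthy (0) and two diseased (1) subjects.
labels = [0, 0, 1, 1]
print(entropy(labels))                                   # balanced binary node
print(information_gain(labels, ["a", "a", "b", "b"]))    # attribute separates the classes
print(information_gain(labels, ["a", "b", "a", "b"]))    # attribute carries no information
```

An attribute that separates the classes perfectly recovers the full entropy of the node as gain, while an uninformative attribute yields a gain of zero.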

Cleveland dataset
The Cleveland dataset was collected by David Aha for the machine learning repository of the University of California, Irvine (UCI) [31]. It is obtained from the Cleveland Clinic Foundation database. This database consists of 76 attributes, of which only 14 are commonly used in published research: 13 inputs and one output. In this work, only 270 of the 303 patient records are used, owing to missing values. The dataset contains 54% healthy subjects and 46% CAD patients. Healthy subjects are labeled 0 and unhealthy ones are labeled 1. Table 1 summarizes the Cleveland features used; its last row is attribute 14, num (diagnosis of heart disease): healthy (0), patient has heart disease (1).

Hungarian dataset
The Hungarian dataset was collected by Andras Janosi at the Hungarian Institute of Cardiology, Budapest [31]. This database contains 10 features. Of the 294 samples in the dataset, 262 were used; the remaining 32 samples were rejected because of missing values. The Hungarian samples are divided into 62.21% healthy subjects and 37.78% subjects with heart disease.

Z-Alizadeh Sani dataset
The Z-Alizadeh Sani dataset was randomly collected at Tehran's Shaheed Rajaei Cardiovascular, Medical and Research Centre. This dataset was built for CAD diagnosis and contains 303 samples with 54 features for each patient. The features cover the patient's physical examinations, electrocardiograms (ECGs), laboratory tests, demographic characteristics, and symptoms [20,25].
Alizadehsani et al. [25] classified the patients into two output classes: 71% of the patients suffered from CAD and 29% were healthy. This dataset also contains stenosis prediction outputs for three coronary arteries, i.e., LAD, RCA, and LCX. In this study, we manually selected 17 features as the most important ones according to the atherosclerosis risk factors [2,32].

Features selection
During the preprocessing step, which consists essentially of cleaning the dataset (ignoring inputs with missing values), the prediction inputs are derived from the features of each database. Atherosclerosis risk factors have been identified from the expertise of medical experts and doctors, and are divided into uncontrollable and controllable risk factors. The suitable features are chosen from each dataset as input data based on the related literature [2,30].
The corresponding output used for prediction is the binary label "Diagnosis of heart disease", which reflects the actual condition of the patient considered. The two classes are: the patient has atherosclerosis, or the patient is healthy. For the UCI databases (Cleveland and Hungarian), a value of 0 means that there is no atherosclerotic disease, i.e., the reduction in vessel diameter is less than 50%, while a value of 1 indicates the presence of atherosclerotic disease, i.e., the diameter is reduced by more than 50%. For the Z-Alizadeh Sani database, the output is likewise divided into two category labels: category 0 means that there is no atherosclerotic disease (normal), and category 1 indicates the presence of atherosclerotic disease (CAD).
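The cleaning and labelling steps of this section can be sketched as follows. This minimal Python example assumes that missing values are marked with "?" (as in the UCI heart disease files), that the target is the last column, and that any target value above 0 denotes disease; the function name and the sample rows are hypothetical.

```python
MISSING = "?"  # assumption: marker used for missing values in the raw files


def clean_and_label(rows):
    """Drop rows with a missing field and map the last column to 0 (healthy) or 1 (disease)."""
    cleaned = []
    for row in rows:
        if MISSING in row:
            continue  # ignore records with missing values
        *features, target = (float(v) for v in row)
        cleaned.append((features, 1 if target > 0 else 0))
    return cleaned


# Hypothetical raw records: two features followed by the diagnosis column.
raw = [
    ["63", "1", "0"],
    ["67", "?", "2"],
    ["41", "0", "3"],
]
print(clean_and_label(raw))
```

The second record is discarded because of its missing field, and the third is labeled 1 because its diagnosis value is greater than 0.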

Performance evaluation metrics
In this work, we used several performance metrics to evaluate our proposed MDSS for atherosclerosis. These metrics are as follows:
• Recall, also called the true positive rate (TPR) or sensitivity, measures the proportion of patients with the disease who are correctly identified.
$$\text{Recall} = \frac{TP}{TP + FN}$$
• Precision, or positive predictive value, is the proportion of positive diagnostic results that are true positives.
$$\text{Precision} = \frac{TP}{TP + FP}$$
• The Matthews Correlation Coefficient (MCC) is a quality metric for binary classification in machine learning, as in our case.
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
• The F1-score (FS) is the harmonic mean of precision and recall.
$$FS = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here FN, TP, FP and TN are respectively the false negatives, true positives, false positives and true negatives. In the ML field, the confusion matrix is also known as an error matrix. It represents the performance of the algorithm by crossing two types of information: the predicted values and the actual values. Table 2 explains the confusion matrix for binary classification [33,34].
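The metrics of this section can be computed directly from the four confusion-matrix counts, as in the following Python sketch. The function name and the counts are hypothetical, and the sketch assumes that every denominator except that of the MCC is non-zero.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, F1-score and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # convention: 0 when undefined
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical counts for 100 test subjects.
m = binary_metrics(tp=40, fp=10, fn=10, tn=40)
print(m)
```

With these counts the accuracy, recall, precision and F1-score all equal 0.8 while the MCC is 0.6, which shows that the MCC penalizes errors more strongly than the other scores.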

Simulation results and performance comparison
To demonstrate the effectiveness of our proposed classifiers and predictors, many experiments and simulations were performed to empirically identify the best ML models. Three sets of atherosclerosis data are used, and the experimental results obtained with various performance evaluation methods are summarized in tables to assess the effectiveness of the proposed method. The obtained results are also compared with previous work.

ML design and implementation
For the ANN technique, as in any empirical work, many simulations were conducted to select the best hyperparameters. As will be shown later, the best performance is reached with the architecture configuration presented in Table 3.
The learning parameters and neural network architecture used for each dataset in this study relate to the hidden layer, the number of neurons, the value of the learning rate, and the type of activation function in each layer. In the DT algorithm, the first step is to calculate the entropy of the output (the target) using equation (11). The next step is to obtain the entropy of each branch. In the last step, the dataset is divided along its branches and the process is repeated on every branch until all data is classified.
For the second algorithm, AdaBoost, we calculate the weighted classification error of each learner using equation (5). We then reduce the weight of each observation correctly classified by learner t. After training is finished, we calculate the prediction for new data using equation (6). Training as a whole amounts to minimizing the exponential loss given by equation (8).

Classification and prediction performance evaluation results on testing datasets
The classification techniques described above were implemented to identify subjects with and without heart disease. Those algorithms were compared using standard evaluation metrics: accuracy (ACC), precision, recall, F1-score (FS), Matthews's correlation coefficient (MCC), confusion matrix and Receiver Operating Characteristic curve (ROC).

Confusion matrix results
The MDSS for atherosclerosis is made based on three ML techniques: ANN, AdaBoost and DT algorithms. To validate our model, three databases were used: Cleveland, Hungarian, and Z-Alizadeh Sani database consisting of 270, 262 and 303 patients' records respectively as shown in table 4.
Each database is split into two datasets using interleaved indices: 80% is used for training and 20% for testing. The three classifiers were then trained and compared to select the best one. Table 4 shows the confusion matrices obtained with the ANN, AdaBoost and DT algorithms on the 835 patient records collected from the Cleveland, Hungarian and Z-Alizadeh Sani databases.
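One way to realize an interleaved 80/20 split is to send every fifth record to the test set, as in the following Python sketch. The exact interleaving scheme used in this study is not specified, so the rule below and the function name are assumptions for illustration.

```python
def interleaved_split(samples, test_every=5):
    """Send every `test_every`-th sample to the test set and the rest to the training set."""
    train, test = [], []
    for i, sample in enumerate(samples):
        if i % test_every == test_every - 1:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


# Hypothetical stand-in for the 270 Cleveland records.
records = list(range(270))
train, test = interleaved_split(records)
print(len(train), len(test))
```

Unlike a contiguous split, interleaving spreads the test records evenly across the file, which helps when the records are stored in a non-random order.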

Performance metrics results
During the testing phase, the test data are given to the proposed classification system to classify and predict patients with atherosclerosis. The results are calculated using the standard performance metrics: ACC, precision, recall, FS, and MCC. To assess our atherosclerosis prediction system more thoroughly, two further machine learning metrics are used: FS as a test of binary classification accuracy and MCC as a measure of binary classification quality.
The FS and MCC values should be close to 1 for the system to be considered efficient. Table 5 shows the evaluation metrics obtained for the ANN, AdaBoost and DT algorithms on the Cleveland, Hungarian, and Z-Alizadeh Sani databases.

Cross-validation
In this section, we present the analysis of system performance using the k-fold cross-validation technique. The databases are divided into k subsets (folds). For each validation run, one fold is used as the test dataset and the remaining folds are used as the training dataset.
The principle of cross-validation is to run a given model several times, in our case ten times (K = 10), and then to average the test results of these K experiments. Obviously, this requires more computing time, since K separate learning experiments are conducted, but the evaluation of the learning algorithm is more accurate. In other words, all the data is used for training and all the data is used for testing. In this case, we use the interleaved division method to split each database into two parts: 80% for the training set and 20% for the test set.
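The K-fold procedure described above can be sketched in Python as follows. The fold assignment by interleaved indices and the `evaluate` callback, which is expected to train a model on the training indices and return its accuracy on the test indices, are hypothetical choices made for illustration.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # interleaved folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by `evaluate(train_idx, test_idx)` over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Hypothetical evaluation function: a real one would train a classifier on
# train_idx and return its accuracy on test_idx.
def dummy_evaluate(train_idx, test_idx):
    return len(test_idx) / 20


print(cross_validate(20, dummy_evaluate, k=10))
```

Because every sample appears in exactly one test fold, the averaged score uses all the data for testing while each individual model is still evaluated on data it has not seen.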
For the Cleveland training dataset (graph shown in Figure 2), the ANN algorithm achieved an average accuracy of 91.41%, compared with 72.22% and 81.48% for AdaBoost and DT, respectively. Similarly, for the Z-Alizadeh Sani training dataset, the ANN algorithm achieved a higher average accuracy (94.00%) than the other algorithms: it exceeds AdaBoost (85.25%) and DT (82.00%) by about 9 and 12 percentage points, respectively (graph shown in Fig. 2).

Receiver Operating Characteristic Curve (ROC)
To assess how well the classifiers separate healthy subjects from subjects with CAD, ROC assessment indicators are used to check the performance of our classifiers. For each classifier, the ROC analysis applies thresholds in the range [0, 1] to the classifier output.
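The threshold sweep behind a ROC curve can be sketched in Python as follows. The sketch assumes classifier scores in [0, 1] with labels coded 0 (healthy) and 1 (CAD), at least one subject in each class, and a trapezoidal estimate of the area under the curve; the function names and toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier outputs for two CAD patients and two healthy subjects.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
pts = roc_points(scores, labels)
print(pts)
print(auc(pts))
```

Each threshold yields one operating point, so the curve shows the trade-off between recall and false alarms rather than a single accuracy value.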
In Figure 5, the ROC analysis for the Cleveland testing dataset shows that the ANN gives better classification performance than the AdaBoost and DT algorithms, with a recall of 80% and a precision of 70.36%. For the Hungarian database (Figure 6), the ROC analysis shows that the ANN reached the best precision, 85%, with an accuracy of 90%. For the Z-Alizadeh Sani database (Figure 7), the ROC analysis shows that the ANN method reached the best recall, 98%, while the AdaBoost and DT methods reached recalls of 90.18% and 93.18%, respectively. Overall, the best ROC results were obtained with our proposed ANN method.

Discussion and performance comparison
To assess the effectiveness of our proposed method, we conducted experiments on the aforementioned Cleveland, Hungarian and Z-Alizadeh Sani databases. We compare our results with previous work in Tables 6, 7 and 8. These tables show that our proposed system has better prediction performance than the other classifiers.
Table 6 compares the accuracy of the proposed system with previous work on the Cleveland database. On this dataset, the ANN method of the proposed system achieves a higher accuracy of 91.41%, while the accuracy of previous systems such as Weighted Fuzzy, C4.5, FDT, Neural Network, neural network ensemble and HNNG is 62.35%, 78.54%, 77.55%, 84.80%, 89.01% and 89.40%, respectively.
For the Hungarian database, the proposed system achieves a better accuracy of 90.00%, as shown in Table 7, which lists the classification accuracy for this database. Here, the previous systems obtained lower values: the weighted fuzzy rules method reached only 46.93% and HNNG reached 87.10%. Similarly, for the Z-Alizadeh Sani dataset, our system obtained the best accuracy, as presented in Table 8: HNNG achieved 93.85%, compared to 94.00% for our system. As shown in Table 8, compared to previous research, our proposed system performs best for classification and prediction.

Conclusion
In this work, we proposed an MDSS for the early prediction of atherosclerosis. The proposed system is based on three ML algorithms (ANN, AdaBoost and DT) and predicts whether patients have atherosclerotic disease or not. Using clinical datasets, a total of 835 samples were obtained from the Cleveland, Hungarian and Z-Alizadeh Sani databases. The experimental results show that, compared to the other ML techniques, the ANN algorithm has better accuracy. In addition, the Accuracy, Precision, Recall and F1-score indicators and the ROC graph were used to assess the performance of the proposed algorithms. Finally, a comparative analysis was carried out between the experimental results and different methods available in the literature (such as weighted fuzzy rules, HNNG and 2HFS). Based on common performance indicators, this comparison shows that our proposed system has the highest accuracy, 94%, in predicting and classifying atherosclerosis. As future work, the proposed system will be extended with different methods and functions for other heart diseases to improve the accuracy of predictions.