Effects of Oversampling SMOTE in the Classification of Hypertensive Dataset

ARTICLE INFO
Article history: Received: 15 May, 2020; Accepted: 17 July, 2020; Online: 09 August, 2020

ABSTRACT
Hypertension, or high blood pressure, is a medical condition that can be driven by several factors. These factors or variables are needed to build a classification model of the hypertension dataset. In the construction of classification models, class imbalance problems are often found, and oversampling is one way to handle them. This research aims to obtain the best classification model by implementing the Support Vector Machine (SVM) method to get the optimal level of accuracy. The dataset consists of 8 features and a label with two classes: hypertensive and non-hypertensive. Overall test performance is then compared between SVM combined with SMOTE and SVM without SMOTE. The results show that SMOTE improves the accuracy of the model on unbalanced data to 98%, compared to 91% without SMOTE.


Introduction
Hypertension, or high blood pressure, is a disease that can lead to death. Based on a report from the Center of Data and Information, Indonesian Ministry of Health, hypertension is still a major health concern, with a prevalence of 25.8%. On the other hand, the use of database technology in the health sector continues to grow rapidly. The amount of data stored in databases is increasing and requires further processing to produce valuable information and knowledge [1].
The field of science in which data are processed into knowledge is called data mining. Data mining is a technique in which a machine or computer learns to analyze data and extract knowledge automatically. Classification is one of the basic functions in data mining; it is a technique that can be used to predict the group membership of data. The process consists of finding a model (or function) that describes and distinguishes classes of data or concepts [2]. Support Vector Machine (SVM) is a classification method that maps nonlinear input data to a higher-dimensional space where the data can be separated linearly, thus providing strong classification or regression performance [3].
SVM works on the principle of Structural Risk Minimization (SRM). SRM in SVM is used to guarantee an upper bound on the generalization error by controlling the capacity (flexibility) of the learned hypothesis [4]. SVM has been used extensively to classify medical problems, such as diabetes and pre-diabetes [5], breast cancer [6] and heart disease [7]. Based on a previous study on a liver disease dataset, SVM is known as the better classifier compared to naïve Bayes.
Meanwhile, unbalanced data are often encountered and reduce data quality in the model construction process. The imbalance lies in the unequal proportion of instances across categories, with a large difference between them, so that majority and minority data classes are formed. This condition causes the classification model to predict the minority class poorly, even though this class is still important as an object of the modeling analysis [8]. This problem is found in the dataset used in this research, where the number of non-hypertensive instances is far greater than the number of hypertensive instances.
Unbalanced data need to be handled before the modeling stage in order to develop a classification model with the highest degree of accuracy for all classes. Two techniques for tackling unbalanced data are the Synthetic Minority Oversampling Technique (SMOTE) [9] and Cost Sensitive Learning (CSL) [10]. SMOTE balances the two classes by generating synthetic data for the minority class, while CSL takes into account the impact of misclassification and applies data weighting [8]. This research compares the performance of hypertensive dataset modeling using a classification method with SMOTE and without SMOTE. The main purpose of this study is to uncover the significance of applying an oversampling technique to unbalanced data by testing the hypothesis that combining the SVM classification method with SMOTE can improve the accuracy of the model.
ASTESJ ISSN: 2415-6698

Synthetic Minority Oversampling Technique (SMOTE)
The problem of data imbalance occurs when there is a large difference between the number of instances belonging to each data class. Classes with comparatively more objects are called major classes, while the others are called minor classes [11]. The use of unbalanced data in modeling affects the performance of the models obtained. Algorithms that ignore data imbalance tend to focus too much on major classes and too little on minor classes [9]. The Synthetic Minority Oversampling Technique (SMOTE) is one solution for handling unbalanced data, and it works on a different principle from previously proposed oversampling methods. Those oversampling methods increase the number of observations at random, while SMOTE increases the amount of minor class data, making it equivalent to the major class, by generating new artificial instances [12].
There are many challenges in handling imbalanced data with oversampling techniques. These problems are related to the addition of random data, which can cause overfitting [13]. The SMOTE method is an oversampling technique that has the advantage of having been applied successfully in various domains, as shown in Algorithm 1 [14].

Algorithm 1, closing steps:
9. Compute k nearest neighbors for i, and save the indices in the nnarray
10. POPULATE(N, i, nnarray)
11. end for
12. end function

Artificial (synthetic) data are generated based on the k-nearest neighbor (k-NN) algorithm. The k nearest neighbors are determined by considering the distance between data points over all features. The process of generating artificial data differs between numerical and categorical data. For numerical data, proximity is measured by Euclidean distance, while categorical data are generated based on the mode, the value that appears most often [12]. The distance between values of a categorical variable is calculated by the Value Difference Metric (VDM) formula, as follows:

$$\delta(V_1, V_2) = \sum_{i=1}^{n} \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k}$$

where
$\delta(V_1, V_2)$: distance between $V_1$ and $V_2$
$C_{1i}$, $C_{2i}$: number of occurrences of $V_1$ and $V_2$, respectively, in class $i$
$C_1$, $C_2$: total number of occurrences of $V_1$ and $V_2$
$n$: number of classes
$k$: a constant, usually set to 1
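As a rough illustration of the VDM, the sketch below compares, class by class, how often each of two categorical values occurs. The function name `vdm_distance`, the toy gender and label lists, and the default exponent `k=1` are assumptions made for this example and are not taken from the study's data.

```python
from collections import Counter, defaultdict

def vdm_distance(values, labels, v1, v2, k=1):
    """Value Difference Metric between two values of one categorical feature.

    Assumes both v1 and v2 occur at least once in `values`.
    """
    totals = Counter(values)              # total occurrences of each value
    per_class = defaultdict(Counter)      # occurrences of each value per class
    for value, label in zip(values, labels):
        per_class[value][label] += 1
    return sum(
        abs(per_class[v1][c] / totals[v1] - per_class[v2][c] / totals[v2]) ** k
        for c in set(labels)
    )

# Hypothetical toy data: gender as the categorical feature, 1 = hypertensive.
gender = ["M", "M", "M", "M", "F", "F", "F", "F"]
label = [1, 1, 1, 0, 1, 0, 0, 0]
d_mf = vdm_distance(gender, label, "M", "F")   # |3/4 - 1/4| + |1/4 - 3/4|
d_mm = vdm_distance(gender, label, "M", "M")   # identical values
print(d_mf, d_mm)
```

Two values that are distributed identically over the classes get distance zero, so the metric treats them as interchangeable when searching for neighbors.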

Proposed Research Stages
This research applies a quantitative approach to a case study of hypertension. Overall, the steps involved consist of three parts: (1) data pre-processing, (2) building the model and (3) evaluating model performance. The methods used are SVM, ELM, oversampling and undersampling. The performance of the models is compared with each other. The stages of the methodology are described below.

Hypertension Dataset
This research is carried out using a hypertension dataset (Table 1). The dataset is pre-processed before modeling.

Feature Selection
In the pre-processing stage, feature selection or extraction is carried out to obtain the most influential features, those that improve the performance or accuracy of the classification model. Originally, the hypertension dataset contains 9 features. Selection is implemented by removing the feature "SEQN" in the first column of the dataset, which has no effect and only records the row order. The label or class for this dataset is given by the "HYPCLASS" variable.
In the problem of feature selection we wish to minimize equation [15] over $\sigma$ and $\alpha$. The support vector method attempts to find the function from the set $f(x, w, b) = w \cdot \Phi(x) + b$ that minimizes generalization error. We first enlarge the set of functions considered by the algorithm to $f(x, w, b, \sigma) = w \cdot \Phi(x * \sigma) + b$, where $x * \sigma$ denotes the elementwise product. Note that the mapping $\Phi_\sigma(x) = \Phi(x * \sigma)$ can be represented by choosing the kernel function $K$ in equations [16]:

$$K_\sigma(x, y) = K(x * \sigma, y * \sigma) = \Phi_\sigma(x) \cdot \Phi_\sigma(y)$$

for any $K$. Thus for these kernels the bounds in the theorems still hold. Hence, to minimize $T(\sigma, \alpha)$ over $\alpha$ and $\sigma$ we minimize the wrapper functional $T_{wrap}$, where $T_{alg}$ is given by the equations choosing a fixed value of $\sigma$ implemented by the kernel $K_\sigma$. Using the radius-margin bound, one minimizes over $\sigma$:

$$R^2 W^2(\sigma) = R^2(\sigma)\, W^2(\alpha^0, \sigma)$$

where the radius $R$ for kernel $K_\sigma$ is computed by maximizing

$$R^2(\sigma) = \max_{\beta} \sum_{i} \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j)$$

subject to $\sum_i \beta_i = 1$, $\beta_i \geq 0$, $i = 1, \ldots, \ell$, and $W^2(\alpha^0, \sigma)$ is defined by the maximum of the SVM dual functional using kernel $K_\sigma$. In a similar way, one can minimize the span bound over $\sigma$ instead of $R^2 W^2(\sigma)$.
Finding the minimum of $R^2 W^2$ over $\sigma$ requires searching over all possible subsets of $n$ features, which is a combinatorial problem. To avoid this problem, classical methods of search include greedily adding or removing features (forward or backward selection) and hill climbing. All of these methods are expensive to compute if $n$ is large.
As an alternative to these approaches we suggest the following method: approximate the binary valued vector $\sigma \in \{0,1\}^n$ with a real valued vector $\sigma \in \mathbb{R}^n$. Then, to find the optimum value of $\sigma$ one can minimize $R^2 W^2$, or some other differentiable criterion, by gradient descent. The derivative of the criterion follows from the product rule:

$$\frac{\partial R^2 W^2(\sigma)}{\partial \sigma_k} = R^2(\sigma) \frac{\partial W^2(\alpha^0, \sigma)}{\partial \sigma_k} + W^2(\alpha^0, \sigma) \frac{\partial R^2(\sigma)}{\partial \sigma_k}$$

We estimate the minimum of $T(\sigma, \alpha)$ by minimizing $R^2 W^2(\sigma)$ in the space $\sigma \in \mathbb{R}^n$ using these gradients, with the following extra constraint which approximates integer programming:

$$R^2 W^2(\sigma) + \lambda \sum_{i} (\sigma_i)^p \qquad (7)$$

subject to $\sum_i \sigma_i = m$, $\sigma_i \geq 0$, $i = 1, \ldots, n$.
For large enough $\lambda$, as $p \to 0$ only $m$ elements of $\sigma$ will be nonzero, approximating the optimization problem $T(\sigma, \alpha)$. One can further simplify computations by considering a stepwise approximation procedure to find $m$ features. To do this one can minimize $R^2 W^2(\sigma)$ with $\sigma$ unconstrained. One then sets the $q \ll n$ smallest values of $\sigma$ to zero, and repeats the minimization until only $m$ nonzero elements of $\sigma$ remain. This can mean repeatedly training an SVM just a few times, which can be fast.
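To make the role of the scaling vector concrete, the sketch below multiplies both inputs elementwise by a feature-scaling vector before applying a kernel, and shows one stepwise pruning step. The RBF kernel, its `gamma` value, the helper names and the sample vectors are assumptions chosen for illustration; the minimization of the criterion itself is not implemented here.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Plain RBF kernel, used here only as an example base kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def scaled_kernel(x, y, sigma, kernel=rbf_kernel):
    """Apply the base kernel to inputs scaled elementwise by sigma."""
    xs = [a * s for a, s in zip(x, sigma)]
    ys = [b * s for b, s in zip(y, sigma)]
    return kernel(xs, ys)

def zero_smallest(sigma, q):
    """One stepwise step: set the q smallest nonzero entries of sigma to zero."""
    nonzero = sorted((s, i) for i, s in enumerate(sigma) if s > 0)
    pruned = list(sigma)
    for _, i in nonzero[:q]:
        pruned[i] = 0.0
    return pruned

x = [1.0, 5.0, 2.0]
y = [1.5, -3.0, 2.0]

# A zero entry in sigma removes that feature from the kernel entirely.
k_drop = scaled_kernel(x, y, sigma=[1, 0, 1])
k_reduced = rbf_kernel([1.0, 2.0], [1.5, 2.0])

pruned = zero_smallest([0.9, 0.1, 0.4, 0.0], q=1)
print(k_drop, k_reduced, pruned)
```

With a binary scaling vector the scaled kernel is exactly the base kernel on the selected feature subset, which is why relaxing the vector to real values gives a differentiable stand-in for subset search.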

Results
Hypertensive data are classified into hypertensive (1) and non-hypertensive (0) classes. 80% of the available data is allocated to the training set and the remaining 20% to the test set. Data validation uses the split validation method, with the split into training and test data repeated three times. A detailed description is given in the following subsections.

Data Visualization
Data visualization is a technique used to communicate data or information in the form of visual objects. In this section, visualizations produced with Python are displayed as bar graphs. Figure 2 shows the original distribution of the two classes in the hypertension dataset, which consists of 7399 hypertensive instances and 17035 non-hypertensive instances. Based on the graph, the hypertensive class has comparatively fewer objects than the non-hypertensive class.

Figure 2. Proportion between Hypertensive and Non-Hypertensive
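The class counts reported for Figure 2 can be turned into shares and an imbalance ratio with a few lines of arithmetic. The variable names below are illustrative; only the two counts come from the text.

```python
# Class counts as reported for the hypertension dataset.
counts = {"hypertensive": 7399, "non-hypertensive": 17035}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Majority-to-minority ratio: how many major-class instances per minor one.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

print(total, round(imbalance_ratio, 2), round(majority_baseline, 3))
```

Because a model that always answers "non-hypertensive" would already be right on roughly 70% of instances, accuracy alone can hide poor performance on the minority class.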
Figure 3 shows that the hypertensive patients in the sample are predominantly male, whereas the non-hypertensive data are dominated by females.

SVM Model and K Fold Cross Validation
The pre-processed data are then used to build the model with the Support Vector Machine (SVM) method. The SVM classifier, built for the two classes hypertensive and non-hypertensive, is validated with the K-Fold Cross Validation method with k = 10. Accuracy under cross validation rises slightly from the initial experiment: the SVM classifier reaches its highest accuracy, 95%, in the 5th iteration (average value = 90.2%). The results of assessing the SVM classification method with 10-Fold Cross Validation are presented in Table 1.
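The mechanics of K-Fold Cross Validation can be sketched without any machine learning library: shuffle the sample indices, cut them into k folds, and let each fold serve once as the test set. The helper names, the fixed seed and the `fit_and_score` callable are hypothetical; in practice that callable would train the SVM on the training indices and return its accuracy on the test indices.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def cross_validate(fit_and_score, n_samples, k=10):
    """Return the per-fold scores and their average."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)

# Split the 24434 instances of the dataset into 10 folds.
splits = list(k_fold_indices(24434, k=10))

# Placeholder scorer (not a real model): reports the training share per fold.
scores, mean_score = cross_validate(
    lambda tr, te: len(tr) / (len(tr) + len(te)), n_samples=100, k=10
)
print(len(splits), mean_score)
```

Reporting both the best fold and the mean, as the text does with 95% and 90.2%, shows how much the score varies between folds.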

SMOTE Implementation
Based on Figure 3 in the previous sub-section, the data are imbalanced: the hypertensive class (minor class) has comparatively fewer objects than the non-hypertensive class (major class). In this experiment, oversampling is carried out on the minority class with the Synthetic Minority Over-sampling Technique (SMOTE), a popular method for dealing with class imbalance [9]. This technique synthesizes new samples from the minor class to balance the dataset. New instances of the minor class are obtained by forming convex combinations of neighboring instances. With n_sample = 12000, n_features = 2, n_split = 7 and n_repeats = 4, the accuracy obtained from the SVM model increases to 98%. Figure 5 shows the comparison of class 0 (non-hypertensive) and class 1 (hypertensive) after implementing the SMOTE method.

Figure 5. Hypertensive dataset experiment with SMOTE
The SMOTE algorithm works as follows.
Step 1: Setting the minority class set $A$, for each $x \in A$, the k-nearest neighbors of $x$ are obtained by calculating the Euclidean distance between $x$ and every other sample in set $A$.
Step 2: The sampling rate $N$ is set according to the imbalanced proportion. For each $x \in A$, $N$ examples (i.e. $x_1, x_2, \ldots, x_N$) are randomly selected from its k-nearest neighbors, and they construct the set $A_1$.
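Steps 1 and 2, together with the convex-combination idea mentioned earlier, can be sketched in plain Python for numerical features. This is a simplified illustration, not the implementation used in the experiment: the function name `smote`, its parameters, the fixed seed, the choice to draw neighbors with replacement, and the toy points are all assumptions.

```python
import math
import random

def smote(minority, n_new_per_sample=1, k=5, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest neighbors of x within the minority set (Euclidean).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 2: draw N neighbors at random (with replacement in this sketch).
        for _ in range(n_new_per_sample):
            neighbor = rng.choice(neighbors)
            # Convex combination: a random point on the segment x -> neighbor.
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic

# Hypothetical minority samples: the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new_per_sample=2, k=2)
print(len(new_points))
```

Every synthetic point lies on a segment between two existing minority samples, so the new data stay inside the region the minority class already occupies rather than duplicating points exactly.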

Performance of SMOTE and Non-SMOTE Classification Results
A confusion matrix is used to measure the performance of SVM with SMOTE and of SVM without SMOTE in classifying hypertension. Accuracy, precision and recall are then calculated by averaging their values over the classes, as shown in Table 3. From these results, the effect of SMOTE on the performance of the SVM classification algorithm can be analyzed. The classification performance on the hypertensive data of the SVM classifier with 10-Fold Cross Validation, with and without SMOTE, is shown graphically in Figure 6. The combination of SVM and SMOTE outperformed SVM classification without SMOTE. The average accuracy of the SVM classifier with SMOTE is higher (98%) than that of the SVM classifier without SMOTE (91%).
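The class-averaged metrics can be computed directly from confusion matrix counts, as the sketch below shows. The function names and the toy label lists are hypothetical and do not reproduce the values in Table 3; accuracy is computed over all predictions, and precision and recall are averaged over the two classes.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def macro_metrics(y_true, y_pred, classes=(0, 1)):
    """Overall accuracy plus precision and recall averaged over the classes."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        tp, fp, fn, _ = confusion_counts(y_true, y_pred, c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n

# Hypothetical labels: 0 = non-hypertensive, 1 = hypertensive.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
accuracy, precision, recall = macro_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Averaging per class gives the minority class the same weight as the majority class, which makes the comparison between the SMOTE and non-SMOTE models less sensitive to the class proportions.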

Conclusion
The SVM classification method with K-Fold Cross Validation resulted in an average accuracy of 90.2%. SVM is known as a classification method with good prediction results. Based on the experiment conducted in this research, the performance of the resulting model improves after the imbalanced dataset is processed with SMOTE, reaching an average accuracy of 98%.