Medical imbalanced data classification

Imbalanced datasets are a common problem in health applications. In medical data classification, we often face an imbalanced number of data samples, where at least one of the classes constitutes only a very small minority of the data. At the same time, this imbalance is a difficult problem for most machine learning algorithms. Many works have dealt with the classification of imbalanced datasets. In this paper, we propose a learning method based on a cost-sensitive extension of the Least Mean Square (LMS) algorithm that penalizes errors of different samples with different weights, together with some rules of thumb to determine those weights. After the balancing phase, we apply different techniques (Support Vector Machine [SVM], K-Nearest Neighbor [K-NN] and Multilayer Perceptron [MLP]) to the balanced datasets. We also compare the results obtained before and after applying the balancing method. We obtain the best results compared with the literature, with a classification accuracy of 100%.


Introduction
Learning from imbalanced data has attracted a significant amount of interest in recent years. This is because, in the real world, imbalanced data exist in many applications, such as fault diagnosis [1], medical diagnosis [2], intrusion detection [3,4], text classification [5,6], financial fraud detection [7], data stream classification [8], and so on. In those applications, there are often one or more minority classes with very few samples compared with the other classes, and most of the time the "small" classes are more important than the "large" ones. Because of the skewed data distribution in imbalanced learning problems, it is often difficult to obtain good performance with traditional classifiers, which assume a balanced distribution of classes and assign an equal misclassification cost to each class. As a result, traditional classifiers tend to be overwhelmed by the majority classes and ignore the minority ones, which is not acceptable in many real applications [9,10].
Most previous works focused on binary classification problems [11]. Others [12,13] also tried to handle multiclass data by defining the class with a small number of samples as the minority class while merging the remaining data into the majority class. Although the minority class can then be recognized by classifiers, the artificial majority class might be more likely to be misclassified. Imbalanced data are especially complex in multi-class problems, since the sizes of some classes are the same or similar, which makes it harder to select the minority class artificially. Imbalanced learning problems can be grouped into two categories: absolute imbalance and relative imbalance [14]. Absolute imbalance occurs when the minority instances are significantly scarce and implicit, whereas a dataset with relative imbalance can show an explicit data distribution but still contains few minority examples. Rare instances are characteristic of typical imbalance, where the limited representative data make learning difficult regardless of between-class imbalance. The other form of imbalance is within-class imbalance, which concerns the distribution of representative data over the sub-concepts within a class. The within-class imbalance problem seems to be more difficult than datasets whose concepts share similar characteristics [15,16].
The works reviewed in Section 2 show that most techniques in the literature have not found fully effective ways to address minority data.
ASTESJ ISSN: 2415-6698
In this paper, a learning method based on a cost-sensitive extension of the Least Mean Square (LMS) algorithm is proposed to solve imbalanced learning problems. It penalizes errors of different samples with different weights, which increases the classification rate. To validate our empirical study, we have chosen three algorithms from different data mining paradigms: the Multilayer Perceptron (MLP), Support Vector Machines (SVMs), and K-Nearest Neighbour (K-NN) as an instance-based learning approach. We have also compared the results obtained before and after balancing the different datasets with the adopted LMS method.
The rest of the paper is organized as follows. Section 2 presents the state of the art and reviews several techniques applied to problems with imbalanced datasets. Section 3 presents the classification techniques (MLP, SVM and K-NN) and our proposed method (LMS). In Section 4, the experimental work is presented, and the obtained results are discussed and compared with other works in the literature. Finally, Section 5 draws conclusions and outlines possible directions for future research.

State of art
A variety of solutions have been proposed to address imbalanced learning. To give a comprehensive view of the issue, most state-of-the-art methods are grouped into the following categories. A critical and comprehensive survey on imbalanced learning can be found in [17].
Random oversampling of minority instances and undersampling of majority instances change the distribution of the original dataset [18]. Undersampling based on K-Nearest Neighbor (K-NN) [19] has also been presented. To overcome the disadvantages of the basic sampling methods, such as the risk of overfitting with oversampling and the risk of information loss with undersampling, the Synthetic Minority Oversampling Technique (SMOTE) [20] is used. For each original minority example, it selects one of its nearest neighbors at random and generates synthetic minority data by linear interpolation between the original example and the selected neighbor. Borderline Synthetic Minority Oversampling Technique (Borderline-SMOTE) [21] generates synthetic data only for the minority instances near the border rather than for every original minority instance. Adaptive Synthetic (ADASYN) [22] adaptively creates different quantities of synthetic data according to the density distribution. The Parallel Selective Sampling (PSS) technique [23] selects data from the majority class to reduce imbalance in large datasets. PSS is a filter method that can be combined with Support Vector Machine (SVM) classification. PSS-SVM showed excellent performance on synthetic datasets, much better than SVM alone. Other sampling strategies are integrated with ensemble learning techniques [24,25] to address the imbalanced learning issue. The Synthetic Minority Oversampling Technique in boosting (SMOTEBoost) [26] algorithm combines SMOTE with AdaBoost.M2 (Adaptive Boosting M2). Ranked Minority Oversampling in boosting (RAMOBoost) [27] adjusts the sampling weights of minority class examples based on the data distributions [16]. Other weighting approaches have been proposed to overcome the problem of imbalanced datasets.
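The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation of [20]: the helper names `k_nearest` and `smote_sample`, the toy minority points and the choice k = 2 are all hypothetical.

```python
import random

def k_nearest(x, points, k):
    """Return the k points closest to x (squared Euclidean distance), excluding x itself."""
    others = [p for p in points if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]

def smote_sample(x, neighbors, rng):
    """Create one synthetic point on the segment between x and a randomly chosen neighbor."""
    nn = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, nn)]

# Toy minority class (illustrative values only)
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [3.0, 3.5]]
rng = random.Random(0)
x = minority[0]
neighbors = k_nearest(x, minority, k=2)
synthetic = smote_sample(x, neighbors, rng)
```

Because the synthetic point always lies between an existing minority example and one of its neighbors, the minority region is filled in rather than duplicated, which is why SMOTE is less prone to overfitting than plain random oversampling.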
The Least Mean Square (LMS) [28] algorithm is proposed to penalise errors of different samples with different weights, with some rules of thumb to determine those weights. After the balancing phase, different classifiers (Support Vector Machine [SVM], K-Nearest Neighbour [K-NN] and Multilayer Perceptron [MLP]) are applied to the new balanced dataset. In addition, the results obtained by the LMS method are compared with the results obtained by the sampling methods (undersampling, oversampling and SMOTE). Other local strategies have been proposed to address the within-class imbalance issue of positive data sparsity by directly adjusting the induction bias of specificity-oriented learning algorithms. In the k Rare-class Nearest Neighbour (KRNN) algorithm [29], dynamic local query neighbourhoods are formed that contain at least k positive nearest neighbours, and the positive posterior probability estimation is biased towards the rare class based on the size and positive distribution in local regions.
The goal of cost-sensitive learning [30][31][32][33] is to account for the costs of misclassification through different cost matrices. Adaptive Cost-sensitive boosting (AdaCost) [34] combines cost-sensitive learning with boosting. Cost-sensitive decision trees [35] can adapt the pruning scheme for imbalanced data with misclassification costs by specifying a decision threshold. Cost-sensitive neural network models [36,37] are also widely applied to imbalanced learning [16].
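As an illustration of how a cost matrix changes a decision, the sketch below picks the class with the minimum expected misclassification cost. It is a generic textbook rule, not the method of any particular cited work; the function name and the cost values are hypothetical.

```python
def min_expected_cost_class(posteriors, cost):
    """Return the class index j minimizing sum_i posteriors[i] * cost[i][j].

    cost[i][j] is the cost of predicting class j when the true class is i.
    """
    n_classes = len(posteriors)
    expected = [
        sum(posteriors[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)

# Hypothetical costs: missing a positive (class 1) is ten times worse than a false alarm.
skewed_cost = [[0.0, 1.0],
               [10.0, 0.0]]
uniform_cost = [[0.0, 1.0],
                [1.0, 0.0]]

posteriors = [0.8, 0.2]  # the classifier believes the sample is probably negative
decision_skewed = min_expected_cost_class(posteriors, skewed_cost)
decision_uniform = min_expected_cost_class(posteriors, uniform_cost)
```

With uniform costs the sample is assigned to the majority class 0, whereas the skewed cost matrix moves the decision to the minority class 1 even though its posterior probability is only 0.2.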
The kernel-based learning approaches include many state-of-the-art techniques in the data mining domain [38][39][40][41]. The Granular Support Vector Machines-Repetitive Undersampling (GSVM-RU) algorithm [42] carries out an iterative learning procedure based on GSVM. Kernel-Boundary Alignment (KBA) [43] modifies the kernel matrix via a kernel function based on the distribution of the imbalanced data. Another typical kernel-based learning algorithm maximizes the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) graph [16,44].
Active learning methods [45][46][47] are traditionally adopted to handle the special issues related to training data without class labels (unlabeled data). In [48], termination criteria for active learning methods, based on maximal confidence and minimal error, are investigated for class imbalance issues in Word Sense Disambiguation (WSD) [16].

Materials and Methods
A brief description of the used algorithms is reported below:

Classification techniques used
In this work, we have used a K-Nearest Neighbor (K-NN) as a statistical machine, a Support Vector Machine (SVM) as a kernel machine, and a Multi-Layer Perceptron (MLP) as a neural network. Brief descriptions of these algorithms are already reported in literature [49].

Least Mean Square algorithm
The Least Mean Square (LMS) algorithm, also called the stochastic gradient algorithm, is relatively easy to implement and is based on a simple concept. It was introduced by Widrow and Hoff in 1960 [50].
The LMS algorithm is an adaptive algorithm that uses a gradient-based method of steepest descent. It uses estimates of the gradient vector computed from the available data. LMS incorporates an iterative procedure that makes successive corrections to the weight vector in the direction of the negative of the gradient vector, which eventually leads to the minimum mean square error.
Compared with other algorithms, the LMS algorithm is relatively simple; it requires neither the calculation of correlation functions nor matrix inversions [51].
In the LMS algorithm, the mean squared error is minimized by solving a system of linear equations. In this paper, to remedy the problem of learning from imbalanced datasets, we used a cost-sensitive extension of the Least Mean Square algorithm that penalizes errors of different samples with different weights.
The solution for least mean square classification can be found by solving a constrained minimization problem. The LMS algorithm is probably the most popular adaptive algorithm in existence because of its simplicity.
From the method of steepest descent, the weight vector equation is given by:

$$w(i+1) = w(i) + \frac{1}{2}\mu\left[-\nabla_w E\{e^2(i)\}\right]$$

where μ is the step-size parameter and controls the convergence characteristics of the LMS algorithm, and $E\{e^2(i)\}$ is the mean square error between the beamformer output $y(i) = w^T(i)\,x(i)$ and the reference signal $d(i)$, with

$$e^2(i) = \left[d(i) - w^T(i)\,x(i)\right]^2$$

The gradient vector in the above weight update equation can be computed as

$$\nabla_w E\{e^2(i)\} = -2P + 2R\,w(i)$$

In the method of steepest descent, the biggest problem is the computation involved in finding the values of the P and R matrices in real time. The LMS algorithm, on the other hand, simplifies this by using the instantaneous values of the covariance matrices P and R instead of their actual values, i.e.

$$\hat{R}(i) = x(i)\,x^T(i) \tag{6}$$

$$\hat{P}(i) = d(i)\,x(i)$$

These are simply the estimated instantaneous correlations. Therefore, the weight update can be given by the following equation:

$$w(i+1) = w(i) + \mu\,x(i)\left[d(i) - x^T(i)\,w(i)\right] = w(i) + \mu\,x(i)\,e(i)$$

Note that $w(i)$ is a random variable, since at each new iteration i it depends on the random processes x and d.
Therefore, the LMS algorithm can be summarized in the following equations [53]:

$$\text{Output:}\quad y(i) = w^T(i)\,x(i)$$

$$\text{Error:}\quad e(i) = d(i) - y(i)$$

$$\text{Weight update:}\quad w(i+1) = w(i) + \mu\,x(i)\,e(i)$$

The LMS algorithm is initiated with an arbitrary value w(0) for the weight vector at i = 0. The successive corrections of the weight vector eventually lead to the minimum value of the mean squared error.
μ is the step-size parameter and controls the convergence characteristics of the LMS algorithm:
• If μ is chosen to be very small, then the algorithm converges very slowly.
• A large value of μ may lead to faster convergence but may be less stable around the minimum value.
The LMS algorithm is very simple: it requires only 2L + 1 multiplications and 2L additions per iteration, where L is the number of filter coefficients.
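The LMS update (output, error, weight correction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lms_fit`, the zero initial weights, the toy data and the `cost` argument are assumptions. In particular, scaling each sample's update by a per-sample weight is only one plausible reading of the cost-sensitive extension, since the exact weighting rule is not given here.

```python
def lms_fit(X, d, mu=0.1, epochs=200, cost=None):
    """Train a linear model with the LMS (Widrow-Hoff) rule.

    X: list of feature vectors; d: desired outputs.
    cost: optional per-sample error weights (assumption: the cost-sensitive
    variant scales each sample's correction by its weight).
    """
    w = [0.0] * len(X[0])            # arbitrary initial weight vector w(0)
    if cost is None:
        cost = [1.0] * len(X)        # plain LMS: every sample weighs the same
    for _ in range(epochs):
        for x, target, c in zip(X, d, cost):
            y = sum(wj * xj for wj, xj in zip(w, x))   # output y(i) = w^T x(i)
            e = target - y                             # error  e(i) = d(i) - y(i)
            # weight update w(i+1) = w(i) + mu * c * x(i) * e(i)
            w = [wj + mu * c * e * xj for wj, xj in zip(w, x)]
    return w

# Toy data generated by d = 2*x1 - x2 (illustrative only)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
d = [2.0, -1.0, 1.0, 3.0]
w = lms_fit(X, d)
```

On this noise-free toy problem the weights approach (2, -1). The step size 0.1 is small enough here for stable convergence; a much larger value would make the updates oscillate, in line with the remarks on μ above.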

Medical datasets
We have used five medical datasets from UCI database [54]. In order to validate the proposed methods on each one, we chose a subset of these datasets providing a heterogeneous test bench. These five datasets are Pima Indian Diabetes, Wisconsin Breast Cancer (WBC), Wisconsin Diagnostic Breast Cancer (WDBC), Liver disorder and Appendicitis. The main characteristics of these datasets are depicted in Table 1.

Employed classifiers
In this subsection, we describe how we adjust some parameters of these techniques and how we estimate the classification reliabilities.
The K-NN algorithm requires no specific set-up. We test values of k ∈ {1, 3, 5, 7} and choose the value that provides the best performance on a validation set according to a fivefold cross-validation. We estimate the reliability of each classification decision from information derived directly from the output of the classifier, and we also analyze the situations in the feature space that give rise to unreliable classifications. For further details, see [49,55].
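The selection of k by fivefold cross-validation can be sketched as below. This is an illustrative pure-Python version under stated assumptions: the function names, the Euclidean distance, the unstratified fold assignment and the synthetic two-cluster data are all hypothetical and stand in for the actual experimental set-up.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points nearest to x (Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean accuracy of k-NN over n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / n_folds

def select_k(X, y, candidates=(1, 3, 5, 7)):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(X, y, k))

# Synthetic, well-separated two-class data (illustrative only)
rng = random.Random(1)
X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)]
X += [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(20)]
y = [0] * 20 + [1] * 20
best_k = select_k(X, y)
```

For imbalanced data, a stratified split and a criterion such as Gmean instead of plain accuracy would arguably be more appropriate; accuracy is used here only to keep the sketch short.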
We test an SVM with a Gaussian radial basis kernel. Values of the regularization parameter C and the scaling factor σ are selected within the intervals $[1, 10^{4}]$ and $[10^{-4}, 10]$, respectively, sampling both intervals on a log scale. The value of each parameter is tuned using a fivefold cross-validation on a validation set. The reliability of an SVM classification is estimated as proposed in [56], where the decision value of the classifier is transformed into a posterior probability [49].
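A log-scale search grid over the two intervals can be generated as in the sketch below. The sampling density (one value per decade) and the kernel parameterization exp(-||x - z||² / (2σ²)) are assumptions made for illustration; the text only states the intervals and that they are sampled on a log scale.

```python
import math

def log_grid(low, high, points_per_decade=1):
    """Values spaced evenly on a log10 scale from low to high (both included)."""
    lo, hi = math.log10(low), math.log10(high)
    n = round((hi - lo) * points_per_decade)
    return [10 ** (lo + j / points_per_decade) for j in range(n + 1)]

def rbf_kernel(x, z, sigma):
    """Gaussian radial basis kernel (assumed form: exp(-||x - z||^2 / (2 sigma^2)))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

C_values = log_grid(1, 1e4)         # five values covering 1 to 10^4
sigma_values = log_grid(1e-4, 10)   # six values covering 10^-4 to 10
grid = [(C, sigma) for C in C_values for sigma in sigma_values]
```

Each (C, σ) pair in `grid` would then be scored by fivefold cross-validation, and the pair with the best validation performance retained.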
We use an MLP with a number of hidden layers equal to half of the sum of the number of features and the number of classes. The number of neurons in the input layer is fixed by the number of features, whereas we chose two neurons in the output layer. The reliability is a function of the values provided by the neurons in the output layer [49,55].

Statistical metrics
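To assess the predictive ability of the constructed models, five statistical evaluation measures were employed. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, they are defined as follows:

$$\text{Correct classification rate (CC)} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}$$

$$\text{Sensitivity (SE)} = \frac{TP}{TP + FN}$$

$$\text{Specificity (SP)} = \frac{TN}{TN + FP}$$

Gmean is considered a measure of balanced accuracy and is defined as:

$$\text{Gmean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{13}$$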
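The five measures can be computed directly from the confusion-matrix counts, as in this short sketch. The function name and the example counts are hypothetical and are not taken from the experiments; the formulas are the standard definitions of these measures.

```python
import math

def imbalance_metrics(tp, tn, fp, fn):
    """Correct classification rate, error rate, Sensitivity, Specificity and Gmean."""
    total = tp + tn + fp + fn
    cc = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on the negative class
    return {
        "CC": cc,
        "Error": 1.0 - cc,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Gmean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical counts: 50 positives (40 found) and 950 negatives (900 found)
metrics = imbalance_metrics(tp=40, tn=900, fp=50, fn=10)
```

Here the correct classification rate is 0.94 while Sensitivity is only 0.8, which shows how the class-wise measures expose weaknesses that the overall rate hides.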

Statistical performance of different classifiers combined with LMS
To assess the influence of LMS on the different classifiers (MLP, SVM, K-NN), we compare their performance in two cases (with and without LMS). Objective statistical metrics must be selected to estimate the performance of the classifiers. Indeed, for the imbalanced classification problem, the overall classification accuracy is often not an appropriate measure of performance, given that a trivial classifier that predicts every sample as the majority class could achieve very high accuracy in extremely skewed domains. In the present work, instead of more complicated metrics, five intuitive and practical measures (correct classification rate, error rate, Sensitivity, Specificity and Gmean) were adopted to evaluate the classifiers, for the following reasons. First, both Sensitivity and Specificity provide a class-by-class performance estimate, which makes it easy to investigate the predictive ability of a classification method for each class, especially for the minority classes of interest. Second, Gmean combines Sensitivity and Specificity and indicates the balance between classification performance on the majority and minority classes. Poor performance in predicting the positive (interesting) samples still leads to a low Gmean value, even if the negative samples are classified with high accuracy, which is a common case for imbalanced datasets. The results of the comparative study are summarized in Table 2.
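To make the argument about accuracy concrete, consider a hypothetical dataset (the numbers are chosen for illustration and are not taken from the experiments) with 950 negative and 50 positive samples, and a trivial classifier that predicts every sample as negative, so that TP = 0, FN = 50, TN = 950 and FP = 0:

```latex
\begin{align*}
\text{CC} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 950}{1000} = 0.95 \\
\text{Sensitivity} &= \frac{TP}{TP + FN} = \frac{0}{50} = 0 \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{950}{950} = 1 \\
\text{Gmean} &= \sqrt{\text{Sensitivity} \times \text{Specificity}} = \sqrt{0 \times 1} = 0
\end{align*}
```

The correct classification rate of 95% looks excellent, yet no minority sample is recognized; Gmean drops to zero and reveals the failure.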
The classification of the different imbalanced databases used in this work involves four steps:
• Step 1: application of the different techniques (SVM, K-NN and MLP) to the imbalanced data.
• Step 2: application of the LMS algorithm to remedy the data imbalance.
• Step 3: application of the different techniques (SVM, K-NN and MLP) to the resulting balanced data.
• Step 4: comparison between the results obtained in Step 1 and Step 3.
We notice from these experiments that the classification performance measures (CC, SE, SP, and Gmean) increase after balancing the databases with the least mean square algorithm.
Before balancing the different datasets, the minority class is hardly recognized by the different classifiers (MLP, SVM and K-NN). After balancing these imbalanced databases with the LMS algorithm, however, the performance improves significantly, as illustrated in Table 2: the Sensitivity, the Specificity, the correct classification rate, and the Gmean all increase. Therefore, we have obtained the best classification performances. We can say that the classifiers recognize both the minority classes and the majority classes well, since in our experiments the samples of the minority and majority classes are correctly classified (TP and TN increase, while FN and FP decrease after balancing).

Behavior of descriptors before and after balancing approach
To validate the influence of LMS on the different techniques (MLP, SVM and K-NN), we compare the values of the descriptors before and after balancing. We take a case from the minority class that was misclassified before balancing the different databases (PIMA, WBC, WDBC, Liver disorder and Appendicitis) and apply the LMS algorithm, in which each descriptor is weighted by a coefficient; the same case is then correctly classified (see Figure 1).
We also notice from Figure 1(a) that some descriptors in the PIMA dataset remain unchanged (D1, D5, D7), whereas the rest change by a certain percentage, which enhances the importance of those attributes. In the other databases (WBC, WDBC, Liver disorder and Appendicitis), we also observe changes in the different descriptors (Figure 1(b), (c), (d), (e)).

Comparative study with related works
In this section, we have compared the classification accuracies of our method with other methods applied to the same database: Table 3 gives the classification accuracies of our method and other methods applied on the PIMA database.

Table 3. Classification accuracies obtained with our method and other classifiers in the literature (PIMA)
L. Gonzalez-Abril et al. proposed a new Support Vector Machine method (called GSVM), specially designed for bi-classification problems, whose objective is balanced accuracy between classes [57]. To evaluate the results, they used 23 databases and obtained an accuracy of 74.15% on the Pima dataset. Y. Shao et al. proposed an efficient Weighted Lagrangian Twin Support Vector Machine (WLTSVM) for imbalanced data classification, which uses different training points to construct the two proximal hyperplanes [58]; they achieved an accuracy of 76.78 ± 0.35%. In this work, as can be seen from the results (Table 3), our method (MLP with LMS, SVM with LMS and K-NN with LMS) gave excellent classification accuracy.

Table 4. Classification accuracies obtained with our method and other classifiers in the literature (WBC)
Method              Classification Accuracy (%)
S-AIRS [59]         96.91
WLTSVM [58]         96.30 ± 0.31
MLP with LMS        99.56
SVM with LMS        99.12
K-NN with LMS       100

Wang and Adrian proposed a hybrid method combining the Synthetic Minority Over-Sampling Technique (SMOTE) and the Artificial Immune Recognition System (AIRS) to handle the imbalanced data problem that is prominent in medical data. This approach is denoted S-AIRS [59]. They obtained 96.91% accuracy. Y. Shao et al. proposed WLTSVM [58] and achieved an accuracy of 96.30 ± 0.31%. In this study, as can be seen from the results (Table 4), our approach obtains the best classification accuracy with the different classifiers. Table 5 gives the classification accuracies of our method and other methods applied to the WDBC database.

Table 5. Classification accuracies obtained with our method and other classifiers in the literature (WDBC)
Method                       Classification Accuracy (%)
S-AIRS [59]                  96.52
K-NN with resampling [60]    98.42
MLP with LMS                 100
SVM with LMS                 100
K-NN with LMS                100

Wang and Adrian proposed the hybrid method S-AIRS [59]. Their approach obtained 96.52% accuracy. G. Naga Ramadevi et al. applied five classifiers (K-NN, SVM, Logistic Regression, C4.5 and Random Forest) to four original breast cancer datasets with and without a resampling technique, and compared the performances obtained before and after resampling the datasets [60]. They obtained the best accuracy, 98.42%, by using K-NN with the resampling method. In this work, as can be seen from the results (Table 5), our approach obtains the best classification accuracy. Table 6 gives the classification accuracies of our method and other methods applied to the Liver disorder database.

Alberto Cano et al. proposed an algorithm called weighted Data Gravitation Classification (DGC+) that compares the gravitational field of the different data classes to predict the class with the highest magnitude. The proposal improves previous data gravitation algorithms by learning the optimal weights of the attributes for each class and solves some of their issues, such as nominal attribute handling, imbalanced data performance, and noisy data filtering [61]. They achieved an accuracy of 67.44%. L. Gonzalez-Abril et al. proposed the GSVM method [57] and obtained an accuracy of 71.07%. In this work, as can be seen from the results (Table 6), our approach obtains the best classification accuracy with the different classifiers. Table 7 gives the classification accuracies of our method and other methods applied to the Appendicitis database.

Table 7. Classification accuracies obtained with our method and other classifiers in the literature (Appendicitis)
Method              Classification Accuracy (%)
DGC+ [61]           84.09
BSMAIRS [62]        92.5926
MLP with LMS        100
SVM with LMS        100
K-NN with LMS       94.29

Alberto Cano et al. proposed the DGC+ algorithm [61] and achieved an accuracy of 84.09%. Kung-Jeng et al. developed a hybrid classifier approach that combines the Borderline Synthetic Minority oversampling technique (BSM) and the Artificial Immune Recognition System (AIRS) as a global optimization searcher, with the nearest neighbor algorithm used as a local classifier. This approach is denoted BSMAIRS. To evaluate the results, Kung-Jeng et al. used a fivefold cross-validation strategy and obtained five accuracies, the best of which was 92.5926% [62]. In this study, as can be seen from the results (Table 7), our approach obtains an excellent classification accuracy.

Conclusion
In this paper, we proposed a learning method based on a cost-sensitive extension of the least mean square algorithm that penalizes errors of different samples with different weights. This approach is used to overcome the problem of imbalanced data by giving high weights to the samples of the minority classes.
The proposed approach was applied to five medical datasets from the UCI database to assess its performance. Experimental results revealed that the LMS algorithm performed better (achieved higher performance values) than the other balancing methods, which clearly shows the advantage of LMS when handling imbalanced data. Moreover, the results showed that combining LMS with different techniques (MLP, SVM and K-NN) can enhance classifier performance, particularly in terms of accuracy.
In future work, we plan to apply our approach to multiclass datasets and to test the LMS algorithm with other intelligent techniques, such as fuzzy logic, in order to increase the interpretability of the results. We can also vary the ratio of the minority class to the majority class in order to study such situations. Our approach is intended to overcome the disadvantages of the basic sampling methods: because the LMS algorithm penalizes errors of different samples with different weights, it neither eliminates instances of the majority class nor adds instances to the minority classes. We can conclude that this method keeps the same database.