Improved Nonlinear Fuzzy Robust PCA for Anomaly-based Intrusion Detection



Background
Thanks to the major technological shift of the twenty-first century, the complexity of network security has greatly increased, giving birth to highly sophisticated attacks. Traditional security methods such as firewalls, data encryption and user authentication are insufficient to protect network systems against all existing threats, and may therefore be ineffective against several dangerous attacks. Consequently, we need to strengthen our systems with more powerful tools such as the intrusion detection system (IDS). An IDS protects network systems by preventing the damage that an intrusion could cause. There are commonly two principal categories of IDS: misuse-based and anomaly-based. The misuse-based method classifies an attack by comparing its signature with the signatures currently stored in a database of known attacks, and raises an alarm if any malicious activity is detected. The two best-known misuse detection methods are STAT [1] and Snort [2]. This technique has proved effective in detecting the attacks stored in such databases, but it cannot detect new intrusions, and maintaining the databases is very expensive. Anomaly-based detection, initiated by Anderson [3] and Denning [4], addresses this limitation: its fundamental idea is to specify the normal behaviour or model and to raise an alarm whenever the difference between an observation and the defined model exceeds a predefined threshold. The strength of this approach is its capability to detect new and unusual intrusions.
Nonetheless, current network traffic data, which are usually tremendous in volume, pose a real challenge to anomaly-based IDS. Such traffic may slow down the whole detection mechanism and frequently leads to falsified classification accuracies. These huge datasets often contain redundant and noisy data that can be very difficult to model.
To address this, many feature extraction techniques have been used to increase the efficiency of network IDS. For instance, [5] used Discrete Differential Evolution to recognize the most important features, and the detection accuracies were enhanced significantly. Likewise, [6] presented an IDS based on the Ant Colony Optimization algorithm that can detect several attacks by exploring only a small number of features. In [7], the authors used a feature selection method called Cuttlefish, which suppresses noisy and redundant data while simultaneously guaranteeing data quality. The authors of [8] proposed using Principal Component Analysis (PCA) and Fuzzy Principal Component Analysis (FPCA) as a pre-processing step before applying the k-nearest-neighbor (KNN) classifier; the same authors suggested in another publication [9] an improved variant called Robust Fuzzy PCA. The acquired results show the promising performance of the proposed techniques with regard to network attack detection as well as false alarm reduction.
Nevertheless, PCA [10]-[12] and its linear variants are known to be sensitive to noise and outliers, which can distort the derived principal components (PCs) [13] and therefore also affect the classification results. In addition, PCA allows only a linear dimensionality reduction [14]. Hence, in the case of complicated nonlinear structures, the data cannot be correctly expressed in a linear space and linear variants of PCA are not the optimal solution. To deal with this issue, Nonlinear Fuzzy Robust PCA (NFRPCA) [15] was suggested to compute the PCs using a nonlinear technique.
Nevertheless, this method is based on the L2-norm, which is highly sensitive to outliers and squares the error, so the model can incur a much larger error. To tackle this issue, we introduce a new variant of NFRPCA called Lp-norm NFRPCA. Note that this paper is an extension of work originally published in the 2018 IEEE 5th International Congress on Information Science and Technology (CiSt) [16], in which we suggested a variant of NFRPCA employing the L1-norm rather than the classical Euclidean norm. In this paper, we propose a further variant of NFRPCA using the Lp-norm, and we conduct new experiments in addition to those previously reported in [16].
The remainder of this paper is structured as follows: Section 2 delivers a brief presentation of PCA, and Section 3 presents an overview of NFRPCA. We present the suggested techniques in Section 4. Section 5 gives a short overview of two popular datasets, namely KDDcup99 and NSL-KDD. Section 6 presents the conducted experiments and discusses the results, and conclusions are summarized in Section 7.

Principal Component Analysis Method
Principal Component Analysis (PCA) [17] is an exploratory data analysis tool that operates on a dataset of observations on several variables; it has been employed extensively in many research areas. The principal concept of PCA is to transform the data into a reduced form while preserving most of the variance of the original data. In other terms, the major role of PCA is to transform n correlated variables into d uncorrelated variables; these d uncorrelated variables are usually called the principal components (PCs) [14][18].
Consider a data set of M vectors v_1, v_2, v_3, ..., v_M, where each vector is represented by N features. To obtain the PCs we follow the steps below:
• Compute the mean of the data set: µ = (1/M) Σ_{i=1}^{M} v_i.
• Determine the deviation from the mean: θ_i = v_i − µ.
• Compute the covariance matrix of the corresponding data set: C = (1/M) A A^T, where A = [θ_1, θ_2, θ_3, ..., θ_M].
• Let U_k be the k-th eigenvector of C, λ_k the related eigenvalue, and U_{N×d} = [U_1, U_2, ..., U_d] the matrix of these eigenvectors. Hence C U_k = λ_k U_k.
• Sort the eigenvalues in decreasing order, then pick the eigenvectors (known as principal components PC_i) that have the largest eigenvalues. The number d of PCs to keep can be chosen from the proportion of variance they retain, Σ_{k=1}^{d} λ_k / Σ_{k=1}^{N} λ_k.
• Given a new sample column vector t, project t on the reduced subspace spanned by the PC_i: ω = U^T (t − µ).
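The steps above can be sketched in Python/NumPy; this is a minimal illustration of the procedure, not the implementation used in the experiments:

```python
import numpy as np

def pca_fit(V, d):
    """PCA steps above: V is an M x N data matrix (one sample per row),
    d is the number of principal components to keep."""
    mu = V.mean(axis=0)                   # mean of the data set
    Theta = V - mu                        # deviations from the mean
    C = (Theta.T @ Theta) / V.shape[0]    # covariance matrix C = (1/M) A A^T
    eigvals, eigvecs = np.linalg.eigh(C)  # eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:d]]             # N x d matrix of the top-d eigenvectors
    return mu, U

def pca_project(t, mu, U):
    """Project a new sample t on the reduced subspace spanned by the PCs."""
    return U.T @ (t - mu)
```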

Nonlinear Fuzzy Robust PCA (NFRPCA) Method
The nonlinear fuzzy robust principal component analysis employed in the current paper was suggested initially by Luukka in [15]. It was inspired by the robust principal component techniques that Yang and Wang proposed in [19], which were fundamentally built on Xu and Yuille's techniques [20], where PCA learning rules are associated with energy functions and a cost function is defined with regard to outliers. In Yang and Wang's methods, the cost function was modified to be fuzzy, and it includes Xu and Yuille's techniques as a particular case. We introduce these methods briefly in this section. Xu and Yuille [20] suggested an objective function, based on u_i ∈ {0, 1}:

E(U, w) = Σ_{i=1}^{n} u_i e(x_i) + η Σ_{i=1}^{n} (1 − u_i)

where η is a threshold. It should be noted that w is a continuous variable while u_i is a binary variable, which makes the optimization challenging to solve with a gradient descent technique. To solve the problem using gradient descent, the minimization problem is transformed into the maximization of a Gibbs distribution:

P(U, w) = exp(−E(U, w)) / Z

where Z is the partition function ensuring that P(U, w) sums to 1.
The measure e(x_i) may be, e.g., one of the following functions:

e_1(x_i) = ||x_i − w w^T x_i||^2,
e_2(x_i) = ||x_i||^2 − (w^T x_i)^2 / (w^T w).

To minimize E_1 = Σ_{i=1}^{n} e_1(x_i) and E_2 = Σ_{i=1}^{n} e_2(x_i), the gradient descent rules are

w_new = w_old + α_t [y(x_i − u) + (y − v) x_i]   (for E_1),
w_new = w_old + α_t (y x_i − y^2 w_old)   (for E_2),

where y = w^T x_i, u = y w, v = w^T u and α_t is the learning rate. The nonlinear case of PCA was presented as

e_3(x_i) = ||x_i − w g(y)||^2,

where y = w^T x_i and g can be selected as a nonlinear function. The weight update in this situation is

w_new = w_old + α_t (x_i e^T w_old F + e g(y)),

where e = x_i − w_old g(y) and F = d/dy g(y).
The cost function proposed by Yang and Wang is

E(U, w) = Σ_{i=1}^{n} u_i^{m_1} e(x_i) + η Σ_{i=1}^{n} (1 − u_i)^{m_1},

subject to u_i ∈ [0, 1] and m_1 ∈ [1, ∞). Here u_i is the membership value of x_i in the data cluster, (1 − u_i) is the membership value of x_i in the noise cluster, m_1 is the fuzziness variable, and e(x_i) is the error between the class center and x_i. As u_i is now a continuous variable, the complexity of a combination of continuous and discrete optimization can be avoided and the gradient descent technique can be employed.
The gradient of E (15) is calculated with regard to u_i. Setting δE/δu_i = 0, we obtain

u_i = 1 / (1 + (e(x_i)/η)^{1/(m_1 − 1)}).

Substituting this membership back into E yields an objective that depends only on w, and the gradient with regard to w can then be taken as before, where m_1 is the fuzziness variable. If m_1 = 1, the fuzzy membership reduces to a hard membership and can be picked according to the rule

u_i = 1 if e(x_i) < η, and u_i = 0 otherwise;

η is then a hard threshold. There is no fixed rule for setting m_1. We summarize the NFRPCA steps in Algorithm 1.
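The closed-form membership can be computed directly; a minimal sketch, where err holds the errors e(x_i):

```python
import numpy as np

def fuzzy_membership(err, eta, m1):
    """Membership in the data cluster, u_i = 1 / (1 + (e(x_i)/eta)^(1/(m1-1)));
    err may be a scalar or an array of errors, m1 > 1 is the fuzziness variable."""
    return 1.0 / (1.0 + (np.asarray(err) / eta) ** (1.0 / (m1 - 1.0)))
```

An error equal to the threshold η gives a membership of exactly 0.5, and the membership tends to 1 (data cluster) or 0 (noise cluster) as the error vanishes or grows.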

Algorithm 1 NFRPCA algorithm
Step 1: First set the iteration counter t = 1, the iteration bound T, the learning coefficient α_0 ∈ (0, 1] and the soft threshold η to a small positive value, then randomly initialize the weight w.
Step 2: While t is smaller than T, do steps 3-9.
Step 4: While i is smaller than n, do steps 5-8.
Step 5: Compute y = w^T x_i, u = y w and v = w^T u.
Step 6: Compute g(y), F = d/dy g(y) and e_3(x_i) = x_i − w_old g(y); then update the weight:
w_new = w_old + α_t (x_i e^T w_old F + e_3(x_i) g(y)).
In [15], g(y) was selected to be a sigmoid function such as g(y) = tanh(10y); F is the first derivative of g(y).
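As an illustration, the loop of Algorithm 1 can be sketched as follows for a single weight vector w; the decreasing learning-rate schedule α_t = α_0(1 − t/(T+1)) is our assumption, since the paper only introduces α_0:

```python
import numpy as np

def nfrpca(X, T=30, alpha0=0.02, seed=0):
    """Sketch of Algorithm 1 for one weight vector; X holds one sample per
    row. g(y) = tanh(10y) and F = g'(y) = 10(1 - tanh(10y)^2), as in [15]."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])             # Step 1: random initial weight
    for t in range(1, T + 1):                   # Step 2: iterate up to T
        alpha = alpha0 * (1.0 - t / (T + 1.0))  # assumed learning-rate schedule
        for x in X:                             # Step 4: loop over the samples
            y = w @ x                           # Step 5: y = w^T x_i
            g = np.tanh(10.0 * y)
            F = 10.0 * (1.0 - g ** 2)           # first derivative of g
            e3 = x - w * g                      # reconstruction error vector
            w = w + alpha * (x * (e3 @ w) * F + e3 * g)  # Step 6: weight update
    return w
```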

Nonlinear Fuzzy Robust L1-norm PCA Method
We can clearly see that the NFRPCA technique uses the Euclidean norm to compute the reconstruction error in (13), and it is widely recognized that the Euclidean norm squares the reconstruction error, so the approach can incur a much larger error. Consequently, this can falsify the results and deteriorate the quality of the solutions. To address this issue, [16] suggests using the L1-norm to compute the reconstruction error. The reconstruction error equation can be rewritten as

e'_3(x_i) = ||x_i − w' g(y')||_1,

where y' = w'^T x_i and g can be picked as a nonlinear function. Here the weight update is

w'_new = w'_old + α_t (x_i e'^T w'_old F' + e'_3(x_i) g(y')),

where F' = d/dy' g(y'). Exactly as in Algorithm 1, we use the weight update to compute the PCs.

Algorithm 2 L1-NFRPCA algorithm
Step 1: First set the iteration counter t = 1, the iteration bound T, the learning coefficient α_0 ∈ (0, 1] and the soft threshold η to a small positive value, then randomly initialize the weight w'.
Step 2: While t is smaller than T, do steps 3-9.
Step 4: While i is smaller than n, do steps 5-8.
Step 5: Compute y' = w'^T x_i, u' = y' w' and v' = w'^T u'.
Step 6: Compute g(y'), F' = d/dy' g(y') and e'_3(x_i) = x_i − w'_old g(y'); then update the weight:
w'_new = w'_old + α_t (x_i e'^T w'_old F' + e'_3(x_i) g(y')).
In the proposed algorithm, g(y') was picked to be a sigmoid function such as g(y') = tanh(10y'), and F' is the first derivative of g(y'). We use the term L1-norm NFRPCA to refer to this technique in the rest of this paper.
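The only change with respect to the original method is the norm used for the reconstruction error; the contrast can be sketched as follows (the helper names are ours):

```python
import numpy as np

def recon_error_l2(x, w, g_y):
    """Squared Euclidean reconstruction error used by the original NFRPCA."""
    e = x - w * g_y
    return float(e @ e)

def recon_error_l1(x, w, g_y):
    """L1-norm reconstruction error used by the L1-NFRPCA variant."""
    return float(np.abs(x - w * g_y).sum())
```

For residual entries larger than one, the squared error grows quadratically while the L1 error grows only linearly, which is exactly the motivation stated above.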

Nonlinear Fuzzy Robust Lp-norm PCA Method
The technique that we suggest is Lp-norm NFRPCA, where we propose to substitute the generalized Lp-norm for the L2-norm in the computation of the reconstruction error, in order to limit the large errors that the L2-norm can cause. The reconstruction error equation is rewritten as

e''_3(x_i) = ||x_i − w'' g(y'')||_p^p, with 0 < p ≤ 2,

where y'' = w''^T x_i and g can be selected as a nonlinear function. Here the weight update is

w''_new = w''_old + α_t (x_i e''^T w''_old F'' + e''_3(x_i) g(y''))   (23)

where F'' = d/dy'' g(y''). In the same way as in Algorithm 1, we use the weight update to compute the PCs. Finally, the main steps of the proposed method are summarized in Algorithm 3.

Algorithm 3 Lp-norm NFRPCA algorithm
Step 1: First set the iteration counter t = 1, the iteration bound T, the learning coefficient α_0 ∈ (0, 1] and the soft threshold η to a small positive value, then randomly initialize the weight w''.
Step 2: While t is smaller than T, do steps 3-9.
Step 4: While i is smaller than n, do steps 5-8.
Step 5: Compute y'' = w''^T x_i, u'' = y'' w'' and v'' = w''^T u''.
Step 6: Compute g(y''), F'' = d/dy'' g(y'') and e''_3(x_i) = x_i − w''_old g(y''); then update the weight:
w''_new = w''_old + α_t (x_i e''^T w''_old F'' + e''_3(x_i) g(y'')).
The function g(y'') was picked to be a sigmoid function such as g(y'') = tanh(10y''), and F'' is the first derivative of g(y''). We use the term Lp-norm NFRPCA to refer to this technique in the rest of this paper.
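The Lp reconstruction error can be sketched in the same way; here we write it as the sum of p-th powers of the absolute residuals, which reduces to the L1 and squared L2 cases for p = 1 and p = 2 (the helper name is ours):

```python
import numpy as np

def recon_error_lp(x, w, g_y, p=0.5):
    """Lp-norm reconstruction error of the proposed variant, with 0 < p <= 2;
    p = 0.5 is the value used in the experiments of this paper."""
    return float((np.abs(x - w * g_y) ** p).sum())
```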

Datasets
In our experiments, we used the popular public intrusion datasets KDDcup99 and NSL-KDD. We present them briefly in the next subsections.

KDDCup99 Dataset
KDDCup99 [21, 22] is a dataset containing many TCP dump records captured during nine weeks. This dataset was prepared, and is still maintained, under the DARPA Intrusion Detection Evaluation Program. The main aim of KDDCup99 is to establish a generalized dataset for evaluating research in the intrusion detection field. The training dataset represents 4 gigabytes of compressed data (most of it binary TCP dump) and contains 4,898,431 records, while the test dataset contains around 311,029 connection records; each record in the dataset contains exactly 41 features. The attacks in KDDCup99 are categorized as follows: • Denial of Service (DoS): a cyber-attack where the perpetrator attempts to consume network or machine resources to make them unavailable or limited to their legitimate users.
• Remote to Local (R2L): the attacker controls a remote machine while pretending to be a user of the system, by exploiting the system's vulnerabilities.
• User to Root (U2R): by exploiting the vulnerabilities and flaws existing in a machine or on a network, the hacker starts from a normal user account and tries to gain root access privileges to the system.
• Probing: the intruder tries to collect useful information about all the services and machines existing on the same network in order to exploit it later.

NSL-KDD Dataset
The NSL-KDD [23] dataset was created to alleviate some of the major shortcomings of the KDDCup99 dataset. This new version improves on KDDCup99 and solves a few of its fundamental issues. The main advantages of NSL-KDD are as follows: • Redundant instances were removed from the training set.
• The test set does not contain duplicate records.
• The current version allows us to use the whole dataset. Correspondingly, random selection is no longer needed thanks to the reduced number of instances in both the train and test sets. Consequently, the accuracy and consistency of the evaluation and review of research works increase.
• Each complexity-level group contains a number of instances that is inversely proportional to the percentage of instances in the KDDCup99 dataset. Therefore, we obtain a more realistic evaluation of various machine learning techniques.

Normalization Phase
The normalization phase is essential, because it allows us to apply the techniques cited above to the datasets in a correct manner. To perform this phase effectively, we replaced the discrete values with continuous values for all the discrete attributes in the datasets through the same process used in [10], briefly clarified as follows: every attribute y that accepts x distinct values is represented by x coordinates made of zeros and ones. For example, the protocol-type attribute accepts three values, i.e. tcp, udp or icmp. Following this logic, these values are mapped to the coordinates (1,0,0), (0,1,0) and (0,0,1).
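As a sketch, the mapping described above amounts to one-hot encoding each discrete attribute:

```python
def one_hot(value, categories):
    """Replace a discrete attribute value by 0/1 coordinates, as in the
    protocol-type example above."""
    return [1 if value == c else 0 for c in categories]

# the protocol-type attribute accepts three values: tcp, udp, icmp
print(one_hot("udp", ["tcp", "udp", "icmp"]))  # -> [0, 1, 0]
```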

Experiments and Discussion
In this part of the paper, several experiments were performed on the KDDcup99 and NSL-KDD databases in order to evaluate the effectiveness of the suggested techniques. We compute the following measures: detection rate (DR), F-measure and false positive rate (FPR), where:
• an intrusion that is successfully predicted is a true positive (TP);
• an intrusion that is wrongly predicted as normal is a false negative (FN);
• a normal connection that is wrongly predicted as an intrusion is a false positive (FP);
• a normal connection that is correctly predicted is a true negative (TN).
These counts give DR = TP/(TP + FN), FPR = FP/(FP + TN), and the F-measure as the harmonic mean of precision and DR. The classifier used in all our experiments is the nearest neighbor, and in order to obtain more trustworthy results we averaged over twenty runs. The most robust feature extraction technique should have the highest DR and F-measure and the lowest possible FPR. The simulation settings used in our first experiments are as follows: for the training sample we randomly selected 1000 normal, 100 DOS, 50 U2R, 100 R2L and 100 PROBE connections; the test sample contains 100 normal, 100 DOS, 50 U2R, 100 R2L and 100 PROBE connections, also chosen randomly from the test database. The simulation settings are identical for both the KDDcup99 and NSL-KDD datasets, and the value of p is set to 0.5 in all our experiments. In our first experiments, we compared PCA, NFRPCA, L1-NFRPCA and Lp-norm NFRPCA in order to choose the ideal number of principal components (PCs) for each feature extraction technique, which helps drastically to raise the F-measure and detection rate (DR) and decrease the FPR. To achieve that, we first trained our model with PCA, NFRPCA, L1-NFRPCA and Lp-norm NFRPCA, thus obtaining the PCs. With regard to the false positive rate (FPR), all the algorithms cited above were implemented, but only the proposed algorithms (L1-NFRPCA and Lp-norm NFRPCA) achieve the lowest FPR for both datasets, as clearly seen in Figure 5 and Figure 6. In the second stage of our experiments, we examined all the techniques cited above under a wide range of training set sizes, and their impact on the DR, FPR and F-measure. To achieve that, the structure of the test set was kept intact by fixing it at 100 normal connections, 100 DOS, 50 U2R, 100 R2L and 100 PROBE.
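From these counts, the three measures can be computed with the standard formulas; a minimal sketch:

```python
def detection_metrics(tp, fn, fp, tn):
    """Detection rate, false positive rate and F-measure from confusion counts."""
    dr = tp / (tp + fn)                     # detection rate (recall)
    fpr = fp / (fp + tn)                    # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * dr / (precision + dr)
    return dr, fpr, f_measure
```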
Figure 8 asserts that the proposed methods produce a detection rate higher than the original ones, proving that the methods are very powerful in differentiating between normal connections and attacks.
In Figure 9 and Figure 10, we can clearly see that L1-NFRPCA achieves at least a 5% improvement over Lp-norm NFRPCA and 10% over the original NFRPCA and classical PCA; the new approaches consistently surpass the original techniques. In terms of FPR, Figure 11 and Figure 12 show that L1-NFRPCA still gives the lowest FPR even under different training set sizes. These results support the great capability of the new approaches to classify the connections independently of the training sample size.
Figure 14 exhibits the correlation between CPU time and the number of principal components. As clearly indicated, increasing the number of principal components (PCs) engenders a huge consumption of time. In addition, we can clearly observe in the figures that the suggested techniques are computationally faster than the original algorithms.
To obtain more precise results, we performed an experiment in which we compared side by side the DR of every single category of attack for PCA, NFRPCA, L1-NFRPCA and Lp-norm NFRPCA. According to Table I and Table II, it is clear that the DR of L1-NFRPCA and Lp-norm NFRPCA for U2R and DOS attacks is often the highest compared to that of NFRPCA and PCA.

Conclusion
Like several linear statistical techniques, Principal Component Analysis (PCA) has many shortcomings: it is limited to Gaussian distributions and is highly sensitive to noise. In addition, the principal components are frequently damaged by outliers; therefore, feature extraction using PCA is not reliable if outliers exist in the data. To tackle this issue, we proposed effective new variants of a nonlinear feature extraction technique, Nonlinear Fuzzy Robust PCA, for anomaly-based intrusion detection. The experiments performed on the popular databases (KDDcup99 and NSL-KDD) confirmed the effectiveness of the suggested approaches; the new variants outperform NFRPCA and PCA in detecting most categories of attacks and in reducing false positive alarms.

Conflict of Interest
The authors declare no conflict of interest.