Machine Learning for Network Intrusion Detection Based on SVM Binary Classification Model

A R T I C L E I N F O A B S T R A C T Article history: Received: .23 June, 2020 Accepted: 20 July, 2020 Online: 25 August, 2020 Recently, the number of connected machines around the worldwide has become very large, generating a huge amount of data either to be stored or to be communicated. Data protection is a concern for all institutions, it is difficult to manage the masses of data that are susceptible to multiple threats. In this work, we present a novel method of Intrusion Detection System (IDS) based on the detection of anomalies in computer networks. The aim is to use artificial intelligence techniques in the form of Machine Learning (ML) for intrusion detection. For this purpose, we have proposed a Support Vector Machine (SVM) classification model with two kernels, one Polynomial and the other Gaussian. This model is trained and tested with the recent UNSWNB-15 dataset. Regarding the results obtained, we have evaluated our model with six metrics capable of offering all potential threats. As a result, we have achieved a percentage of 94% for the detection rate (DR).


Introduction
The application of artificial intelligence in the field of computer network security has become an inescapable reality. In this context, we propose an extended version of an oral communication [1] which deals with network intrusion detection based on the SVM model.
The use of computer network services has become essential in several areas, particularly those related to daily life. The exchange of data can be very costly in the case of sometimes fatal threats, especially when it is a large-scale exchange at the level of governments or businesses [2]. These threats affect the integrity, confidentiality, and availability of these data. Possible solutions for the protection of computer networks are divided into two categories, those related to hardware or those related to software. We can cite firewalls, antivirus, cryptography, etc. as solutions. However, these solutions are insufficient to protect all networks [3]. In order to strengthen network security, another higher level of security such as IDSs is needed.
An IDS is a medium that allows us to protect our computer network from any unwanted activity as described in [4]. It filters incoming and outgoing network traffic to detect possible threats in the network. We distinguish between two types of IDS, the classical ones based on signatures as in the open-source software SNORT. They have a database of intrusion signatures. The signature of each incoming entity is compared with the signatures present in this database. Thus, the signature-based-IDS is very efficient for the detection of common threats and generates a very low FAR (False Alarm Rate). Despite this, it is unable to detect modern attacks. The second type of IDS is based on artificial intelligence techniques using ML for the detection of new attacks (0 Day attack) as described in the work [5]. However, the only problem is the very high false alarm rate generated by this type of IDS. This is the reason why several scientific research works are interested in this topic.
In this article, we present an IDS based on the technique of ML. We have chosen to use an SVM binary classification model associated with a UNSWNB-15 dataset for training and testing the proposed model. This article is structured into seven sections: In section 2, we present the related work. In the next section, we discuss the basic principle of an IDS. Then, section 4 concerns the existing data sources and the description of the UNSWNB-15 data set that we have used in this paper. In Section 5, we present our methods. In section 6, we show the performance of our proposed model with a comparison and interpretation of the results obtained. In the last section, we have a general conclusion and the perspectives of this work.

Related Work
In recent years, protecting computer networks against various threats is a priority for most companies. The evolution of cyberattacks requires the use of highly developed ways for the detection of these threats. Among the first works, we can cite the work of James Canady [6] who used the Artificial Neural Network (ANN) classification model in 1998. Netssan P, et al [7] obtained a detection rate of 85.87% with the KDD-CUP 99 dataset by combining two naive Bayesian models and decision trees. Other work using the same KDD-CUP 99 dataset has been proposed by O. Deepen et al [8]. They obtained a DR of 99.9% so the FAR is too low in the order of 1.25%. Also, Divyatmika et al [9] achieved a detection rate close to 99% with the Multilayer Perceptron (MLP) classification model algorithm using KDD-CUP 99 dataset. Finally, G-H Zhou [10] proposed a very efficient model (Detection rate = 99.54%) using a KDD-CUP 99 dataset for an IDS.
In parallel to the above, several works have used different ML algorithms for the detection of threats in computer networks. We cite the work of G. Meena et al [11] who recently published a paper using some Machine Learning algorithms associated with the two datasets NSL-KDD and KDD-CUP 99. Also, L. Dhanabal et al [12] published a study evaluating the efficiency of ML algorithms with the NSL-KDD dataset.
The two datasets KDD-CUP 99 and NSL-KDD are used in most of the previous work. They remain ineffective for the detection of recent attacks, even though the detection rate is very high for the different classification algorithms [13]. Therefore, it is necessary to use a newly updated dataset containing the behavior of modern network traffic. In Section 4, we will develop a study of the different existing datasets to show the effectiveness of a modern intrusion detection system.

Intrusion Detection System
An IDS allows us to protect our computer network infrastructure from any malicious activity. These activities are generally targeting the integrity, confidentiality, and availability of data in our network.
We classify IDS according to several criteria such as the type of IDS and the associated classification method. Regarding the type, we have the (N-IDS) and the (H-IDS) ( Figure 1): • N-IDS (Network Intrusion Detection System): It secures the entire network. Its location in the network affects the false alarm rate. For example, if the NIDS is located upstream of a Firewall (Figure 1), then it generates fewer false alarms. Since the traffic is already filtered by the Firewall. Otherwise, if it is placed downstream of a Firewall (location 2). It generates more false alarms.
• H-IDS (Host Intrusion Detection System): It allows the machine to be secured on its own without the use of other systems.
The IDSs are classified according to the detection method as in the work [14]. The first is signature-based-IDS. In this case, the IDS have a signature base compared with the signature of each incoming entity. This allows the IDS to be very favorable for the detection of already known intrusions. However, this first class of IDS only detects modern attacks (0 Day attack) i.e. threats with unknown signatures [15]. The second detection class is anomalybased-IDS as in [16]. In this work, we have used this last class of threat detection. We have focused on new machine learning techniques. Knowing that an IDS builds a user profile based on the normal behavior of the system. It then monitors the network to identify any abnormal activity in the network. The anomaly-based-IDS can detect modern attacks, but it generates a high FAR in contrary to the signature-based-IDS. Figure1: Architecture of an IDS

IDS-Dataset
Since the end of the 1990s, research work has been developing datasets for IDSs. These data sources are collected from network traffic simulators or by real systems, although the latter is very difficult to achieve because it may contain private data. One of the most widely used data sources is DRAPPA98 [17] published publicly in 1998 by MIT. Also, the dataset KDD-CUP 99 [18] published in 1999. This dataset contains the features of four categories of attacks (DOS, Probe, R2L, U2R). The major problem encountered is the considerable amount of redundant data which makes the use of this dataset difficult. We can also mention NSL-KDD dataset [19]. It is an updated version of the KDD-CUP 99. The NSL-KDD dataset is widely used for the network intrusion classification. Finally, we have other but less popular datasets such as CAIDA, DARPA, ISCX dataset, CICIDS.

Selecting-dataset
Previous researches are based on the two datasets KDD-CUP 99 and NSL-KDD for the classification of threats in computer networks. The authors of the works [13,20,21] have shown that these datasets encounter significant problems and their use is unreliable. These datasets suffer from a lack of diversity in traffic volumes. They contain duplicate data which does not cover real and recent attacks. To overcome these dataset issues, we have adopted the UNSWNB-15 [22] data source for training and testing our proposed model. Table 1 presents a comparison between the three datasets KDD-CUP 99, NSL-KDD, and UNSWNB-15.  [23]. The UNSWNB-15 dataset contains 49 features categorized into six data associations (See Table 2 below). The UNSW-NB15 dataset provides two label features: 'Attck_cat' represent the nine attack types (see table 3). It is used for a Multi-Class classification. The feature 'label' is used for binary classification (Normal and Attack). In this work, we choose to use a binary classification based on the second label feature.

DOS
A DOS attack is a malicious activity in a computer network whose purpose is to prevent a network service from working for a given time.

Fuzzers
Is a method that randomly generates data to be injected into programs to detect possible vulnerabilities in a computer system.

Backdoors
A malicious program that provides access to a remote system for an unauthorized user.

Exploit
Consists of exploiting a software or hardware vulnerability in a computer network.

Analysis
This type of threat makes it possible to use vulnerabilities in a computer network to collect data useful for identifying future attacks.

Generic
This category opposes all block ciphers regardless of their structure. It concerns block cipher defined by given key sizes and blocks.

Recognition
This attack consists of discovering a computer system in order to detect vulnerabilities for future attacks.

Shellcode
A block of code to be injected into a computer program by a hacker. This makes it possible to exploit system deficiencies for malicious purposes.

Worms
A worm activates in a computer system by exploiting various vulnerabilities. It then spreads across the network by selecting other targets.

Proposed model
As we have previously shown, a signature-based-IDS is very efficient to detect the known attacks. However, it has a very low detection rate for recent attacks with unknown signatures. Figure  2 shows the Machine Learning steps followed in our model for the detection of attacks in a computer network. We used the SVM model to separate the two classes (Attack and No-Attack). Since in our case the data are non-linearly separable, we used the two functions with Polynomial and Gaussian kernel. These two functions are known by their good performances at the classification level in the SVM.

Data preprocessing
To train and test our model, a pre-processing step is necessary to select useful data ready for classification.
• Data-mapping: the first pre-processing step is data-mapping.
It consists in transforming nominal data into numerical data (see figure 3). For example, the state field contains nominal data (INT, FIN, CON). In this phase, we converted these nominal values into numerical values (1,2,3). (1)

The SVM classifier
In this work, we have two observed classes (Attack and No-Attack). We used the SVM model for binary classification. This model is one of the algorithms of Machine Learning that have provided great performances either in classification or regression. It is a supervised learning model developed by Vapnik. In the case of classification, as in our case, the SVM model consists in finding a better boundary between the two observed classes (attack and No-Attack) (see figure 4). In this case, the data are non-linearly separable. For this purpose, we propose the kernel trick to have an optimal data separation as described in the work [24]. We assign two kernels one Gaussian and the other Polynomial because the data of the UNSWNB-15 dataset are also not linearly separable.

Criteria for Evaluating an IDS
The evaluation of an IDS is dependent on two main criteria: • The Reliability of the IDS: The IDS must produce an alert for each incoming threat. An unreported intrusion is a system failure. • The relevance of the IDS: Each alert must relate to a real threat. If we have a normal input event and it is classified as an intrusion also considered as well as a system failure. Table  4 presents the 4 possible cases. In our case, the confusion matrix gives the number of right and wrong predicted attacks, the same for the normal class (No-Attack).

Evaluation metrics of an IDS
To illustrate the effectiveness of our model, in this work we used six evaluation metrics.
We begin with the TPR (True Positive Rate) or Sensitivity. It describes the proportion of the positive class (Attack) correctly classified. The TPR calculates the quality of the IDS in terms of detection. To calculate the TPR, we use formula (2), note that the TPR must be close to 1 for a good IDS.
On the opposite, the FPR (False Positive Rate) or Specificity. It describes the proportion of the negative class (No-Attack) correctly classified. To calculate the FPR, we use the formula (3), note that the FPR must be close to 0 for a good IDS.
An effective IDS should provide a TPR close to 1 and a FPR close to 0.
The third evaluation metric is accuracy, which provides the true predictions for the entire test base. We calculate the accuracy according to the following formula (4): Accuracy calculation (PR) provides the proportion of true positive predictions in the forecast class (Attack). The accuracy is shown by the following formula: The fifth metric is RECALL (RE). It returns the proportion of correct predictions in the attack class. It is calculated according to the formula (6): The last metric that we have used is F1-Score. This is a calculation of the harmonic average between precision (PR) and recall (RE). It is calculated according to formula (7):

The obtained results
Our model is tested with the UNSWNB-15 dataset. The SVM classifier is used with two kernel functions Polynomial and Gaussian for binary classification. With this model, we obtained good results in terms of DR for the classes normal and malicious.
The figures 5 and 6 show, respectively the confusion matrix for the two kernels SVM-Gaussian and SVM-Polynomial.  To present our obtained results, we have plotted the ROC curves for the two kernels. These curves represent the TPR (Sensitivity) as a function of FPR (Specificity). They show that we have a very high rate of true positives (close to 1) against a low rate of false alarms (close to 0). This validates the performance of our model.   Tables 5 and 6 show the performance of the two kernels SVM-Gaussian and SVM-Polynomial: • With the Gaussian kernel, we obtained a DR of 94% for the normal class and 91% for the malicious class. • For the Polynomial kernel, we found slightly high results compared to the Gaussian function. We obtained a detection rate of 95% for the normal class and 93% for the abnormal class.

Performance evaluation of the proposed model
To evaluate the effectiveness of our model, we compared our obtained results with those of other work using the same UNSWNB-15 dataset ( Table 7). The comparison results of our SVM model with other classification models are good. We obtained for the Polynomial kernel an accuracy equal to 94% as shown in the following table:

Conclusions and perspective
In this work, we have developed a novel method for detecting attacks in computer networks. For this purpose, we presented a binary model for the classification of intrusions. We used the UNSWNB-15 data source. This source is widely used in the field of cybersecurity. We based on the SVM classifier with two kernel functions SVM-Polynomial and SVM-Gaussian. The evaluation results show high performance for both Gaussian (Accuracy = 93%) and Polynomial (Accuracy = 94%) kernels. In perspective, we will opt for a multi-class classification for the detection of each type of intrusion separately.