Malware Classification Based on System Call Sequences Using Deep Learning

Article history: Received: 27 March, 2020 Accepted: 06 June, 2020 Online: 22 July, 2020
Malware has long been a major problem for companies, government agencies, and individuals because attackers still use it as a primary tool to compromise networks, applications, and operating systems for their own benefit. Heuristic and signature-based detection methods still struggle to keep up with the evolution of malware. Machine learning can automate the work needed to detect families of existing and newly discovered malware. Unfortunately, machine learning with a Support Vector Machine (SVM) reaches only a low level of accuracy in detecting malware. In this work, we propose a dynamic analysis method that uses system call sequences to monitor malware behavior. It uses the word2vec technique as word embedding and implements deep learning models, namely Long Short-Term Memory (LSTM) and Nested LSTM, as classifiers. To compare with an existing machine learning approach, we also apply the SVM as a benchmark method. The Nested LSTM achieves an accuracy of 93.11%, while the LSTM achieves the best accuracy of 98.61%. The LSTM also achieves the best average precision (97.57%), average recall (97.29%), and average F1-score (97.43%). We find that our model is lightweight yet detects malware with high accuracy.


Introduction
The widespread use of the internet increases the connectivity of electronic devices, which raises questions about system integrity. Conventionally, software and computer systems are developed for good purposes. However, some software is developed to commit crime (malware). Malware is a common term for programs containing malicious code that can pose significant threats to computer users or any digital device. Malware includes viruses, worms, and Trojan horses, and can also open a back door to leak personal information or take control of a person's system. Serious crimes can be committed through malware, which is why malware detection is needed [1]. To detect malware, a definition must be created, and creating it requires malware analysis. Malware analysis consists of analyzing various aspects of malware so that it can be detected [2]. The definition of a piece of malware is also known as its signature. Virus scanners, known as anti-virus software, use this signature to detect malware. This research experiments on seven types of malware: adware, backdoor, packed, riskware, trojan, virus, and worm.
Traditional malware detection is performed on unprocessed suspicious files. This is mostly done with signature, heuristic, and behavioral approaches. The signature approach looks for static patterns of known malware in suspicious files [3]. Research has shown that the signature approach is very weak against polymorphic and metamorphic malware. The heuristic approach checks suspicious files for characteristics of malware. Although it can detect unknown malware, it has a very high false-positive rate.
The behavioral approach monitors program execution for suspicious behavior. Although this approach can detect different malware variants, it also has a high false-positive rate [4]. To help malware analysts retrieve useful information from large sets of malware samples, automatic classification of malware variants is needed. Signature-based malware detection cannot handle these variants because it does not take polymorphic malware into account. Polymorphic malware frequently changes its identifiable features to evade detection, so a signature-based system can be easily evaded.
The most important event that can be tracked to determine malware behavior is the system call. Before malware performs a malicious action, it needs to use the services of the target operating system (OS). Every activity, such as opening a file, running a thread, writing a command to the administrator, or opening a network connection, requires interaction with the operating system. This interaction is carried out through the system call API of the target OS. Therefore, to monitor malware behavior it is very important to monitor the order of system calls during malware execution. Different malware families have different goals.
Once detected, malware is easily handled, mainly by elimination. However, current malware is polymorphic and metamorphic, which makes it difficult to detect in traditional ways: it disguises its structure but not its operations. Because all malware must be executed to carry out its malicious actions, some studies [5] analyze API calls during execution to detect malware with high accuracy. However, this detection ends with a binary decision, malware or not malware [6]. It does not classify malware by type (viruses, worms, Trojans, etc.). Classification is important because it helps simplify the course of action needed to neutralize the malware.
Research on malware classification has been done before. However, these studies do not use the word2vec method. One example is a 2019 study of classification on system call sequences, which covered nine malware families, including kelihos_v3, vundo, ramnit, lollipop, simda, tracur, obfuscator.ACY, and gatak. The methods used were text and hex commands with an LSTM [7]. In addition, research in 2016 evaluated machine learning methods such as the Hidden Markov Model [8] and SVM [9] for malware classification.
We use a word embedding technique to convert malware system call sequences into vectors, which better captures the relationship between n-grams in the system call sequence, and then pass the vectors to an LSTM for classification. This approach is expected to improve accuracy and precision for most malware families over the methods used by previous researchers, and thus to help classify malware more accurately.
The paper is organized as follows. Section 2 discusses related work on the detection and classification of malware. Section 3 provides the background theory, word embedding, and deep learning methods. Section 4 discusses the dataset, the methodology, and the evaluation design. Section 5 discusses the experimental results, including training and testing results. The conclusion and future work are given in Section 6.

Related Works
Previous researchers have shown that program behavior features such as API calls can be used to detect malware, including metamorphic and polymorphic malware, with high accuracy. At a higher level, malware disguises itself by changing its behavior or continuously changing its signature. However, to cause damage it must execute, and its execution behavior is more difficult to change; changing it can render the malware harmless. Therefore, this approach targets malware at the execution level.
The first study used a deep learning-based malware detection (DLMD) approach that relies on static methods to predict executable behavior from system call sequences taken from running processes. Using SVM and CNN, the results show that this method is quite effective in detecting polymorphic and metamorphic malware, with accuracy and detection rates of 89% to 96% [10]. In the proposed DLMD technique, an SVM is used as a feature selector and a CNN autoencoder as a feature extractor, after which a multilayer perceptron is used as the classifier.
Other researchers converted bytecode into 3-channel RGB images and applied an 18-layer deep residual network to classify malware. To convert malware to images, they first convert the malware binaries to 8-bit vectors (bytecodes) [11]. The bytecodes are then converted into grayscale images, with each 8-bit vector becoming a pixel with a value from 0 to 255. In the next step, they convert the grayscale images to 3-channel RGB images by duplicating the grayscale channel three times and stacking the three channels. Their experimental results show that the residual network model achieves an average accuracy of 86.54% with 5-fold cross-validation.
In [12], the authors proposed a malware detection method based on Deep Graph Convolutional Neural Networks (DGCNNs) that learns directly from API call sequences and their associated behavior graphs. The experimental results show that the model reaches an area under the ROC curve (AUC-ROC) and F1-score similar to those of Long Short-Term Memory (LSTM) networks, up to 96%.
In [13], the authors proposed a method for detecting packed malware variants based on sensitive system calls and a Deep Belief Network. Different experimental groups and data samples were used for analysis, and 10-fold cross-validation was used for classification. Theoretical analysis and experimental results show that the proposed method detects packed malware with an accuracy of 92% and a detection time of less than 0.001 seconds.
In [14], the authors observed that conventional deep learning approaches based on Recurrent Neural Networks (RNNs) are vulnerable to redundant API injection, and investigated the effectiveness of Convolutional Neural Networks (CNNs) against it. Their malware detection system converts malware files into image representations and classifies them with a CNN. The CNN is implemented with spatial pyramid pooling (SPP) layers to handle varying input sizes. They also evaluated the effectiveness of SPP and of the image color space (greyscale / RGB) by measuring system performance on unaltered data and on adversarial data with injected redundant API calls. The results show that a naive SPP implementation is impractical due to memory constraints and that greyscale imaging is effective against redundant API injection.
The last study proposed a deep learning architecture based on the stacked AutoEncoders (SAEs) model for intelligent malware detection. The SAEs model performs greedy layer-wise training for unsupervised feature learning, followed by supervised fine-tuning of the parameters (e.g., weights and offset vectors). Based on different feature representations, various classification methods, such as Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Naïve Bayes (NB), and Decision Trees (DT), have been used to construct malware detection models. Most of these methods are built on shallow learning architectures. Although they have had some success, shallow learning architectures remain unsatisfactory for malware detection problems [15]. The experimental results show that the proposed method achieves 96% accuracy. The comparison of previous works is summarized in Table 1.
Based on the literature review, previous studies have tried to classify malware based on system call sequence data, but the methods used did not achieve high classification accuracy. In other fields, many deep learning methods have proven to be more accurate; therefore, we use deep learning with word2vec as a word embedding to improve accuracy. Because we use deep learning models, we do not perform dedicated feature extraction as in the studies above. Instead, we use a word embedding, which converts the input text into numeric data as input to the LSTM model. We expect this to increase classification accuracy.

Word2Vec
Word2vec is a two-layer neural network that processes text by converting words into vectors, a step also called "vectorization." The input of word2vec is a text corpus, and the output is a set of feature vectors representing the words in that corpus. Word2vec is not a deep neural network; it converts text into a numerical form that deep neural networks can process. Word2vec is a popular word embedding technique developed at Google [16].
Word2vec can also be applied to code, likes, playlists, social media graphs, sentiment sentences, and other verbal or symbolic series in which patterns can be found. The purpose of word vectorization is to group the vectors of similar words together in vector space, so that similarities can be detected mathematically. Word2vec works by creating a distributed numerical vector representation of a word's features, for example the context of an individual word.
Word2vec works automatically. With enough data, usage, and context, word2vec can make very accurate guesses about the meaning of a word based on its previous appearances. These guesses are used to establish the association of a word with other words (e.g., "man" is to "boy" as "woman" is to "girl"), or to cluster documents and classify them by topic. Such clusters can form the basis of sentiment analysis, e-commerce, search, malware analysis, and recommendations in areas such as scientific research and legal discovery. The output of word2vec is a vocabulary in which each item has a vector, which can be fed into further processing such as machine learning or deep learning, or simply used to detect relationships between words.
Figure 1 shows the word2vec Continuous Bag-of-Words (CBOW) model. CBOW takes the context of each word as input and tries to predict the word that fits that context, that is, it predicts the current target word (the center word) from the source context words (the surrounding words) [17]. For a simple sentence such as "the black cat jump over the very big goat" and a context window of size 2, the (context_window, target_word) pairs include ([the, cat], black), ([cat, over], jump), ([very, goat], big), and so on. The model tries to predict target_word from the context_window words.
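The (context_window, target_word) pairs in the example sentence can be generated with a few lines of Python. This is only an illustrative sketch: the function name cbow_pairs and the parameter half_window are hypothetical, and a context window of size 2 is assumed to mean one word on each side of the target, as in the pairs listed above.

```python
def cbow_pairs(tokens, half_window=1):
    """Return (context_words, target_word) pairs for a CBOW-style model.

    half_window is the number of words taken on each side of the target,
    so half_window=1 corresponds to a total context size of 2 (assumption).
    """
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
        pairs.append((context, tokens[i]))
    return pairs


sentence = "the black cat jump over the very big goat".split()
pairs = cbow_pairs(sentence)
for context, target in pairs:
    print(context, "->", target)
```

In a real CBOW model, each such pair becomes one training example: the context words are the network input and the target word is the label to predict.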

Long Short-Term Memory (LSTM)
LSTM was first introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [18]. LSTM is a type of recurrent unit that has been shown to increase the capability of RNNs. LSTM mitigates the vanishing and exploding gradient problems and is better at capturing long-range dependencies in data [19]. LSTM includes a forget gate inside the LSTM unit, which lets the unit focus on the critical parts of the information and discard the parts that are not useful.
Figure 2 shows the structure of the LSTM. The key to the LSTM architecture is its cell state. The cell state can be interpreted as the memory of the network, and information is removed from or added to it by structures called gates. Each time step t of the LSTM can be described by the following equations [20]:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$
$$h_t = o_t \odot \tanh(C_t) \tag{6}$$
where $f_t$ is the forget gate, $i_t$ is the input gate, $o_t$ is the output gate, $C_t$ is the memory cell, $h_t$ is the hidden layer, $x_t$ is the input at time t, $\sigma$ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, $W_f$, $W_i$, $W_C$, $W_o$ are the weight matrices applied to the input, and $b_f$, $b_i$, $b_C$, $b_o$ are the bias vectors.
In the first step, the forget gate of Equation (1) decides which information is discarded from the cell state. The input of this step is the output of the previous step, $h_{t-1}$, and the input $x_t$. The sigmoid activation gives a result between "0" and "1", where "0" means "let nothing pass" and "1" means "remember everything". The next step determines what information will be added to the cell state, as shown in Figure 3 (B) and Equations (2) and (3). At this stage, the input is again $h_{t-1}$ and $x_t$. The first layer, the sigmoid layer, determines which parts are to be updated, and the tanh layer creates a new candidate value $\tilde{C}_t$. In the next step, the two layers are combined to update the cell state.
In step C, the old cell state $C_{t-1}$ is multiplied by $f_t$ so that it forgets what is no longer needed, and the new candidate information, scaled by the input gate, is then added to the cell state. This step is shown in Figure 3 (C) and Equation (4). In the final step, the output $h_t$ is computed, as shown in Figure 3 (D) and Equations (5) and (6). The output is based on the cell state, but in a filtered form. First, a sigmoid layer is applied to the previous output $h_{t-1}$ and the input $x_t$ to determine the output gate value, which lies between "0" and "1" and indicates which part of the cell state is output. Then the cell state is passed through the tanh function to obtain a value between "-1" and "1". This value is multiplied by the output gate value to give $h_t$, which is used in the next step of the model.
Figure 4 shows the architecture of a Nested LSTM [21]. Nested LSTM is a simple extension of the LSTM model that adds depth through nesting. Inside a Nested LSTM are memory cells that form an internal memory, which can only be accessed through the external memory cells, thereby creating a temporal hierarchy. The output gate in LSTM encodes the intuition that memories that are irrelevant at the current time step may still be worth remembering. Nested LSTM uses this intuition to create a temporal hierarchy of memories. In Nested LSTM, access to the inner memories is gated in the same way, so that long-term information that is only situationally relevant can be selectively accessed.
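The equations of the Nested LSTM can be described as follows:
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$\tilde{h}_{t-1} = f_t \odot c_{t-1}$$
$$\tilde{x}_t = i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$\tilde{f}_t = \sigma(\tilde{W}_{xf} \tilde{x}_t + \tilde{W}_{hf} \tilde{h}_{t-1} + \tilde{b}_f)$$
$$\tilde{i}_t = \sigma(\tilde{W}_{xi} \tilde{x}_t + \tilde{W}_{hi} \tilde{h}_{t-1} + \tilde{b}_i)$$
$$\tilde{o}_t = \sigma(\tilde{W}_{xo} \tilde{x}_t + \tilde{W}_{ho} \tilde{h}_{t-1} + \tilde{b}_o)$$
$$\tilde{c}_t = \tilde{f}_t \odot \tilde{c}_{t-1} + \tilde{i}_t \odot \tanh(\tilde{W}_{xc} \tilde{x}_t + \tilde{W}_{hc} \tilde{h}_{t-1} + \tilde{b}_c)$$
$$\tilde{h}_t = \tilde{o}_t \odot \tanh(\tilde{c}_t)$$
$$c_t = \tilde{h}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $f_t$ is the forget gate, $\tilde{f}_t$ is the inner forget gate, $i_t$ is the input gate, $\tilde{i}_t$ is the inner input gate, $o_t$ is the output gate, $\tilde{o}_t$ is the inner output gate, $c_t$ is the memory cell, $\tilde{c}_t$ is the inner memory cell, $h_t$ is the hidden layer, $\tilde{h}_t$ is the inner hidden layer, $x_t$ is the input at time t, $\tilde{x}_t$ is the inner input at time t, $\sigma$ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, $W_{xf}, W_{hf}, W_{xi}, W_{hi}, W_{xo}, W_{ho}, W_{xc}, W_{hc}$ are the weight matrices, $\tilde{W}_{xf}, \tilde{W}_{hf}, \tilde{W}_{xi}, \tilde{W}_{hi}, \tilde{W}_{xo}, \tilde{W}_{ho}, \tilde{W}_{xc}, \tilde{W}_{hc}$ are the inner weight matrices, $b_f, b_i, b_o, b_c$ are the bias vectors, and $\tilde{b}_f, \tilde{b}_i, \tilde{b}_o, \tilde{b}_c$ are the inner bias vectors.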
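As an illustration of the LSTM time step described in this section, the following minimal Python sketch computes the forget gate, input gate, candidate value, cell update, output gate, and hidden output in the order of Equations (1) to (6). It assumes the standard LSTM formulation in which each weight matrix acts on the concatenation of the previous hidden state and the current input. The function names, the parameter layout, and the toy values are hypothetical and are not taken from our implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each of "f", "i", "c", "o" to a (W, b) pair, where W acts on
    the concatenation [h_prev, x_t] (assumed parameter layout).
    """
    v = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(*params["f"], v)]      # (1) forget gate
    i_t = [sigmoid(z) for z in affine(*params["i"], v)]      # (2) input gate
    c_hat = [math.tanh(z) for z in affine(*params["c"], v)]  # (3) candidate value
    c_t = [f * c + i * g                                     # (4) cell state update
           for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(z) for z in affine(*params["o"], v)]      # (5) output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]       # (6) hidden output
    return h_t, c_t


# Toy example: hidden size 2, input size 1, all weights and biases zero,
# so every gate evaluates to sigmoid(0) = 0.5 and the candidate to tanh(0) = 0.
zero_W = [[0.0] * 3 for _ in range(2)]
zero_b = [0.0, 0.0]
params = {name: (zero_W, zero_b) for name in "fico"}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
print(h_t, c_t)
```

With all-zero parameters the cell state is simply halved by the forget gate, which makes the effect of Equation (4) easy to check by hand.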

Support Vector Machine (SVM)
SVM is a machine learning method commonly used for classification or regression, and it is a type of supervised learning. The main purpose of SVM is to separate data with decision boundaries, extended to non-linear boundaries using the kernel trick [22]. SVM is used in many applications such as sentiment analysis, text and document categorization, pattern recognition, face recognition, handwriting analysis, and binary classification. The idea behind SVM is to divide the data in the best possible way. Binary classification applies when the data must be separated into two classes. For multi-class classification, the most common method is to build one-versus-rest (also called one-versus-all, OVA) classifiers, which split the problem into binary problems: each category in turn is separated from all other categories combined, and the class whose classifier separates the sample with the largest margin is chosen. Each classifier is trained on all the training data, taking the patterns of one class as positive and all other examples as negative. The Support Vector Machine has three main parameters, namely C, gamma, and the kernel. We always use the Radial Basis Function (RBF) kernel because of its best performance [23], while C and gamma are hyperparameters whose values lead to different accuracy and results.
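The one-versus-rest scheme can be illustrated with a small Python sketch. It shows only the two bookkeeping steps, relabeling the data for each binary problem and choosing the class with the largest decision score; training the binary SVMs themselves is omitted. The function names and the example scores are hypothetical.

```python
def one_vs_rest_targets(labels, positive_class):
    """Relabel a multi-class problem as a binary one: +1 for the chosen class, -1 for the rest."""
    return [1 if label == positive_class else -1 for label in labels]


def predict_one_vs_rest(scores_by_class):
    """Pick the class whose binary classifier reports the largest decision score."""
    return max(scores_by_class, key=scores_by_class.get)


labels = ["virus", "worm", "virus", "trojan"]
virus_targets = one_vs_rest_targets(labels, "virus")

# Made-up decision scores for one sample, one score per binary classifier.
example_scores = {"virus": 0.3, "worm": 1.2, "trojan": -0.5}
predicted = predict_one_vs_rest(example_scores)
print(virus_targets, predicted)
```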

Dataset Generation
We collect malware samples and track their behavior using the Cuckoo malware analysis system [24]. The malware collection consists of samples from two primary sources: Virus Share [25] and GitHub / TheZoo [26]. We chose these sources because they provide a large and varied set of Portable Executable (PE) files for evaluation. Because malware authors can use code obfuscation and packers to subvert static analysis, we use dynamic malware analysis to collect data about malware behavior. Several tools allow tracking malware execution and gathering logs of the execution sequence [27]. We use Cuckoo Sandbox, which is open source and provides a controlled environment for executing malware. The experiments use a dataset of 13356 samples, divided into three groups: training, validation, and testing.
Table 2 shows the distribution of the training and testing data used in this research. The model is trained on 8012 samples and tested on 2672 samples. Experiments are conducted on both models. The predictions on the testing data form the experimental results, which are described through a confusion matrix from which the accuracy of each model is obtained.
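A split with these sizes can be sketched as follows. The sketch assumes that the validation group is the remainder of the 13356 samples after the 8012 training and 2672 testing samples are taken, which gives 2672 validation samples; shuffling with a fixed seed is also an assumption, as is the function name split_indices.

```python
import random


def split_indices(n_samples, n_train, n_test, seed=0):
    """Shuffle sample indices and split them into training, validation, and testing groups.

    The validation group is whatever remains after the training and testing
    groups are taken (assumption).
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, validation, test


train_idx, val_idx, test_idx = split_indices(13356, 8012, 2672)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 8012 2672 2672
```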

Word Embedding
We extracted features from the PE files by preprocessing the PE headers and the opcodes from the code section. To use this data in the classification process, we need to turn it into numerical vectors with word embedding. The PE files are run in the Cuckoo sandbox, a malware analysis tool that can extract API calls from PE files during execution. The sandbox is configured on Ubuntu 18.04.2 LTS with a Windows 7 virtual environment, running in Oracle VirtualBox, in which the PE files are executed. The virtual environment allows malicious files to execute and behave as they would on a conventional system [6]. This is very helpful in understanding malware behavior when it tries to infect a system.
During PE file execution, the Cuckoo sandbox generates log files. Each log file contains snapshots taken during execution (the behavior profile) [28]. This is done for every sample that is executed. Each sequence of API calls is recorded with the class label specified by Kaspersky [29] and VirusTotal [30]. We defined seven classes of malware (Adware, Backdoor, Packed, Riskware, Trojan, Virus, and Worm). The collected API call logs are always long and continuous, so we apply text mining with the word2vec technique to select the API calls that are relevant for classification. Word2vec helps identify the set of API calls that are more common in a malware class. If an API call word appears often in one class, it scores highly; if it also appears in many other classes, it is not a unique identifier and must be given a lower score. Only API call words with high scores, or frequently appearing words, are considered part of the behavior profile of a PE file.
Word2vec has two techniques, namely Skip-gram and Continuous Bag of Words (CBOW). The CBOW method takes the context of each word in the sentence or paragraph as input and tries to predict the word that fits the context. In contrast, the skip-gram model predicts the context words from a target word. We use CBOW in this research. First, we map the seven labels to a one-hot encoding. Then, we convert each sentence to lower case and remove the punctuation. The next step is to create a word2vec embedding model to convert words to vectors with the specified model size. After that, we create a word bag with the word counts for the various types of malware, which helps determine how relevant a word is to a specific class, or how often the word appears in the word bag. The code snippet for the word bag is shown in Figure 5. Each word is converted to a vector using the word2vec embedding model that was created, as shown in Figure 6. After obtaining the vector for a word, its mean is taken and multiplied by the frequency of the word in the class and label. The entire preprocessing procedure is summarized below:
• Take a sentence and iterate over its words.
• Convert each word to a numeric / vector representation.
• Take the mean of the vector, multiply it by the number of classes, and add the result as a feature.
• Pad the sentence to the fixed length of 128, then move to the next sentence.
After this process, illustrated in Figure 7, a fixed-length vector of 128 features is obtained for each sentence. If a sentence has more than 128 words, it is truncated, and if it has fewer than 128 words, "0" padding is added so that every sentence has the same length.
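The text-cleaning, label-encoding, and padding steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code of Figures 5 and 6: the helper names are hypothetical, the sample sentence is made up, and a simple integer word id stands in for the word2vec-based feature value of each word.

```python
import string

LABELS = ["adware", "backdoor", "packed", "riskware", "trojan", "virus", "worm"]
MAX_LEN = 128


def one_hot(label):
    """Encode one of the seven class labels as a one-hot vector."""
    vec = [0] * len(LABELS)
    vec[LABELS.index(label.lower())] = 1
    return vec


def clean(sentence):
    """Lowercase the sentence, remove punctuation, and split it into words."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def encode(tokens, vocabulary):
    """Map words to integer ids (stand-in for the word2vec feature); id 0 is reserved for padding."""
    return [vocabulary.setdefault(tok, len(vocabulary) + 1) for tok in tokens]


def pad_or_truncate(values, length=MAX_LEN, pad_value=0):
    """Truncate to `length` items, or right-pad with `pad_value` up to `length`."""
    return values[:length] + [pad_value] * max(0, length - len(values))


vocabulary = {}
sample = "NtOpenFile, NtReadFile; NtClose."  # made-up API call sentence
features = pad_or_truncate(encode(clean(sample), vocabulary))
print(len(features), features[:5], one_hot("Trojan"))
```

Right-padding is chosen here for simplicity; whether the padding is added before or after the words is not specified in the procedure above.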
After feature creation, a data mining classification approach is applied; we use Long Short-Term Memory (LSTM). Based on the types of API calls chosen to describe a particular class of malware, the classifier determines whether a file is malicious by determining the class to which the malware belongs, so the process ends with the accuracy of assigning each file to its class after behavioral detection. All PE files interact directly with the operating system (OS) through the system call API. This means that API calls readily reveal malware behavior when the malware attempts to execute.

Deep Learning Model
Deep learning is an area of artificial neural networks for dealing with problems on larger datasets. Deep learning provides a very compelling architecture for supervised learning. By adding more layers, a deep learning model can better represent labeled malware data. Implementing deep learning techniques for malware classification requires a program that performs the computation, and therefore an algorithm that supports the development of that program. The algorithm used in this study is divided into three main parts: the training algorithm, the testing algorithm, and the classification algorithm. These three algorithms follow the concept of writing code with an API and the basic theory of machine learning for feature learning. The training algorithm has five main stages: Data Augmentation, Load Training Data, Modeling Long Short-Term Memory (LSTM), Training Model, and Final Weight Storage.
The LSTM model has several layers: the embedding layer, the LSTM layer, and the output layer. The input of the model is the preprocessed text transformed into numeric form with an input length of 128, where each number or vector represents a word. At the embedding layer, the input is transformed into vectors of length 128. The LSTM layer, which consists of 3 gates, then processes each input vector to produce 128 output values, each of which is connected to the output layer. The output layer has seven neurons, each with softmax activation, producing a value for each class. The predicted class is the one with the highest output value.
The Nested LSTM model also consists of several layers: the embedding layer, the Nested LSTM layer, and the output layer. As with the LSTM model, the input of the Nested LSTM model is the preprocessed text converted to numeric form, where each number represents a word, although the input length is different. At the embedding layer, the input is transformed into vectors of length 128. The Nested LSTM cell, consisting of 3 gates (depth = 2), then processes each input vector to produce 128 output values, each of which is connected to the output layer. The output layer has the same number of neurons as in the LSTM model, each with softmax activation, which produces a value for each class.
The Support Vector Machine model has three main parameters, namely C, gamma, and the kernel. We always use the Radial Basis Function (RBF) kernel because of its best performance. C and gamma are hyperparameters whose values lead to different accuracy and results, so we need to find the best C and gamma values. For this we use grid search (GridSearch): we form all possible combinations of the candidate C and gamma values and choose the combination that gives the best result.
Scikit-learn (sklearn) has a grid search cross-validation (CV) function that takes the SVM model, the grid of C and gamma values, and the number of folds. The number of folds is the number of parts into which the data is divided. In this case it is three, so the model is trained on two folds and tested on the remaining one.
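To make the procedure concrete, the following plain-Python sketch reproduces the logic of a grid search with three folds: every (C, gamma) combination is scored on each fold and the combination with the best mean score is kept. It is a conceptual illustration and not the scikit-learn implementation. The evaluate callback, which would train an RBF SVM on the training folds and return its accuracy on the held-out fold, is replaced here by a made-up toy scoring function, and the candidate values are hypothetical.

```python
from itertools import product


def k_fold_indices(n_samples, n_folds=3):
    """Yield (train_indices, test_indices) pairs, holding out one fold at a time."""
    folds = [list(range(start, n_samples, n_folds)) for start in range(n_folds)]
    for held_out in range(n_folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def grid_search(evaluate, Cs, gammas, n_samples, n_folds=3):
    """Return the (C, gamma) pair with the best mean score over all folds."""
    best_params, best_score = None, float("-inf")
    for C, gamma in product(Cs, gammas):
        scores = [evaluate(C, gamma, train, test)
                  for train, test in k_fold_indices(n_samples, n_folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = (C, gamma), mean_score
    return best_params, best_score


def toy_evaluate(C, gamma, train, test):
    """Hypothetical stand-in for 'train an RBF SVM and return its accuracy'."""
    return -abs(C - 10) - abs(gamma - 0.1)


best_params, best_score = grid_search(
    toy_evaluate, Cs=[1, 10, 100], gammas=[0.01, 0.1, 1], n_samples=9)
print(best_params, best_score)
```

With three candidate values for each hyperparameter and three folds, the search performs 3 x 3 x 3 = 27 evaluations, which shows why the grid should be kept small for expensive models.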

Evaluation Design
In this research, the dataset contains 13356 samples, divided into three groups: training, validation, and testing. The data must be converted to numerical values before going into the deep learning model. The first step is to convert the labels to one-hot encoding. After that, the sentences are converted to lowercase and the punctuation is removed, and a word2vec model is created from the cleaned text using CBOW. The deep learning models are trained on 8012 samples and tested on 2672 samples. The LSTM model uses the Adam optimizer with a batch size of 64 and 30 epochs, because after several tests we found that this combination gives the best accuracy, together with a dense layer of 512 units, a dropout rate of 20% to prevent overfitting, and softmax for classification. The Nested LSTM model uses the Adam optimizer with a batch size of 64 and 50 epochs, together with dense layers of 1024, 2048, and 7 units, recurrent dropout to prevent overfitting, and softmax for classification. The Support Vector Machine uses the RBF kernel and grid search cross-validation for hyperparameter tuning to find the best values of C and gamma for training the model. Performance is measured by accuracy, recall, precision, and F1-score, and the evaluation results are presented in the form of a confusion matrix, which contains the actual and the predicted classifications [31]. All methods were implemented in Python 3.6 with Jupyter Notebook 6.0.3 and TensorFlow 1.15.0, on an Intel Core i5-6400 (with 16 GB RAM) and an Nvidia GeForce GTX 1050 Ti GPU.
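The four performance measures can be derived directly from a confusion matrix, as the following Python sketch shows. It assumes that rows hold the actual classes and columns the predicted classes, and that the reported averages are unweighted (macro) averages over the classes; both are assumptions, and the small 3-class matrix is made-up example data, not a result of this study.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class.

    cm[i][j] is the number of samples of actual class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and f1 over all classes."""
    n = len(metrics)
    return tuple(sum(m[j] for m in metrics) / n for j in range(3))


toy_cm = [[8, 1, 1],
          [2, 7, 1],
          [0, 0, 10]]
toy_metrics = per_class_metrics(toy_cm)
print(accuracy(toy_cm), toy_metrics, macro_average(toy_metrics))
```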

Training Result
After the two models are built, they are trained on the training portion of the 13356-sample dataset. After several trials, we decided to use 30 epochs with a batch size of 64 for the LSTM model and 50 epochs with a batch size of 64 for the Nested LSTM model, because these settings produce high training accuracy.

Testing Result of LSTM Model
After the LSTM and Nested LSTM models are created and trained, the SVM is built and tested only as a benchmark for comparison with the two models above. All three models were tested on 2672 test samples.
Figure 10 shows the test results of the LSTM model as a confusion matrix. For every malware label there are few misclassifications, because the loss on the testing data is only 0.18%. The largest misclassification is for the virus label, where 30 samples are classified as backdoor. The remaining labels show outstanding results, with fewer than 10 misclassifications each. The testing accuracy obtained from the LSTM model is thus 98.61%. Table 3 shows the precision, recall, and F1-score of the LSTM model for each malware class. The LSTM method obtained an average precision of 97.57%, an average recall of 97.29%, and an average F1-score of 97.43%.
Figure 11 shows the test results of the Nested LSTM model as a confusion matrix. For every malware label there are few misclassifications, because the loss on the testing data is only 0.18%. The largest misclassification is for the backdoor label, where 66 samples are classified as virus. The remaining labels show outstanding results, with fewer than 20 misclassifications each. The testing accuracy obtained is 93.11%. Table 4 shows the precision, recall, and F1-score of the model for each class. The Nested LSTM method obtained an average precision of 84.43%, an average recall of 85.14%, and an average F1-score of 86.00%.

Testing Result of Support Vector Machine Model
We also built a Support Vector Machine (SVM) as a comparison or benchmark, to see which model performs better using the same word embedding method.
Figure 12: SVM confusion matrix
Figure 12 shows the test results of the SVM model as a confusion matrix. Every malware label has many misclassifications. For example, the Virus label has up to 91 misclassified samples, and almost every other label has more than 30, resulting in a low level of accuracy. This is why we propose deep learning methods: this type of data is not well suited to SVM. The testing accuracy obtained is only 84.50%. Table 5 shows the precision, recall, and F1-score of the SVM model for each class. The SVM method obtained an average precision of 73.86%, an average recall of 95.14%, and an average F1-score of 80.43%.

Summary of Testing Results
This section presents the results of all methods, comparing our proposed methods with the existing SVM method. Our proposed methods outperform it. Table 6 shows the overall comparison of all methods. We also plot early stopping during the LSTM and Nested LSTM training processes so that we can keep the model with the best accuracy during training. Overall, the table shows that the LSTM model produces the best accuracy of the three methods, 98.61%, although its accuracy does not differ greatly from that of the Nested LSTM (93.11%). Both the LSTM and Nested LSTM methods outperform the SVM method (84.50%), which shows that the deep learning methods are more accurate than the ordinary machine learning method.

Conclusion and Future Work
In this paper, we investigate the effectiveness of transforming malware system call sequences into vectors with word2vec as the word embedding and then feeding them into a recurrent LSTM layer for classification with a non-linear activation function such as softmax. We also carried out various experiments with different parameters and network structures, and added early-stopping plots to obtain the best model accuracy during training. The model design is also evaluated against other methods, namely Nested LSTM and SVM, as benchmarks. Of the three models, the LSTM method achieves the highest accuracy, reaching 98.61% on a real-world dataset. Overall, LSTM proves to be an effective approach for learning long-range dependencies in cybersecurity tasks and a more appropriate method for detecting malware through system call sequences.
The experimental results show that the LSTM network is straightforward to apply. However, we have not yet tried more complex LSTM networks, such as networks with many different layers or more autoencoders, nor word embedding techniques other than word2vec, because such network architectures are more costly. More complex preprocessing, a more elaborate network architecture, and a cleaner dataset would probably improve the results.