Malware Classification Based on the Behavior Analysis and Back Propagation Neural Network

With the development of the Internet, malware has spread rapidly across networked systems. To cope with the diversity and sheer number of variants, a number of automated behavior analysis tools have emerged. Although these tools produce detailed behavior reports for malware, an analyst must still assign each sample a category and judge its severity manually. In this paper, we propose an automated malware classification approach based on behavior analysis. We first perform dynamic analysis to obtain detailed behavior profiles of the malware, which are then abstracted into the main features of each sample and serve as the inputs to a Back Propagation (BP) Neural Network model. The experimental results demonstrate that our classification technique is able to classify malware variants effectively and detect malware accurately.


Introduction
Malicious software (malware), usually in the form of viruses, Trojans, worms, botnets, rootkits, and other potentially unwanted applications, has become a major threat to Internet security. Malware developers use hiding techniques such as polymorphism and obfuscation [1] to defeat signature-based detection and static malware analysis methods easily and effectively [2]. In contrast to static analysis, dynamic analysis monitors the behavior of malware at run time, which makes it more difficult for the malware to conceal itself [3,4], and it has become the mainstream method for mining malicious behavior.
Yet dynamic analysis techniques that merely produce detailed behavior profiles are not sufficient for malware detection [5]. What we need is the ability to automatically categorize and detect malware by its behavior.
The main contributions of this paper are as follows:
1) Unlike many previous algorithms that monitor malware behavior directly on low-level data, for example through API call monitoring [6], we implement an automatic dynamic analysis framework that takes advantage of existing behavior analysis systems. We obtain the detailed behaviors of the malware, including process behaviors, registry behaviors, file behaviors, network behaviors, and other behaviors.
2) We extract the major features of the malware behavior profiles into behavior vectors by counting the occurrences of every behavior.
3) We propose a Back Propagation (BP) Neural Network model [7] that learns the behavior patterns shared by malware of the same category and classifies malware accordingly. The experimental section verifies the correctness and precision of our algorithm.

Related Work
There are three major methods for classifying malicious software: traditional pattern matching, static analysis, and dynamic analysis. Although static analysis can achieve higher accuracy than traditional pattern matching, it has difficulty handling obfuscated and self-modifying code.
In dynamic analysis, Konrad Rieck et al. [8] proposed a method that uses CWSandbox to analyze malware behavior and then uses Support Vector Machines (SVM) for learning and classification. Forrest [6] proposed a recognition model based on fixed-length N-gram sequences of system calls. Syed Zainudeen Mohd Shaid [9] proposed a behavior-based technique that visualizes malware behavior in the form of images. This method uses different colors to indicate different API calls. Using the behavior images, it is possible to visually identify malware variants of the same family. Guanghui Liang et al. [10] captured malware behaviors on the Temu platform and proposed a weighted Jaccard similarity matching algorithm to classify malware variants.
In summary, when dealing with the classification of malware variants, behavior analysis is the most effective method. In this paper, we use an existing, effective behavior analysis system to analyze the malware, in contrast to the approaches above that only monitor API calls or API traces.

Methodology
The classification of malware variants has long been a concern of analysts [11][12][13][14]. Evolving malware generates many variants and poses great challenges to analysis. Although these variants differ in file format and appearance, they still share the same behavior patterns. For example, all variants of the Allaple worm acquire and lock particular mutexes on infected systems [8]. We aim to exploit these behavior patterns using machine learning techniques and propose a method that classifies malware variants automatically based on their behavior. An outline of our approach is given by the following basic steps:
1) Malware Data Acquisition. A corpus of malware binaries is obtained by collecting suspicious files uploaded to the Kafan Forum. The multi-engine online virus scan system VirSCAN is applied to identify known malware instances.
2) Behavior Monitoring. Malware binaries are executed and monitored by the HABO behavior analysis system, which generates detailed behavior reports.
3) Feature Extraction. Features that reflect the behavior patterns, such as processes created, foreign memory regions read, mutexes created, or registry keys modified, are extracted from the analysis reports and used to map the malware behavior into a high-dimensional vector space.
4) Learning and Classification. A Back Propagation neural network model is trained to classify the malware.

Malware data acquisition
We obtained 13600 unique samples, uploaded by users of the Kafan Forum, for learning and subsequent classification. After obtaining the samples, we applied the online virus scan system VirSCAN to partition the malware into common families, such as Adware, Potentially Unwanted Application (PUA), and Trojan/Downloader. Note that we chose VirSCAN instead of a single anti-virus product, such as Avira or Kaspersky, to label the malware because VirSCAN is multi-engine, so we can label each sample with the result reported by the majority of engines. We selected the 9 most common malware categories and one Non-Malware category from our samples. These families, listed in Table 1, represent a broad range of malware categories such as Adware, PUA, and Trojans, and the Non-Malware category can be extended directly to malware detection in the future.
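The majority-based labeling described above can be sketched as a simple vote over per-engine results. This is only an illustration: the function name `majority_label`, the input format (one label string per engine, `None` when an engine reports nothing), and the tie-breaking behavior are assumptions, not details taken from VirSCAN.

```python
from collections import Counter

def majority_label(engine_labels):
    """Return the label reported by the most engines, or None if no engine flagged the file.

    Ties are broken in favor of the label seen first (an assumption of this sketch).
    """
    votes = Counter(label for label in engine_labels if label)
    if not votes:
        return None
    label, _count = votes.most_common(1)[0]
    return label

# Hypothetical per-engine results for one sample (None = engine reported nothing).
example = ["Adware", "Adware", None, "PUA", "Adware"]
print(majority_label(example))  # Adware
```

How a sample with no detections is mapped to the Non-Malware category is left open here, since the labeling rule for that category is not spelled out in the text.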

Behavior Monitoring and Feature Extraction
In this section, we use an online behavior analysis system called HABO to monitor the behavior of the samples. Like most other online behavior analysis systems, it analyzes uploaded binaries and returns a detailed behavior report for each. Note that our methodology is not bound to the HABO system; it can also be adapted to other behavior analysis systems.
Figure 1 shows part of the behavior report of one malware sample. The report covers five main aspects: process behaviors, file behaviors, registry behaviors, network behaviors, and other behaviors. These five aspects comprise 73 sub-behaviors, which describe the behavior of the malware in detail.

Figure 1. Behavior Report of The Malware Sample
Although the reports show detailed information about the behavior of the malware, they cannot be used by the BP neural network model directly, since the model requires vector data as input. Hence, we first extract the main features of the malware behavior from the reports. The method used here is a frequency statistics method. The main steps are as follows:
1) We arrange all the sub-behaviors in a fixed order, for example: create local thread, enumerate process, create a new file process, and so on.
2) We count the occurrences of each sub-behavior of the malware. For example, the behavior shown in Figure 1 can be represented as [2, 1, 1, ..., 15, ...].
Figure 2 shows the detailed behavior vector of one malware sample; a zero entry means that the sample does not exhibit the corresponding behavior.
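The frequency statistics method can be illustrated with a short sketch. The sub-behavior names and the four-entry ordering below are hypothetical placeholders (the real vector has 73 entries in an order fixed by the report format), and the input is assumed to be a flat list of observed sub-behavior names.

```python
from collections import Counter

# Hypothetical fixed ordering of sub-behaviors; the full system uses 73 of them.
SUB_BEHAVIORS = [
    "create_local_thread",
    "enumerate_process",
    "create_new_file_process",
    "modify_registry_key",
]

def behavior_vector(observed, order=SUB_BEHAVIORS):
    """Count how often each sub-behavior occurs, in the fixed order.

    Behaviors that never occur get a zero; names outside the ordering are ignored.
    """
    counts = Counter(observed)
    return [counts.get(name, 0) for name in order]

# Toy report: two thread creations, one enumeration, one file process, 15 registry edits.
report = (
    ["create_local_thread"] * 2
    + ["enumerate_process", "create_new_file_process"]
    + ["modify_registry_key"] * 15
)
print(behavior_vector(report))  # [2, 1, 1, 15]
```

Keeping the ordering in one shared list ensures that every sample maps to the same vector positions, which the network input requires.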

Establish the Model
According to the previous description, the dimension of our behavior vector is 73. We use the vector r = [r1, r2, r3,…, r10], ri = 0 or 1, to represent the output. Each element of the output vector is 0 or 1: 0 means that the malware does not belong to the corresponding category, while 1 means that it does.
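As a small illustration of this output encoding, the sketch below builds the 10-element 0/1 vector for a category and maps a network output back to a category by taking the largest element. The zero-based category index and the argmax decoding rule are assumptions of this sketch rather than details stated in the paper.

```python
NUM_CATEGORIES = 10

def one_hot(category_index, size=NUM_CATEGORIES):
    """Return a 0/1 vector with a single 1 at the given zero-based category index."""
    if not 0 <= category_index < size:
        raise ValueError("category index out of range")
    vec = [0] * size
    vec[category_index] = 1
    return vec

def decode(output):
    """Pick the category whose output element is largest."""
    return max(range(len(output)), key=lambda i: output[i])

print(one_hot(2))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```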
Given the input vector and the output vector, we established a Back Propagation (BP) Neural Network consisting of one input layer, one hidden layer, and one output layer. The input layer and the hidden layer each have 73 neurons, and the output layer has 10 neurons. Additionally, the hidden and output neurons include the adjustment factors a and b, respectively. The connection weight between input layer and hidden layer, hidden layer and output layer is noted by . The network is shown in Figure 3. ( ) where , ( ) ( ) and the symbol  means the learning rate of the network.
In this paper, the transform factor between the input layer and the hidden layer is denoted by a, and the transform factor between the hidden layer and the output layer is denoted by b.
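A minimal NumPy sketch of a 73-73-10 network trained by back propagation is given below. It is not the authors' implementation: the sigmoid activations, the squared-error loss, the weight initialization, and the treatment of the adjustment factors a and b as per-neuron bias terms are all assumptions made for illustration, and the class name `BPNetwork` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BPNetwork:
    """One-hidden-layer network trained by back propagation (illustrative sketch)."""

    def __init__(self, n_in=73, n_hidden=73, n_out=10, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))   # input -> hidden weights
        self.a = np.zeros(n_hidden)                         # hidden adjustment terms (assumed biases)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))  # hidden -> output weights
        self.b = np.zeros(n_out)                            # output adjustment terms (assumed biases)
        self.lr = lr                                        # learning rate

    def forward(self, X):
        H = sigmoid(X @ self.W1 + self.a)
        Y = sigmoid(H @ self.W2 + self.b)
        return H, Y

    def train_step(self, X, T):
        """One batch gradient step on the squared error; returns the mean loss before the update."""
        H, Y = self.forward(X)
        err = Y - T
        delta_out = err * Y * (1.0 - Y)                       # error signal at the output layer
        delta_hid = (delta_out @ self.W2.T) * H * (1.0 - H)   # error propagated back to the hidden layer
        n = X.shape[0]
        self.W2 -= self.lr * (H.T @ delta_out) / n
        self.b -= self.lr * delta_out.mean(axis=0)
        self.W1 -= self.lr * (X.T @ delta_hid) / n
        self.a -= self.lr * delta_hid.mean(axis=0)
        return 0.5 * float(np.sum(err ** 2)) / n

# Toy usage with random data standing in for behavior vectors and category labels.
rng = np.random.default_rng(1)
X_demo = rng.random((20, 73))
T_demo = np.eye(10)[rng.integers(0, 10, size=20)]
net = BPNetwork()
losses = [net.train_step(X_demo, T_demo) for _ in range(200)]
```

Raw behavior counts can be large, so in practice some scaling of the input vector would probably be needed before a sigmoid unit is useful; that preprocessing is not shown.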

1) The identification of the output function
Because no empirically established output function is available, we can only determine the function by experiment. For the same samples, after 10000 training iterations, we obtain the results shown in Table 2. From the table, we finally chose the linear function   1.0 / 5000. Figure 4 and Figure 5 show the relationship between the adjustment factors and the convergence time when the total error is set to 0.64.

Experiment and Evaluation
To evaluate the performance of our methodology, we first divided the malware corpus randomly into a training partition and a testing partition of 10000 and 3600 samples, respectively. We used the training partition to train the BP neural network and the testing partition to measure the overall performance of our methodology. This procedure was repeated over five independent experimental runs, and we report the average values as our final results.
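The evaluation protocol can be sketched as follows, assuming that each of the five runs draws a fresh random 10000/3600 split (the text does not say whether the split is redrawn). The callback `evaluate` is a hypothetical stand-in for training the network on the training indices and scoring it on the testing indices.

```python
import numpy as np

def split_indices(n_samples, n_train, rng):
    """Shuffle sample indices and cut them into training and testing parts."""
    perm = rng.permutation(n_samples)
    return perm[:n_train], perm[n_train:]

def average_over_runs(evaluate, n_samples=13600, n_train=10000, runs=5, seed=0):
    """Call evaluate(train_idx, test_idx) once per run and average the scores."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(runs):
        train_idx, test_idx = split_indices(n_samples, n_train, rng)
        scores.append(evaluate(train_idx, test_idx))
    return float(np.mean(scores)), scores

# Placeholder evaluation that only reports the size of the testing partition.
mean_size, sizes = average_over_runs(lambda tr, te: len(te))
print(mean_size)  # 3600.0
```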
The per-category accuracy for this experiment is shown in Figure 6; the error bars indicate the variance measured across the experimental runs. The figure shows that our average accuracy reaches 86%. In particular, the last category, which we defined as non-malware, is predicted with an accuracy of up to 99%. In other words, if we use this model to decide whether a binary in our corpus is malware or not, the probability of a correct decision is approximately 99%. This result suggests that our methodology can easily be extended to malware detection. Furthermore, because the boundaries between categories 3, 4, 5, 6 and 9, labelled Trojan, Tr/downloader, Tr/Crypt, Tr/dropper and Win32, are less distinct, the variance of these categories is higher than that of the other categories.
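The per-category accuracy and its variance across runs could be computed along the following lines. The function names are hypothetical, labels are assumed to be zero-based integer category indices, and a category with no test samples in a run is reported as NaN.

```python
import numpy as np

def per_category_accuracy(y_true, y_pred, n_categories=10):
    """Fraction of samples of each true category that were predicted as that category."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.full(n_categories, np.nan)
    for c in range(n_categories):
        mask = y_true == c
        if mask.any():
            acc[c] = np.mean(y_pred[mask] == c)
    return acc

def summarize_runs(acc_per_run):
    """Mean and variance of the per-category accuracy over several runs."""
    acc = np.vstack(acc_per_run)
    return np.nanmean(acc, axis=0), np.nanvar(acc, axis=0)

# Toy example with two categories and two runs.
run1 = per_category_accuracy([0, 0, 1, 1], [0, 1, 1, 1], n_categories=2)
run2 = per_category_accuracy([0, 0, 1, 1], [0, 0, 1, 1], n_categories=2)
mean_acc, var_acc = summarize_runs([run1, run2])
```

The variance returned by `summarize_runs` corresponds to the quantity the error bars are described as showing.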

Figure 6. Accuracy per category
Figure 7 shows the confusion matrix for the classification. The deeper the color of a category, the less likely its samples are to be misclassified into other categories. The figure shows that categories 3 through 7 are easily confused with each other, and the category Win32 is the most likely to be classified into other categories, which is consistent with what we observe in practice.
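A confusion matrix of this kind can be built with a few lines of NumPy. This is a sketch under the assumption that true and predicted labels are zero-based integer category indices; the row normalization, which turns counts into per-category fractions suitable for color coding, is one plausible choice and is not confirmed by the text.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_categories=10):
    """cm[t, p] counts samples of true category t that were predicted as category p."""
    cm = np.zeros((n_categories, n_categories), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def row_normalize(cm):
    """Convert counts to per-true-category fractions; empty rows stay all zero."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float), where=totals > 0)

# Toy example with two categories.
cm_demo = confusion_matrix([0, 0, 1], [0, 1, 1], n_categories=2)
frac_demo = row_normalize(cm_demo)
```

In the normalized matrix, a large diagonal entry means the category is rarely mistaken for another, while large off-diagonal entries mark the category pairs that are confused with each other.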

Conclusion
In this paper, the behavior of malware is captured by an online behavior analysis system. We then extract the main features of the malware in the form of a vector and use it as the input to our Back Propagation Neural Network model. Finally, by training the BP Neural Network, we can use it to classify and detect malware. Experimental results show that our methodology can classify malware variants effectively and detect malware accurately.
In future work, we will focus on how to extract features from the behavior analysis reports that represent the malware more accurately.

Figure 2. A Detailed Behavior Vector of One Malware Sample

Figure 3. The BP Neural Network

2) The identification of the adjustment factors
The values of the adjustment factors a and b have an important influence on the speed of convergence.

Figure 7. Confusion of Categories

Table 1. Malware Families Labeled by the VirSCAN System