Intrusion Detection System Using Machine Learning Algorithms

The internet has radically changed the world. It helps people maintain their social networks and connects them to other members of those networks when they need assistance. However, sharing professional and personal data exposes individuals and organizations to several risks. Because the internet has become a crucial element of daily life, the security of our data can be threatened at any time. For this reason, an Intrusion Detection System (IDS) plays a major role in protecting internet users against malicious network attacks. An IDS is a system that monitors network traffic for suspicious activity and issues alerts when such activity is discovered. This paper focuses on three machine learning classification algorithms: Naïve Bayes (NB), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN). In the first stage, these algorithms are compared on the UNSW-NB15 dataset to determine which achieves the best accuracy. Based on the result of the first stage, the second stage processes our second dataset with the most efficient algorithm. Two datasets, NSL-KDD and UNSW-NB15, are used in our experiments to evaluate the performance of the proposed approach and to confirm its efficiency.


Introduction
The number of computing devices has grown rapidly. Laptops and desktop computers, as well as smartphones and tablets, have become nearly indispensable tools in everyday life, and many people use them on a regular basis. The main issue is that data exchanged over the internet has to be secured, and an Intrusion Detection System (IDS) provides this security at the network level. An IDS is a software application or device that monitors system or network activity for policy violations or malicious behavior and generates reports for the management system. The need for intrusion detection is undeniable; thus, an accurate model must be developed. In this field, machine learning has proven to be an effective investigative tool that can detect irregular events in a system's traffic. A good IDS must be able to detect malicious traffic with high efficacy, and the accuracy of the classification algorithms determines that efficacy.
In this work, we propose an IDS approach for detecting malicious network traffic with greater efficiency and higher accuracy. We first present our first dataset, on which three different classification algorithms are trained. The next section presents our second dataset, which is trained only with the most accurate of the three algorithms. The last section gives a conclusion and highlights some issues for future research.
* Corresponding author: rachid.tahritr@gmail.com

Dataset Descriptions
Many datasets are publicly available online for research purposes. A review of the literature shows that some of them were created decades ago and may not be very useful for detecting recent threats. Examples of such datasets include KDD98 and KDD'99.
UNSW-NB15 was created in 2015 in the cyber range lab of the Australian Centre for Cybersecurity (Slay N. M., 2016). The dataset is distributed in several formats, including CSV. We do not use the original CSV files because they include over 2.5 million records split into four files.
Instead, we use the cleaned CSV files, which contain 175,341 records in the training set and 82,332 records in the testing set. There are 47 features in the dataset, including numeric, nominal, and categorical data types. The dataset is labelled for both binary and multi-class classification. The distribution of each attack in the training and testing sets is shown in Table 1.

NSL-KDD
KDD'99 is outdated and contains redundant records, which leads to inaccurate network intrusion detection. NSL-KDD, a refined version of KDD'99, solves this problem. The training set of NSL-KDD has 125,973 data points, whereas the testing set contains 22,544 data points. It features 41 variables with numeric, binary, and nominal data types, as well as a label. The records fall into five classes: normal traffic and four major groups of attack types, namely DoS, Probe, R2L, and U2R. The distribution of each attack in the training and testing sets is shown in Table 2.
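Because both datasets mix numeric and nominal features, nominal values have to be converted to numbers before distance-based or margin-based classifiers can use them. The following is a minimal sketch, assuming one-hot encoding for nominal features and min-max scaling for numeric ones; the paper does not state which preprocessing was applied, the helper names `one_hot` and `min_max_scale` are hypothetical, and the toy values are illustrative rather than taken from either dataset.

```python
def one_hot(values):
    """Map each nominal value to a 0/1 indicator vector over the sorted categories."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_scale(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Toy records (assumed values): one nominal and one numeric feature.
protocols = ["tcp", "udp", "tcp", "icmp"]
durations = [0, 5, 10, 20]

categories, protocol_vectors = one_hot(protocols)
scaled_durations = min_max_scale(durations)

print(categories)        # ['icmp', 'tcp', 'udp']
print(protocol_vectors)  # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(scaled_durations)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling matters mostly for KNN and SVM, where a feature with a large numeric range would otherwise dominate the distance or margin computation.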

Related works
Abhishek Divekar et al. (A. Divekar, 2018) used classification algorithms such as Naïve Bayes, K-means, neural network, RF, SVM, and DT and compared their performance on alternatives to KDD'99. They found that UNSW-NB15 is a better and more modern alternative to KDD'99. The study showed that, in terms of F1-score, classifiers trained on UNSW-NB15 performed much better than those trained on KDD'99 and NSL-KDD.
The authors of (Srivastava, 2018) attempted to assess the performance and effectiveness of NIDS. They used two feature reduction methods, LDA and CCA. Seven classifiers were evaluated with different measurement parameters and metrics such as FPR, training time, accuracy, and ROC area. The algorithms used are the random tree, the naive Bayes, the REP tree, the RF, random committee, randomizable bagging, and filtered. The combination of LDA and random tree on UNSW-NB15 was reported as the best.

Classification Algorithms
Market analysis, scientific exploration, production control, and other applications can all benefit from information extracted from data. Classification algorithms are one of the key building blocks of machine learning. They are used to sort unlabeled data into different categories. The following algorithms were employed in this work.
Support Vector Machine (SVM): SVM is one of the most reliable classification algorithms in machine learning because it offers a rapid and easy prediction process. It constructs a hyperplane that separates the data points into their associated classes; the position of the hyperplane is determined by the support vectors, the training points that lie closest to the class boundary.
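The idea of a separating hyperplane can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a sketch under stated assumptions, not the implementation used in this work: the paper does not specify a library, kernel, or hyperparameters, and the function names, the learning rate, the regularization strength, and the two-feature toy data (label -1 for normal, +1 for attack) are all illustrative.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=1000):
    """Fit weights w and bias b by sub-gradient descent on the
    L2-regularized hinge loss. Labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Inside the margin or misclassified: move the hyperplane
                # toward classifying xi correctly while shrinking w slightly.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only the regularizer acts on w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (assumed values): -1 = normal, +1 = attack.
X_train = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y_train = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X_train, y_train)
print(svm_predict(w, b, [0.5, 0.5]))  # -1
print(svm_predict(w, b, [3.5, 3.5]))  # 1
```

Only the points that violate the margin trigger a data-driven update, which is why the final hyperplane depends on the samples nearest the boundary rather than on the whole training set.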

Fig. 3. SVM
K-Nearest Neighbor (KNN): KNN is another reliable classification algorithm. It assigns a sample to the majority class among its k nearest training samples. One of its promising features is that it can be used for both classification and regression.
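A minimal from-scratch sketch of KNN classification follows, assuming Euclidean distance and majority voting with k = 3. The function name `knn_predict` and the toy feature vectors are hypothetical, and the distance metric is an assumption since the paper does not name one.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k nearest points and pick the most frequent one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, already-scaled feature vectors (assumed values).
train_X = [
    [0.10, 0.20], [0.20, 0.10], [0.15, 0.15],  # normal traffic
    [0.90, 0.80], [0.80, 0.90], [0.85, 0.85],  # attack traffic
]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(knn_predict(train_X, train_y, [0.2, 0.2]))  # normal
print(knn_predict(train_X, train_y, [0.7, 0.9]))  # attack
```

KNN has no training phase: every prediction scans the whole training set, so prediction cost grows with the number of stored records.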

Fig. 4. NSL-KDD details
Naïve Bayes (NB): NB classifiers estimate the probability that a given sample belongs to a particular class. They are based on Bayes' theorem and on the assumption that, for a given class, the value of each attribute is independent of the values of the other attributes. This assumption is called class conditional independence.
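Written out, and using notation that is introduced here for illustration (a class $c$ and attribute values $x_1, \dots, x_n$), Bayes' theorem combined with class conditional independence gives the standard Naïve Bayes decision rule:

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  = \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)},
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The second equality is where the independence assumption is used. The denominator is the same for every class, so it can be dropped when choosing the predicted class $\hat{c}$.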

Methodology
A comparative analysis is carried out between SVM, KNN, and Naïve Bayes on the classification of the dataset in order to analyze their accuracy. First, the raw dataset is taken; its class attribute contains 19 different attack types, which are labeled under five categories: normal, DoS, Probe, R2L, and U2R.
Fig. 6. Processes of testing
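The labeling and comparison steps can be sketched as follows. The mapping covers only a subset of NSL-KDD attack names and is illustrative, not the full list used in this work; the names `ATTACK_CATEGORY`, `to_category`, `accuracy`, `classifier_a`, and `classifier_b` are hypothetical, and the predictions are placeholders that exercise the code, not experimental results.

```python
# Partial, illustrative mapping from raw NSL-KDD labels to the five categories.
ATTACK_CATEGORY = {
    "normal": "normal",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L", "warezclient": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
}


def to_category(raw_label):
    """Translate a raw class label into one of the five categories."""
    return ATTACK_CATEGORY[raw_label]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted category matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


raw_labels = ["normal", "neptune", "satan", "guess_passwd", "rootkit"]
y_true = [to_category(label) for label in raw_labels]

# Placeholder predictions from two hypothetical classifiers (not real results).
predictions = {
    "classifier_a": ["normal", "DoS", "Probe", "R2L", "U2R"],
    "classifier_b": ["normal", "DoS", "Probe", "normal", "normal"],
}

scores = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)
print(scores)  # {'classifier_a': 1.0, 'classifier_b': 0.6}
print(best)    # classifier_a
```

Accuracy alone can hide poor detection of rare categories such as R2L and U2R, so per-category figures are worth reporting alongside it.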

Conclusion
In this study, the first dataset was processed by the three algorithms SVM, NB, and KNN with a neighbourhood of 3. KNN was eliminated because of its weak results, and the second dataset was then processed by the two remaining algorithms. SVM showed good performance regardless of the size of the dataset or the types of attack it contains. In future work, this model will be optimized in terms of processing time, and we will also work on implementing it in a firewall and testing it in real time.