# Intrusion detection system combined enhanced random forest with SMOTE algorithm – EURASIP Journal on Advances in Signal Processing

#### ByTao Wu, Honghui Fan, Hongjin Zhu, Congzhe You, Hongyan Zhou and Xianzhen Huang

May 7, 2022

The intrusion detection model involved in this paper selected machine learning algorithms such as random forest, which are commonly used in related studies. The performance of the classifier was improved by optimizing the random forest algorithm for similarity and combining it with data imbalance processing techniques. The overall architecture of the intrusion detection model is shown in Fig. 1, which includes the following processes:

1. (1)

The analysis of the NSL-KDD dataset revealed the imbalance in the samples of the network attack type dataset. This imbalance resulted in higher false detection rates and lower accuracy for detecting smaller-scale samples. Therefore, this paper proposed a combination sampling method by combining the K-means algorithm with the SMOTE algorithm. This approach can reduce the number of outlier samples, enrich the attribute features of the minority samples, and increase the sampling number of the minority samples to build more balanced sample data of the network environment.

2. (2)

To further reduce the computational overhead time and increase the detection performance, it was necessary to convert the nonnumerical features in the original dataset digitally and then convert the values to a specific range by normalization. This allowed dataset normalization and feature selection by information gain to filter out unnecessary features.

3. (3)

The classification model was trained by feeding the normalized processed dataset into the enhanced random forest algorithm. The whole process is explained in detail as follows: the traditional random forest model is initially constructed, and then the constructed model is evaluated based on the area under the curve (AUC) index for the decision tree performance. The decision tree with the best performance is selected by the abovementioned approach. Then, the decision trees with high similarity are filtered by calculating the similarity between them, and finally, the decision trees with low similarity and high performance are formed into an enhanced random forest model, and the activity similarity matrix is generated to determine the results of subsequent activities.

4. (4)

In the next section, correction and optimization of detection results are performed by calculating the similarity relationship between sample types of network attack data. The process started with a preliminary determination of the cybersecurity attack dataset by enhanced random forest and determined the type of attack by majority voting. In the next step, this paper made accuracy judgments were made based on the key features, and if the judgment results indicated that the activity type was not reasonable, the results were corrected based on the attack type similarity matrix.

5. (5)

A classification model with relatively good performance was obtained after enhanced random forest training, and the results were evaluated by conducting performance tests on the NSL-KDD dataset.

### Construction of a balanced dataset based on the K-means clustering algorithm and smote sampling technique

There are a large number of normal-type samples in the network attack dataset, while the number of abnormal samples is relatively small, which interferes with the classification performance in the process of detection model training. Such problems result in a classification model that performs well in terms of accuracy for majority classes of samples, but the accuracy of the minority classes may not be satisfactory, and the generalization ability of the overall classification model is relatively weak. Among the many sampling algorithms, the diversity of the training sample is maintained and the inherent characteristics of the sampled samples are preserved. Therefore, the SMOTE algorithm technique is used for the oversampling of minority class samples in this paper. By analyzing the minority samples, multiple minority samples are manually processed to generate new samples and added to the original dataset. This approach allows optimization of the network environment sample and minimization of the model overfitting problem. The main idea of the algorithm is explained in detail as follows:

1. (1)

For every sample x from the minority class sample, based on the Euclidean distance as the reference standard, the distance from this sample to other samples of the same type is calculated. The k nearest neighbors of this sample are obtained by the above operation.

2. (2)

By analyzing the number of samples in advance, a more reasonable oversampling rate N is determined. Based on the determined parameter N, the random selection operation of the number of samples from the k nearest neighbors obtained above is denoted as xn.

3. (3)

For the random nearest neighbor sample xn obtained by operation (2), the new sample points are constructed by performing the operation shown in Eq. (1) with the initial sample points in turn.

$$x_{{{text{new}}}} = x + {text{rand}}(0,1) times left| {x – x_{n} } right|$$

(1)

The SMOTE algorithm technique achieved some effect and improved the overfitting problem. Based on the traditional SMOTE algorithm, a series of improved algorithms have been proposed and have achieved better performance, including AE-SMOTE and SMOTE-ENN [13]. However, the analysis of the network security dataset revealed that the SMOTE algorithm still had certain problems in dealing with imbalance problems, such as handling outlier values. Related studies have addressed this type of problem by excluding such values a priori or by not considering outlier values; for example, this type of value was not handled in Borderline-SMOTE [14]. Therefore, in this paper, the interference of outlier points in the sample generation process was reduced by combining the K-means clustering algorithm with SMOTE, and the detailed steps of the improved SMOTE algorithm are shown as follows:

1. (1)

The minority samples data are analyzed, the number of clustered sample centers T is determined, and then the target samples are selected by K-means clustering based on this value.

2. (2)

For the target sample determined in the above step, a random sample is selected, and the k nearest neighbors of this sample in the target sample are calculated.

3. (3)

The samples are also selected from the k nearest neighbors based on the preanalyzed data and oversampling rate N is set. The mean value between all samples is calculated, and then a new sample is generated between that value and the neighboring samples following the steps shown in Eq. (2).

begin{aligned} x_{{{text{mean}}}} & = frac{1}{k}sumlimits_{i = 1}^{k} {x_{i} } \ x_{{{text{new}}}} & = x_{i} + {text{rand}}(0,1) times left( {x_{{{text{mean}}}} – x_{i} } right) \ end{aligned}

(2)

The overall flow of the improved sampling processing algorithm is shown in Fig. 2, and the figure is demonstrated using the binary classification problem. First, the number of outlier samples was minimized as much as possible by clustering through K-means, and then the obtained outlier samples were used in an optimized way for the subsequent new sample production process. Then, the mean sample of the neighboring points was calculated, and an attempt was made to use it as the center of the later sample clustering with the nearest neighbor sample to generate the new sample. The attribute characteristics of the new samples obtained in this way were richer and the number of outliers was relatively reduced compared with the traditional way, which was more beneficial for training the random forest classifier later. The overall sampling process is shown in Fig. 2.

### IDS based on enhanced random forest

The random forest algorithm has relatively high detection accuracy compared to other classification algorithms, and the algorithm is more tolerant of noisy samples, which has resulted in numerous theoretical and experimental studies focusing on the use of these algorithms. As a combinatorial classifier algorithm, by learning the basic idea of the bagging algorithm to obtain N bootstrap training samples with a put-back sampling of the original dataset, the disguised augmentation of the samples used for training can be achieved. This approach effectively reduces the probability of overfitting. The dataset obtained by the above operation is fed into a decision tree model for training, and the final model is combined to generate a forest classification model. The model predicts the results by majority voting. The overall flow of random forest is shown in Fig. 3.

However, there is still much room for improvement in the traditional random forest model. It includes improvements in the classification ability of each decision tree in the forest, further optimization of the correlation between decision trees in the combined forest model, and optimization of the voting method adopted in the result determination process. A relatively good combinatorial model needs to have the following characteristics: good decision-making ability within classifiers and a small correlation between classifiers. This paper optimized the random forest from the following aspects. First, the classifiers with excellent decision performance were selected by the area under the curve (AUC) index, then intertree optimization was performed by calculating the similarity between the decision trees, and finally, the result correction process was performed by determining the similarity between the network attack results.

The horizontal and vertical axes of the receiver operating characteristic (ROC) curve depict the proportion of predicted types that are consistent with positive samples and actual types, respectively. The value of the area under the curve (AUC) is the area between the ROC curve and the coordinate axis. It serves as the corresponding probability value, and it can indicate the superiority of the classifier performance by comparing the high or low value, which is the reason that it is a criterion for internal performance optimization. In addition, the structural similarity between classifiers can be further calculated based on calculating classifier nodes and branches. Based on this, the structural similarity between classifiers was further calculated, which could be roughly classified as similar (more than 80%) or dissimilar (less than 40%) by setting a certain threshold value as the judgment criterion. After the similarity comparison by the above steps, the values were transformed into matrix form, and the secondary optimization process was carried out according to the difference in classification and the AUC level. The classifiers with strong individual classification ability and low similarity were selected to form a classification model by combining them.

The details of the enhanced random forest model involved in this paper are explained as follows:

1. (1)

Analyzing and using the bagging algorithm, the original network security samples are selected and grouped randomly, the number of in-of-bag (IOB) samples for each group is W, and the obtained sample order set is as follows:

$$left{ {{text{IOB}}_{1} ,{text{IOB}}_{2} , ldots ,{text{IOB}}_{W} } right}$$

(3)

2. (2)

Based on the above-obtained training sample set (3), the corresponding optimal splitting attributes and candidate attributes are selected by random attribute selection and Gini index. After N rounds of model training, the corresponding classification model is finally obtained. By calculating and ranking the AUC values, the set of classification models with excellent classification performance is selected.

3. (3)

The acquired classification models are optimized based on the similarity, and the classifiers with high similarity and poor classification performance are emptied. The computational approach taken in this paper focuses on calculating and applying structural similarity, and this class of methods learned and borrowed from Bakirli’s [15] multitree intersimilarity optimization method. By analyzing and utilizing the decision tree storage structure form, the classifier is transformed and decomposed into the corresponding rule set and candidate rule set. The similarity ({text{similarity}}_{c1,c2}) between multiple trees is derived by comparing split nodes among them and generating the similarity matrix ({text{Matrix}}). The set of classifiers (4) with high similarity and poor overall performance is selected by setting the corresponding thresholds:

$$left{ {{text{classifier}}_{1} ,{text{classifier}}_{2} , ldots ,{text{classifier}}_{Q} } right}$$

(4)

4. (4)

Based on the classification combinations obtained by the above operation, the results are determined. The traditional rules for voting in the classification model are based on the majority voting principle, as shown in (5):

$$F(x) = arg mathop {max }limits_{forall T} mathop sum limits_{i = 1}^{Q} left( {{text{classifier}}_{Q} (x) = T} right)$$

(5)

In (5), F(X) denotes the combined classification model after the optimal selection shown above. ({text{Classifier}}Q , left( x right)) denotes the Q single classifiers in the combined classification model, and T denotes the label classification result.

In this paper, the voting session results were processed with corrections. Certain criteria needed to be established because the detection accuracy of the classification model was not completely accurate, and if there was a misclassification during the detection process, the method could be used to correct the detection results. Therefore, some activity rules needed to be predefined. The voting result F(X) was obtained after the majority voting of the result by the combined classifier, and the result of this determination was within a reasonable range if the attribute characteristics of the type were within the predefined activity rules. If the attribute characteristics of the activity did not match the set activity rules, it was necessary to calculate the similarity relationship between the decision results to generate the ({text{SimMatrix}}). Finally, a relatively more reasonable decision could be chosen based on the probability value ({text{SimMatrix}}) obtained. This operation focused on the following components:

1. 1.

Setting activity rules.

2. 2.

Generating the activity similarity matrix.

#### Setting activity rules

Through analyzing the NSL-KDD dataset samples, more critical features and candidate features were found, which could be used as the basis for setting the corresponding activity rules. The number of NSL-KDD attribute features was relatively large at 41, and the attack type labels were roughly divided into two categories, including normal and anomaly. The anomaly data types could be divided into four major categories, including denial of service attacks (DOS), probe, users to root attacks (U2R), and remote to local attacks (R2L).

To reduce the computational overhead time and reduce the false detection rate when analyzing and processing data of larger size and dimensionality, similar studies included optimization methods commonly used in Mohammadi [16], Selvakumar [17], and Staudemeyer [18], such as feature compression. After extensive experimental research by analyzing and processing the entire KDDcup99 dataset, Staudemeyer massively compressed the attribute features to 11, including duration, service, and other types. Based on this, by combining the decision tree classification method with correlation, the extracted features compressed the scale more efficiently compared to the previous methods.

On this basis, the inherent features of the dataset were selected through information gain. The attributes selected for the active rule setting included service, src_bytes, dst_bytes, hot, num_file_creations, dst_host_srv_count, and dst_host _same_src_port_rate.

After experimental comparison, it could be seen that the effect of service was relatively better; specifically, the set of service attributes of U2R included tftp_u, ftp_data, and gopher, pm_dump. The set of R2L attributes included telnet, ftp_data, ftp, other, http, imap4, and login. The set of service attributes for the rest of the data samples contained R2L.

Therefore, the following rules were established for the event. If the service attribute of the detection target was a subset of the set corresponding to R2L and did not belong to the dataset corresponding to U2R, the set Tser of result types was determined to be R2L, probe, DOS, and normal. If the service attribute of the detection target was not a subset to which R2L belonged, the set Tser of network attack types was distributed in the probe, DOS, normal. The specific explanation of the activity set where the process judgment was located is shown in (6):

$$T = left{ {begin{array}{*{20}c} {F(x)} & {F(x) in T_{{{text{ser}}}} } \ {mathop {max }limits_{j} {text{AcCorr}}_{{[{text{classifer}}(x)][j]}} } & {F(x) notin T_{{{text{ser}}}} } \ end{array} } right.$$

(6)

where F(x) is obtained by majority voting, x is the intrusion behavior characteristic, and Tser is the set of types obtained by the activity rule.

If the result F(x) obtained by the combined classifier is in Tser, the final invasion type is determined as F(x). In contrast, if the results do not match, the set of attack types in Tser and F(x) are calculated by similarity, and the one with the largest probability value in the set Tser of types is selected as the final classification result.

#### Generating the activity similarity matrix

To analyze the intrusion detection data samples, it was clear that the malicious network attack types had certain similar property operations, such as DOS, U2R, PROBE, R2L, and other attack types, which could lead to the reduction in Src_byte and dst_byte byte values. U2R- and R2L-type attacks could be detected by hot, num_failed_logins feature behavior, and malicious interactions had a strong correlation in time. Therefore, the analysis of the correlation between attack types could be performed in advance, and in this way, the correction operation of the random forest determination results was achieved. A large number of experiments for intrusion type detection were conducted in this paper, and on this basis, comparisons were made with the real type to generate the corresponding activity matrix in Fig. 4.

The corresponding target relationship probability values were obtained by the operations shown below:

$${text{sim}}_{a,b} = frac{{{text{time}}[a][b]}}{{mathop sum limits_{r = 1}^{n} {text{time}}[a][r]}}$$

(7)

In (7), (sumnolimits_{r = 1}^{n} {{text{time}}} [a][r]) denotes the total number of attack types determined as a after pre-experimental processing, and ({text{time}}[a][b]) denotes the proportion of attack types determined as b among the attack types determined as a obtained above.