The main motivation of this study was to apply a suitable feature selection technique for imbalanced data and make a comparison of the performance of the ML models described in “Machine learning methods under scrutiny” section  by training those over the datasets summarized in “Discussion about experimental datasets” section. The diagram of the proposed approach has been illustrated in Fig. 7.

Fig. 7

Proposed ML-based IDS framework

Data preprocessing

The first step of the experiment was data preprocessing or data engineering. Feeding the models raw data may give a misleading prediction sometimes. As the datasets do not have any null or duplicate value, the data preprocessing step included only Categorical Feature Encoding, Feature Selection, and Feature Scaling.

Coding for categorical feature

The UNSW-NB 15 dataset contains three categorical features: State, Proto, and Service, and the Network TON_IoT Dataset has nineteen string features as shown in Table 3. To encode the categorical features of UNSW-NB 15, Response Coding was applied, and to transform the string features of Network TON_IoT dataset Label Encoding was used. In Label Encoding, each categorical feature is given a unique integer value based on alphabetical order (Sethi 2020). On the other hand, Response Coding is an encoding technique, which represents the probability of data instances that belongs to a specific class given a category. For N-class classification, N new features are generated for each category that inserts the probability (P) of data instance belonging to each class based upon the value of categorical data (Dharmik 2019). A mathematical expression can be presented by:

$$P( Y | A ) = Pleft( {A cap {text{ Y}}} right)/Pleft( A right)$$


where Y is denoted as a class and the category of feature is represented as A. It requires considerable memory space that made it unsuitable for transforming string features of the Network TON_IoT dataset, as it contains several categorical features.

Feature scaling

The datasets contain some features which have highly fluctuated magnitudes, ranges, and units. Due to this highly varying characteristic of the datasets, some algorithms such as the Multilayer Perceptron predicts wrongfully. Besides, it requires high computational resources. So, feature scaling is one commonly used method that normalizes the range of independent variables. Feature scaling is mandatory for some machine learning models which calculate the distance between data. Normalization and Standardization are two widely applied feature scaling methods in machine learning. In this experiment, Normalization or Min–Max scaling was used to normalize the data between the range of 0 and 1. The general formula of normalization can be shown as below:

$$f^{ – } = left( {f – {text{min}}left( f right)} right)/left( {{text{max}}left( f right) – {text{min}}left( f right)} right)$$


where (f) is the original feature and (f ^{ – }) is the normalized feature. The minimum and maximum values of the feature are represented as ({text{min}}left( f right)) and ({text{max}}left( f right)) respectively.

Gini Impurity-based Weighted Random Forest (GIWRF) for feature selection

Random Forest (Breiman 2001) is an ensemble classifier that is built on a number of decision trees and supports various feature importance measures. One of those is to derive the importance score by training the classifier. The traditional machine learning classification algorithm expects that all the classes in the training set come up with similar importance, and the models are built without considering that there may exist an imbalance class distribution in training data (Krawczyk 2016). To understand the relevance of features with the output of the imbalanced data, this feature selection technique employed a weight adjustment technique in RF once the classifier measured the Gini impurity, i(τ). Gini impurity reveals how well a split is to divide the total samples of binary classes in a specific node. Mathematically it can be written as:

$$ileft( tau right) = 1 – p_{p}^{2} – p_{n}^{2}$$


where (p_{p}) is the fraction of positive samples and (p_{n}) is the fraction of negative samples out of the total number of samples (N) at node τ. The reduction in Gini impurity deriving from any optimal split (Delta i_{f} left( {tau , M} right)) is gathered together for all the nodes (tau) in the M number of weighted trees in the forest, individually for all of the features. Mathematically it can be written as:

$$I_{g} left( f right) = mathop sum limits_{M} w_{p, n} mathop sum limits_{tau } Delta i_{f} (tau , M)$$


where (I_{g}) is the Gini importance, which specifies the frequency of a particular feature (f) being selected for the split and the significance of the feature’s overall discriminative value for the binary classification task. Assigning the weight (w_{p,n}) defines the imbalanced class distribution in the learning algorithm. The weight adjustment can be written as:

$${text{weight for positive class}}, w_{p} = frac{{n_{n} }}{N}$$


$${text{weight for negative class}},w_{n} = frac{{n_{p} }}{N}$$


(w_{p} + w_{n} = 1) and for imbalanced class data (w_{p} ne w_{n}).

The number of negative instances is represented as (n_{n}) and the positive instances are denoted as (n_{p}). N is the total number of instances in the training dataset. The pseudo-code for selecting the features using the sklearn library has been explained in Algorithm 1. Considering the threshold as 0.02 and 0.030 for UNSW-NB 15 and Network TON_IoT respectively maximum accuracy was observed as shown in Fig. 8, and the selected features for both datasets have been listed in Table 4.

Fig. 8

Accuracy versus threshold value for UNSW-NB 15 and Network TON_IoT dataset

Table 4 Selected features of UNSW-NB 15 and Network TON_IoT dataset

Machine learning methods under scrutiny

In this section, a summary of ML techniques that were applied to the experimental analysis has been discussed. The methods were selected in such a way that a comparison between traditional ML technique (DT), Ensemble method (Adaboost, GBT), Artificial Neural Network (MLP), and Deep Neural Network (LSTM, GRU) would be performed to detect intrusions over two imbalanced datasets.

Decision tree

Decision Tree (Quinlan 1986) is a non-parametric supervised machine learning model which is implemented for both classification and regression tasks. It repeatedly splits the data following a specific attribute. Decision Tree learns the decision rules that are deduced from the data attributes. Based on that rules, it predicts the value of the target variable. The decision-making process of this model can be shaped like a tree and makes it easier for the user to interpret. Numerous ML tools are available to visualize the output of DT (Safavian and Landgrebe 1991). Two units: decision nodes and leaves are the fundamental concepts of DT. In the decision node, data is split based on a particular parameter, and in the leaves unit, the outcome or decisions are obtained. However, the splitting criterion in this experimental study was entropy (Shannon 1948), which measures the impurity of the split. For every single internal decision node of the Decision Tree, the entropy equation can be given by the following formula:

$$Entropy, E left( j right) = – mathop sum limits_{k = 1}^{j} P_{k} log_{2} P_{k }$$


where j is the number of unique classes in the dataset and (P_{k}) is the probability of each particular class. For binary classification the entropy equation can be written as:

$$Entropy, E = – P_{p} log_{2} P_{p} – P_{n} log_{2} P_{n}$$


where (P_{p}) is the probability of positive event and (P_{n}) is the probability of the negative event.

Adaptive boosting (AdaBoost)

AdaBoost algorithm was firstly introduced by Schapire (2003). It is an ensemble learning method, which utilizes an iterative approach to correct the mistakes of weak learners. It calls a given base learning algorithm or a weak learner repeatedly in a series of rounds to boost the model’s performance. The fundamental concept of AdaBoost is to reassign the weights to each instance, and giving higher weights to the misclassified instances. In brief, while training the Adaboost model, firstly it trains the base classifier (such as DT) and utilizes that classifier to predict over the training set. Then, increasing the weight of incorrectly classified training instances, it trains the second classifier, using the newly updated weights, again it makes a prediction on the training set. Then, again updates the weights of instances, and so on. This process will be continued until it reaches the very last base learner.

Gradient boosting tree (GBT)

Gradient Boosting Tree (Mason et al. 1999) is a widely used machine learning model that identifies the shortcomings of weak learners and overcomes their limitations by boosting gradient descent with each weak learner in the loss function. The Loss function is calculated from the difference between the true value and estimated value. It transforms the weak learners into stronger ones by adding up the predictors to the ensemble, sequentially as well as gradually. Each predictor in the ensemble corrects the mistakes made by its predecessor. GBT is an effective technique to reduce noise, variance as well as bias (Felix and Sasipraba 2019).

Multilayer perceptron (MLP)

MLP (Rosenblatt 1961) is a feed-forward Artificial Neural Network (ANN) that passes the information from input to output in a forward direction. As the name suggests, Multilayer Perceptron contains three layers of nodes: Input Layer, Hidden Layer, and Output Layer. However, in MLP each node except the input unit is a neuron, and these neurons utilize a nonlinear activation function to transform the weighted sum of input into an output. The input unit receives the input signal, and the desired task of regression or classification is conducted in the output layer (Abirami and Chitra 2020). A mathematical expression of MLP can be written as the following equation:

$$y = emptyset left( {mathop sum limits_{i = 1}^{k} w_{i} x_{i} + c} right)$$


where each neuron estimates the weighted sum of the inputs k, then adds a bias c, and after that uses an activation function ((emptyset)) for producing the output y. MLP uses back-propagation to train the neurons.

Long Short-Term Memory (LSTM)

LSTM (Hochreiter and Schmidhuber 1997) is a type of artificial Recurrent Neural Network (RNN) that is adapted to store the information for a longer period. It was developed to handle the vanishing gradient problem, which can be experienced by RNN during the training phase. An LSTM unit is a composition of a memory cell, and three gates: input, output and forget gates to control the flow of information towards the memory cell or out of the cell. The input gate takes the decision of updating the cell state, the forget gate determines what information would be kept or discarded based on estimating a coefficient computed by the input data and the earlier hidden state. Based on the previous hidden state and the input data, the output gate highlights which information should be conveyed to the next hidden state. Mathematically, if oj (t) is an output gate and sj (t) denotes the cell state of the jth LSTM unit, the equation for the hidden state (h^{j} left( t right)) at any time t can be represented as follows:

$$h^{j} left( t right) = o^{j} left( t right) .tanh left( {s^{j} left( t right)} right)$$


where tanh is the activation function.

Gated Recurrent Unit (GRU)

Gated Recurrent Unit (Cho et al. 2014) is considered as a variation or a lightweight version of LSTM, but it is simpler to implement than LSTM. It has two gates: The update gate and Reset gate, which have been introduced to solve the vanishing gradient problem of RNN. The update gate is similar to the input and the forget gate used in the LSTM, which helps the model to decide what information should be passed to the next state. On the other hand, the Reset gate determines how much past information should be forgotten. Mathematically, the hidden state of jth GRU unit can be represented as a linear interpolation between the previous activation at time t and the candidate activation (widetilde{{h^{j} }})(t + 1), where (z^{j} left( {t + 1} right)) is the update gate and makes a decision regarding the updating amount of the candidate activation:

$$h^{j } left( {t + 1} right) = left( {1 – z^{j} left( {t + 1} right)} right)h^{j} left( t right) + z^{j} left( {t + 1} right) widetilde{{h^{j} }}left( {t + 1} right)$$


Evaluation metrics

The goal of this experiment was to increase the number of correct predictions in the test set for binary classification. To evaluate the performance of ML-based IDS several performance metrics are used in Machine Learning, such as Accuracy, False Positive Rate, Precision, Recall, etc. (Kumar 2014). However, the Accuracy (AC) score is the most commonly used metric to evaluate the performance of models in binary classification problems. It can be defined as below equation:

$$AC = frac{{left( {TP + TN} right)}}{{left( {TP + TN + FP + FN} right)}}$$


where TP stands for True Positive, and TN represents the count of True Negative. True Positive is the number of correctly detected attack instances. True Negative counts the correctly classified normal instances. On the other hand, False Positive (denoted as FP) is the number of legitimate traffic that is misclassified as an attack. False Negative (denoted by FN) counts the number of attack instances that are wrongfully considered as normal instances. In the case of detecting intrusion, less FN is always expected, as it is more hazardous than FP. However, for imbalanced class data, especially, where the aim is to efficiently detect the intrusion (positive instances) four other metrics: False Positive Rate (FPR), Precision, Recall, and F1 score are also considered in addition to the Accuracy measure.

  1. 1.

    False Positive Rate (FPR) The FPR is measured as the ratio of the negative events that are misclassified as positive (FP) and the total amount of truly negative events. The expression can be given as below:

    $$False Positive Rate left( {FPR} right) = frac{FP}{{left( {FP + FN} right)}}$$


  2. 2.

    Precision Precision is the measure that evaluates a model’s performance by calculating how often the model’s prediction is correct when it positively predicts an instance. Mathematically, it can be expressed as below equation:

    $$Precision = frac{TP}{{left( {TP + FP} right)}}$$


  3. 3.

    Recall/ Detection Rate (DR) It is the measure of the Machine Learning model correctly detecting True Positive instances. Also, Recall measures how accurate the model is to identify relevant data. This is why it is referred to as Sensitivity or True Positive Rate. The mathematical expression can be shown as below:

    $$Recall/DR = frac{TP}{{left( {TP + FN} right)}}$$


  4. 4.

    F1 Score It is the tradeoff between Recall and Precision that takes both FP and FN into account and measures the overall accuracy of the ML model. F1 score can be expressed as below equation:

    $$F1 score = frac{{left( {2*Recall*Precision} right)}}{{left( {Recall + Precision} right)}}$$


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (