# Machine Learning Algorithms for understanding the determinants of under-five Mortality – BioData Mining

#### By Rakesh Kumar Saroj, Pawan Kumar Yadav, Rajneesh Singh and Obvious.N. Chilyabanyama

Sep 24, 2022

This study’s methods are explained step by step through a framework for under-five mortality prediction. The data analysis was performed in several steps. First, a multivariate logistic regression analysis was performed to find the important factors (p < 0.05); thereafter, machine learning approaches were applied to the dataset. The machine learning framework is portrayed in Fig. 2. All analyses were conducted using Python 3.3, STATA 16.0, and SPSS-27 software.

### Importance of ML methods over traditional methods

A study has shown that a machine learning framework can be used to detect significant risk factors of under-five mortality and that deep learning techniques are superior to logistic regression for the classification of child survival [10]. Machine learning models can accurately predict neonatal, perinatal, and infant mortality [11,12,13]. Several studies done to predict the bankruptcy of banks have shown that intelligent techniques (specifically ANN) seem to work more effectively than statistical techniques. ANN and KNN methods perform more effectively than traditional methods [14].

### Dataset

The National Family Health Survey (NFHS-IV) is a large-scale, multi-round, cross-sectional, nationally representative survey conducted in households throughout the Indian states and union territories, and it is one of the most extensive data collection efforts for keeping records across India. The reports are summarized from district level to state level. The survey collects extensive information on population, health, and nutrition, with an emphasis on women and young children. In this study, we used secondary data from the NFHS-IV survey of Uttar Pradesh, restricted to the target group of under-five children. This dataset has records for every woman interviewed whose child was born in the five years preceding the survey. It contains information related to the mother’s pregnancy, postnatal care, and health. This file was used to obtain information related to child health indicators such as immunization coverage, vitamin A supplementation, recent occurrences of diarrhoea, fever, and cough for young children, and treatment of childhood diseases. A total of 1377 variables were available in this dataset. The dataset contained 41,751 samples/individuals, of which 2830 were under-five deaths.

### Study variables

Following an analytical framework for child survival in developing countries [15], we selected the 19 variables (out of 1377) most relevant to under-five mortality, as most of the variables were not useful for this study. Due to missing values, only 15 variables were used for the analysis, including the outcome/target variable. A missing value is defined as a value for a variable that should have a response but does not, either because the question was not asked (due to interviewer error) or because the respondent did not want to answer. The outcome/target (dependent) variable was under-five mortality, defined as the death of a child before completing 59 months.

The predictor (independent) variables considered in this study were mothers’ educational level, births in the last five years, any exposure, currently breastfeeding, total number of living children, wealth index, mass media exposure (MXP), survival time, the total number of children ever born, desire for more children, sex of the child, child-size at birth, ANC visits and birth order.

### Data pre-processing

After the final dataset was assembled, the next step was to pre-process the data using various methods. In this step, duplicates were removed and missing values were handled using the predictive mean matching method. Thereafter, all string and categorical variables were transformed into numerical values.

An important point in data pre-processing is the need to balance the target or outcome variable. The dataset was highly imbalanced: under-five deaths were far fewer than surviving children (38,921 live children vs 2830 under-five deaths). A random over-sampling method was used to balance the target (dependent) variable, after which a ratio of 50:50 was obtained, compared with the original ratio of 93:7.
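Random over-sampling can be sketched as follows. This is a minimal illustration that assumes over-sampling means duplicating randomly chosen minority-class records until both classes have the same size; the function name `random_oversample` and the toy data are hypothetical and are not taken from the study.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the majority-class size
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data mimicking the 93:7 imbalance described in the text (not the NFHS-IV records)
labels = [0] * 93 + [1] * 7
rows = [[i] for i in range(100)]
X_bal, y_bal = random_oversample(rows, labels)
print(y_bal.count(0), y_bal.count(1))  # 93 93
```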

### Feature selection

The idea of feature selection is to rank the major risk factors in the dataset according to their importance. The ranking is based on the information gain calculated for each of the selected variables. In this study, we used a random forest model to find the risk factors or important features that contribute most to child mortality. Higher information gain values indicate variables that are more strongly associated with the class variable. We selected the eight top-ranked variables, which were used later in model building.
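To illustrate how an information gain ranking works, the sketch below computes entropy-based information gain for categorical features and sorts the features by it. It is a simplified stand-in for the random forest importance used in the study; the feature names and values are toy data invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy example only: 1 = death, 0 = alive; values are invented for illustration
labels = [1, 1, 0, 0, 1, 0, 0, 0]
features = {
    "breastfeeding": ["no", "no", "yes", "yes", "no", "yes", "yes", "yes"],
    "sex_of_child": ["m", "f", "m", "f", "m", "f", "m", "f"],
}
ranking = sorted(features, key=lambda name: information_gain(features[name], labels),
                 reverse=True)
print(ranking)  # ['breastfeeding', 'sex_of_child']
```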

### Model building

#### Data Splitting

In this step, we split the dataset into training and test data: 70% of the data were used to train the classification models and 30% were used for model evaluation. We then repeated the split with 80% training and 20% test data to get a clearer picture of classifier performance. All the independent features were converted to one-hot encoding to build better predictive models. In this study, the dependent variable was binary, i.e., dead/alive. We then used various suitable machine learning models, namely decision tree, random forest, Naïve Bayes, KNN model, logistic regression, SVM, neural network, and ridge classifier.
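The splitting and encoding steps can be sketched in a few lines. This is an illustrative sketch only: the helper names `train_test_split` and `one_hot`, the random seed, and the toy values are assumptions, not the study's code.

```python
import random

def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle the indices and hold out `test_fraction` of the records for evaluation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors, one column per category."""
    categories = sorted(set(values))
    return [[1 if value == c else 0 for c in categories] for value in values], categories

rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_X, train_y), (test_X, test_y) = train_test_split(rows, labels, test_fraction=0.3)
print(len(train_X), len(test_X))  # 7 3

encoded, categories = one_hot(["poor", "rich", "middle", "poor"])
print(categories, encoded[0])  # ['middle', 'poor', 'rich'] [0, 1, 0]
```

Passing `test_fraction=0.2` gives the 80:20 split mentioned above.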

### Decision Tree (DT)

The decision tree is one of the most intuitive and straightforward techniques in machine learning and is based on the divide-and-conquer paradigm [16]. In a decision tree, tests (on input patterns) and categories (of patterns) are used as inner and leaf nodes, respectively. The technique assigns a class number to an input array by filtering the array down through the tests in the tree [12].
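The filtering of an input down through the tests can be sketched as a walk from the root to a leaf. The tree below is hand-written for illustration, with hypothetical feature names `x1` and `x2` and arbitrary thresholds; it is not a tree fitted to the study data.

```python
def classify(node, record):
    """Follow the branch chosen by each inner-node test until a leaf (class) is reached."""
    while "label" not in node:
        branch = "yes" if record[node["feature"]] <= node["threshold"] else "no"
        node = node[branch]
    return node["label"]

# Inner nodes hold a test (feature <= threshold); leaf nodes hold a class number
tree = {
    "feature": "x1", "threshold": 2,
    "yes": {
        "feature": "x2", "threshold": 3,
        "yes": {"label": 0},
        "no": {"label": 1},
    },
    "no": {"label": 0},
}

print(classify(tree, {"x1": 1, "x2": 5}))  # 1
print(classify(tree, {"x1": 4, "x2": 5}))  # 0
```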

### Random Forest (RF)

The random forest is an ensemble learning approach for classification that uses a vast collection of de-correlated decision trees [17]. The algorithm takes hyper-parameters that specify the number of trees and the maximum depth of each tree.

### Support Vector Machine (SVM)

The SVM is a supervised machine learning technique for analyzing and recognizing patterns in data [18]. It finds a hyperplane that divides the classes, and new observations are assigned to a class according to the side of the partition on which they fall. The support vectors are the data points nearest to the hyperplane that divides the classes.
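For a linear SVM, the side of the partition can be written compactly. The symbols below are assumptions introduced for illustration and do not appear in the original text: \(w\) is the weight vector normal to the hyperplane, \(b\) is the offset, and \(x\) is a new observation.

$$\hat{y}=\operatorname{sign}\left(w^{\top}x+b\right)$$

The hyperplane itself is the set of points where \(w^{\top}x+b=0\).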

### Logistic Regression (LR)

Logistic regression is a probabilistic statistical classification model that predicts the probability of occurrence of an event. It is used to model a categorical dependent variable, most often a dichotomous outcome. The model can be binary or multinomial, predicting binary or multi-class responses, respectively [16]. The predictors need to be independent and significantly associated with the outcome variables [19].
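In the binary case, the predicted probability takes the standard logistic form. The notation here is assumed for illustration: \(x_1,\dots,x_p\) are the predictors and \(\beta_0,\dots,\beta_p\) are the coefficients to be estimated.

$$P(y=1\mid x)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\dots+\beta_px_p)}}$$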

### Naive Bayes (NB)

Naive Bayes is a simple machine learning algorithm based on the Bayes theorem, and it has a necessary assumption that the attributes are conditionally independent for the given class. Naive Bayes gives competitive classification accuracy [20]. Naïve Bayes is widely applied because of its computational efficiency and desirable features [21].
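Under the conditional independence assumption, the posterior probability of a class factorizes as shown below. The symbols are assumed for illustration: \(c\) is a class and \(x_1,\dots,x_n\) are the attribute values of a record.

$$P(c\mid x_1,\dots,x_n)\propto P(c)\prod_{i=1}^{n}P(x_i\mid c)$$

The record is assigned to the class with the largest value of this product.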

### K-Nearest Neighbours (KNN)

The KNN is a simple and effective non-parametric method of classification [22]. To classify a data record ‘t’, its ‘k’ nearest neighbours are collected, forming a neighbourhood of ‘t’. Majority voting among the data records in the neighbourhood is used to decide the classification for ‘t’, with or without distance-based weighting. When applying the KNN, we must choose an appropriate value for ‘k’, and the success of the classification depends on this value. There are several methods of determining k, but the simplest one is to run the algorithm many times with varying k values and choose the one with the best performance [23].
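Both the majority vote and the search over k can be sketched briefly. The sketch assumes Euclidean distance, unweighted voting, and a held-out validation set for picking k; the helper names and the toy points are illustrative, not from the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def choose_k(train_X, train_y, val_X, val_y, candidates):
    """Run KNN for each candidate k and keep the one with the best validation accuracy."""
    def accuracy(k):
        hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y))
        return hits / len(val_y)
    return max(candidates, key=accuracy)

# Two well-separated toy clusters
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
val_X, val_y = [(1, 1), (6, 6)], [0, 1]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))  # 0
best_k = choose_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 5])
print(best_k)
```

When several values of k tie on accuracy, `max` keeps the first candidate in the list.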

### Neural network

Neural networks reflect the human brain’s behavior and allow computer programs to find patterns and solve common problems in machine learning, artificial intelligence, and deep learning. An ANN comprises layers of nodes: an input layer, one or more hidden layers, and an output layer [24]. Each node connects to others and has an associated weight and threshold. If the output of an individual node exceeds the given threshold value, that node is activated and sends data to the next layer of the network.
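The threshold behaviour of a single node can be illustrated as follows. This sketch assumes a simple step activation on the weighted sum of inputs; the weights and thresholds are arbitrary illustrative numbers, and networks trained in practice usually use smooth activation functions instead.

```python
def node_output(inputs, weights, threshold):
    """Return 1 (node activated) if the weighted sum of inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

def layer_output(inputs, weight_rows, thresholds):
    """Apply every node of a layer to the same inputs; the result feeds the next layer."""
    return [node_output(inputs, w, t) for w, t in zip(weight_rows, thresholds)]

hidden = layer_output(
    inputs=[1.0, 0.0, 1.0],
    weight_rows=[[0.5, 0.2, 0.4], [0.1, 0.9, 0.1]],
    thresholds=[0.6, 0.6],
)
print(hidden)  # [1, 0]
```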

### Ridge regression

Ridge regression is a method for estimating the coefficients of multiple-regression models when the independent variables are highly correlated. This method was developed as a possible solution to the imprecision of least squares estimators when there is multi-collinearity among the independent variables in the linear regression model [25]. Ridge parameter estimates are more precise because their mean square error and variance are smaller than those of the least squares estimators.
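In its standard form, the ridge estimator adds a penalty on the size of the coefficients to the least squares criterion. The notation is assumed for illustration: \(X\) is the matrix of independent variables, \(y\) the response vector, \(\beta\) the coefficient vector, \(I\) the identity matrix, and \(\lambda \ge 0\) the penalty (ridge) parameter.

$$\hat{\beta}_{\text{ridge}}=\arg\min_{\beta}\left\{\lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_2^2\right\}=\left(X^{\top}X+\lambda I\right)^{-1}X^{\top}y$$

Setting \(\lambda=0\) recovers the ordinary least squares estimator; larger values shrink the coefficients and stabilize the inverse when the columns of \(X\) are highly correlated.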

### Evaluation for predictive models

In this study, to identify the best model for predicting under-five mortality, the models were evaluated with various indices: the confusion matrix, sensitivity, specificity, precision, accuracy, F1 score, negative predictive value, Cohen’s Kappa, and AUROC. The details are given below.

### Confusion matrix

The confusion matrix visualizes the actual and predicted class accuracies [26]. To examine the performance of a classification algorithm, the confusion matrix compares the predicted classification with the actual classification through four measures: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The definitions and formulas are given below.

• True positive (TP) – The model correctly predicts positive class in the outcome.

• True negative (TN) – The model correctly predicts negative class in the outcome.

• False-positive (FP) – The model incorrectly predicts a positive class in the outcome.

• False-negative (FN) – The model incorrectly predicts a negative class in the outcome.

Sensitivity – Sensitivity measures the proportion of positive events that are correctly predicted as positive, i.e., how many of all actual positives the model identifies. It is also known as recall and can be calculated by the given formula:

$$\mathbf{Sensitivity/Recall}=\frac{\mathbf{TP}}{\mathbf{TP}+\mathbf{FN}}$$

Specificity – Specificity is the proportion of correctly predicted negative outcomes among all actual negative outcomes. It can be calculated by the given formula:

$$\mathbf{Specificity}=\frac{\mathbf{TN}}{\mathbf{TN}+\mathbf{FP}}$$

Precision – Precision is the number of correctly predicted positive events divided by the total number of events that the classifier predicts as positive. It is also known as positive predictive value. In this study, it was calculated from the confusion matrix with the formula below and used to check the model output:

$$\mathbf{Precision/PPV}=\frac{\mathbf{TP}}{\mathbf{TP}+\mathbf{FP}}$$

Negative predictive value – The negative predictive value is the number of true negatives divided by the total number of cases that the classifier predicts as negative.

$$\mathbf{Negative\;predictive\;value}=\frac{\mathbf{TN}}{\mathbf{TN}+\mathbf{FN}}$$

Accuracy – Accuracy is the proportion of correct predictions (true positives and true negatives) among the total number of cases tested. In this study, it was calculated from the confusion matrix and used to determine model efficacy.

$$\mathbf{Accuracy}=\frac{\mathbf{TP}+\mathbf{TN}}{\mathbf{TP}+\mathbf{TN}+\mathbf{FP}+\mathbf{FN}}$$

F1 score – The F1 score is the harmonic mean of precision and recall, so it balances the trade-off between the two. A higher F1 score indicates a better model. It is calculated as:

$$\mathbf{F1\;score}=\frac{2\,\mathbf{TP}}{2\,\mathbf{TP}+\mathbf{FN}+\mathbf{FP}}$$

Cohen’s Kappa – Cohen’s Kappa is a coefficient used to assess the performance of a binary classification model [27]. It is a very useful evaluation statistic when working with imbalanced data. Cohen’s Kappa (k) is calculated by the given formula:

$$k=\frac{p_o-p_e}{1-p_e}$$

where \(p_o\) is the overall accuracy of the model and \(p_e\) is the agreement between the model predictions and the actual class values that would be expected by chance. Kappa typically ranges from 0 to 1, with 0 representing no agreement beyond chance and 1 representing perfect agreement between classes; negative values indicate agreement worse than chance.
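The indices defined in this section can all be computed from the four confusion matrix counts. The sketch below follows the formulas given above; the chance agreement used for Cohen's Kappa is computed from the row and column totals in the usual way, and the counts passed in are illustrative numbers, not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation indices from the confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total  # observed agreement, p_o
    # Chance agreement p_e: both prediction and truth positive, or both negative, by chance
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    p_e = p_yes + p_no
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "f1": 2 * tp / (2 * tp + fn + fp),
        "kappa": (accuracy - p_e) / (1 - p_e),
    }

# Illustrative counts only
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```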

### Area under Receiver Operator Characteristic (AUROC) Curve

The Receiver Operator Characteristic (ROC) curve is the probability curve that shows the relationship between sensitivity and specificity. It is among the most widely used metrics for binary classification outcomes. The area under the ROC curve (AUC) shows how well the predicted probabilities separate the positive class from the negative class. An AUC value close to 1 indicates that the model discriminates well, a value near 0.5 indicates performance no better than chance, and lower values indicate poor model efficiency. In this study, we use this measure to assess the model’s efficiency.
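One way to make the AUC concrete is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by comparing every positive/negative pair; the function name and the scores are illustrative assumptions, not output from the study's models.

```python
def auroc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Illustrative predicted probabilities and true classes (1 = positive)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(auroc(scores, labels), 3))  # 0.889
```

The pairwise comparison is quadratic in the number of records, which is fine for illustration but slow for large datasets.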

### Precision-recall curve

The precision-recall curve plots recall (sensitivity) on the x-axis against precision on the y-axis. It is used as an alternative to ROC curves [28]. High precision relates to a low false positive rate, while high recall relates to a low false negative rate. A large area under the curve denotes both high precision and high recall. High scores for both measures indicate that the classifier returns accurate results (high precision) and also returns most of the positive cases (high recall).
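The points of a precision-recall curve can be obtained by lowering the decision threshold one record at a time and recomputing both measures. This is a minimal sketch that assumes all scores are distinct; the function name and the data are illustrative.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs, one per threshold, from the highest score down."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

# Illustrative predicted probabilities and true classes (1 = positive)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = precision_recall_points(scores, labels)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```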