# Application of machine learning missing data imputation techniques in clinical decision making: taking the discharge assessment of patients with spontaneous supratentorial intracerebral hemorrhage as an example – BMC Medical Informatics and Decision Making

Jan 13, 2022

### Data source and preprocessing

The data in this study came from the database of the Comprehensive Data Collection and Decision Support System for health statistics in Sichuan Province. This database includes the medical records of patients with spontaneous intracerebral hemorrhage in all general hospitals and community hospitals in Sichuan Province since January 1, 2017. To keep the research problem clear and the experiments tractable, the medical records of 2000 patients with spontaneous intracerebral hemorrhage admitted up to June 30, 2019 were randomly selected, and cases with missing values were excluded. Patients with supratentorial hemorrhage were then selected as the study subjects, and 1468 complete samples were finally included.

### Experimental design

Figure 1 shows the experimental design. The whole process consisted of four main steps: generating missing data by simulation, generating complete data sets, performance evaluation and comparison, and statistical testing.

#### Generating missing data by simulation

The missing mechanism, the missing pattern, the missing proportion, the type of the missing data and the requirements of the processing method itself all affect how well missing data can be handled. Taking these factors into account, this study created missing scenarios by setting the missing mechanism of the target variable, the overall missing proportion and the ratio of the missing proportions between groups. The target missing variable was set as the volume of supratentorial hemorrhage. Simulated incomplete data sets for the different missing scenarios were generated artificially from the complete data.

According to the definition of Rubin DB [19], the missing mechanism describes the relationship between the missingness of the target variable and the other variables (observed and unobserved) in the data set, and thus explains why data are missing. It comprises missing completely at random (MCAR, missingness independent of both observed and unobserved variables), missing at random (MAR, missingness related only to observed variables) and missing not at random (MNAR, missingness related to unobserved variables). Because it is still difficult to simulate the MNAR mechanism, we set the missing mechanism as MCAR and MAR.

Under the MCAR mechanism, missing values were generated completely at random. Under the MAR mechanism, the discharge situation was chosen as the observed variable on which missingness depends. Specifically, the data set was split into two subsets according to the discharge situation, namely the failure group and the success group. We set the ratio of the missing proportions of these two groups to 1:2 and 2:1 while keeping the overall missing proportion at the set value. This produced the MAR (the ratio of missing proportion 1:2) and MAR (the ratio of missing proportion 2:1) mechanisms, which allow the effect of the between-group ratio of missing proportions on missing data processing methods to be compared under MAR.

Then, based on the missing situations reported in previous studies, six missing proportions were considered: 5%, 10%, 15%, 20%, 30% and 50%. Combining the three mechanism settings (MCAR and the two MAR ratios) with the six proportions, a total of 18 incomplete data sets, one per missing scenario, were generated by simulation.
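The scenario construction described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code: the function names, the random seed and the reading of each ratio as failure:success are assumptions, while the group sizes (261 failures, 1207 successes) and the six proportions come from the paper.

```python
import random

def mcar_indices(n_rows, prop, rng):
    """MCAR: every row is equally likely to lose its target value."""
    return sorted(rng.sample(range(n_rows), round(prop * n_rows)))

def mar_indices(group, prop, ratio, rng):
    """MAR: missingness depends only on the observed group label.

    group : list of 1 (failure) / 0 (success) labels, one per row
    ratio : (failure, success) ratio of the within-group missing
            proportions; this ordering is an assumption
    """
    fail = [i for i, g in enumerate(group) if g == 1]
    succ = [i for i, g in enumerate(group) if g == 0]
    r_fail, r_succ = ratio
    # Scale factor c chosen so that the two groups together lose
    # prop * n rows: c*r_fail*|fail| + c*r_succ*|succ| = prop * n.
    c = prop * len(group) / (r_fail * len(fail) + r_succ * len(succ))
    k_fail = round(c * r_fail * len(fail))
    k_succ = round(c * r_succ * len(succ))
    return sorted(rng.sample(fail, k_fail) + rng.sample(succ, k_succ))

rng = random.Random(2022)                 # arbitrary seed
group = [1] * 261 + [0] * 1207            # group sizes reported in the paper
scenarios = {}
for prop in (0.05, 0.10, 0.15, 0.20, 0.30, 0.50):
    scenarios[("MCAR", prop)] = mcar_indices(len(group), prop, rng)
    scenarios[("MAR 1:2", prop)] = mar_indices(group, prop, (1, 2), rng)
    scenarios[("MAR 2:1", prop)] = mar_indices(group, prop, (2, 1), rng)
```

Each entry of `scenarios` holds the row indices whose target value would be set to missing, giving 3 mechanisms × 6 proportions = 18 scenarios.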

#### Complete data set generation

Missing data processing techniques were applied to generate complete data sets by imputing missing values of the incomplete data sets of the previous step.

At present, there are three main approaches to handling missing data: deleting cases with missing values, weighting adjustment, and imputation of missing values. Imputation is the mainstream approach. Considering which missing data processing methods currently receive the most attention, and in order to compare machine learning with traditional imputation, this study chose mode imputation (Mode) and k-nearest neighbors (KNN) as representatives of the traditional single imputation methods, multiple imputation by chained equations (MICE) as the representative of the traditional multiple imputation methods, and logistic regression (LR), random forest (RF), neural network (NN), support vector machine (SVM) and ensemble learning (EL) as representatives of the machine learning imputation techniques.

##### Machine learning imputation

Machine learning based imputation methods build models to extract the useful information in the incomplete data and infer plausible imputation values. For all of the machine learning algorithms used in this study, the general idea is to use the complete samples in the incomplete data set as the training set to build a prediction model, and then to estimate the missing values with the trained model.

LR is one of the most commonly used and classic classification methods in machine learning [20]. It is a generalized linear model, a form of multiple regression analysis that relates a categorical dependent variable with two or more categories to a set of influencing factors. Because it is simple, easy to implement and well established, it is widely used in classification problems.

RF, proposed by Breiman L in 2001, is derived from the Bagging algorithm of ensemble learning [21]. The idea is as follows: from an original data set of N samples, m samples are drawn at random by the Bootstrap method to form a training set; this is repeated B times to obtain B training sets, on which B base decision trees are built. At each split, p features are randomly selected from all features, and the best of these p features according to information gain is used for the split. Each decision tree is grown until the training samples in every node belong to the same class, and no pruning is needed. The B decision trees together form the RF. This method not only considers the performance of each individual decision tree classifier but also reduces the correlation between trees, which improves the performance of the combined classifier and makes the algorithm more robust to noise.

NN is a complex network system in which interconnected neurons process information in parallel and transform it nonlinearly, imitating the way the human brain processes information. This study adopted the widely used back propagation neural network proposed by Rumelhart DE et al. in 1986 [22], which is a multilayer feedforward neural network trained by the error back propagation algorithm. A back propagation neural network can learn and store a large number of input–output pattern mappings without the mathematical equations describing these mappings being specified in advance. Its learning rule uses the steepest descent method to repeatedly adjust the weights and thresholds of the network through back propagation so as to minimize the sum of squared errors. The most common three-layer back propagation neural network model was used in this study, consisting of an input layer, a hidden layer and an output layer.
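In its generic textbook form, the steepest descent rule mentioned above can be written as follows. The symbols are introduced here for illustration and are not taken from the paper: $y_i$ is the observed output, $\hat{y}_i$ the network output, $w$ any weight or threshold, and $\eta > 0$ a learning rate; the factor $\tfrac{1}{2}$ is a common convention that only simplifies the derivative.

$$E = \frac{1}{2}\sum_{i} \left( y_{i} - \hat{y}_{i} \right)^{2}, \qquad w \leftarrow w - \eta \frac{\partial E}{\partial w}.$$

Back propagation is the procedure that computes $\partial E / \partial w$ layer by layer, from the output layer back to the input layer, using the chain rule.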

SVM was proposed by Vapnik V et al. [23]. It was designed for binary classification: it can map linearly inseparable data to a higher-dimensional space and, based on the training set, find the separating hyperplane with the largest margin, from which the decision function is obtained. Samples of different classes are separated by maximizing the margin between the two classes while minimizing the misclassification error.

EL accomplishes the learning task by constructing and combining multiple learners, and often obtains better generalization performance than a single learner [24]. This study adopted the Stacking algorithm proposed by Wolpert DH in 1992, also known as Stacked Generalization [25]. Stacking combines multiple classification methods into a single model, which takes advantage of the strengths of different machine learning methods and thus improves prediction accuracy. Stacking is a two-stage learning model. The original data set is used to train the first stage models, which comprise several different classification methods. The second stage model is then trained to combine the predictions of the first stage models into the final result. In this study, LR, RF, back propagation NN and SVM with a radial basis function kernel were used as the first stage models. For the second stage model, SVM with a radial basis function kernel was chosen to learn the relationships among the first stage predictions automatically. The algorithm framework is shown in Fig. 2.

Hyperparameters were tuned separately for each model and each simulated incomplete data set: the Grid Search method was used to select the configuration with the best classification performance, measured by the area under the curve (AUC) in ten-fold cross validation. For example, Table 1 shows the optimal hyperparameter configuration of the machine learning imputation techniques in the MAR (the ratio of missing proportion 1:2) mechanism scenario with a missing proportion of 5%.

##### Traditional imputation

Mode imputation is one of the simplest imputation methods: each missing value is replaced with the mode of the observed (non-missing) values of the same variable [26]. It is generally used for non-numerical variables.
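As a minimal illustration of mode imputation, the Python sketch below fills the missing entries of one variable with its most frequent observed value. The function name, the use of `None` as the missing marker and the example categories are assumptions made for this sketch.

```python
from collections import Counter

def mode_impute(values, missing=None):
    """Replace every missing entry with the mode of the observed entries."""
    observed = [v for v in values if v is not missing]
    # most_common(1) returns the single most frequent observed value;
    # ties are resolved in favour of the value seen first.
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

# Hypothetical categorical variable with two missing entries.
filled = mode_impute(["low", "high", None, "low", None, "medium"])
```

Because every missing entry receives the same value, this method tends to shrink the variability of the imputed variable as the missing proportion grows.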

KNN was first proposed by Cover T and Hart P in 1967 [27]. KNN imputes missing values by exploiting the similarity between samples: neighboring samples are identified with a distance measure, and the missing value is estimated from the observed values of those neighbors. Specifically, the distance (Euclidean distance) between a sample with a missing value and each complete sample is calculated, the k (k = 10) nearest complete samples are found, and the median of their values is used to impute the missing value.
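The procedure can be sketched as follows in Python. This is a simplified illustration rather than the DMwR2 implementation used in the study: it assumes that only one column contains missing values (marked with `None`), that all other columns are numeric and on comparable scales, and the function and variable names are hypothetical.

```python
import math
from statistics import median

def knn_impute(rows, target, k=10):
    """Impute rows[i][target] with the median over the k nearest complete rows.

    rows   : list of equal-length lists of numbers; the target entry is
             None when missing, all other entries are assumed observed
    target : index of the column to impute
    """
    complete = [r for r in rows if r[target] is not None]
    features = [j for j in range(len(rows[0])) if j != target]

    def dist(a, b):
        # Euclidean distance over the non-target columns only.
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in features))

    imputed = []
    for r in rows:
        filled = list(r)
        if r[target] is None:
            nearest = sorted(complete, key=lambda c: dist(r, c))[:k]
            filled[target] = median(c[target] for c in nearest)
        imputed.append(filled)
    return imputed

# Toy data: the last row is missing its third value.
toy = [[0, 0, 1], [0, 1, 1], [1, 0, 2], [10, 10, 9], [0.1, 0.1, None]]
result = knn_impute(toy, target=2, k=3)
```

In the toy data the three nearest complete rows carry the values 1, 1 and 2, so the missing entry is filled with their median, 1.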

MICE, originally proposed by Boshuizen HC and Knook DL [28], is essentially a series of regression models. The missing values of each variable are predicted from the other variables in the data, and this is iterated until the estimates converge. The whole process is repeated m times, so that m different estimates are generated for each missing value and m complete data sets are formed; the m results are then combined according to certain rules to give the final imputation result. This study used predictive mean matching with 50 iterations to impute the missing data 20 times, and the average of the 20 results was taken as the final imputation value.

#### Performance evaluation and comparison

To evaluate the impact on clinical decision-making results, logistic regression models were constructed to evaluate the performance of the missing data processing techniques. The models used the patients' medical records to assess discharge, with the discharge situation (failure (n = 261) = 1, success (n = 1207) = 0) as the dependent variable and the other variables as independent variables. The imputation effects of the missing data processing methods were evaluated by calculating the sensitivity, AUC and Kappa values of the models, which all ranged between 0 (the worst) and 1 (the best). The metric values obtained on the original complete data were used as references.

The sensitivity reflects the extent to which the model covers the category of interest, that is, the proportion of patients whose discharge situation is failure who are correctly classified. It was calculated as shown in formula (1), where TP and FN denote true positives and false negatives, respectively.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}. \tag{1}$$

Clinical decision making such as discharge assessment requires the prediction model to have high sensitivity, that is, to identify as many discharge failures as possible in order to avoid serious consequences. In this study, specificity was therefore not evaluated as a separate metric, and AUC was used to reflect accuracy comprehensively by combining sensitivity and specificity. The AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) over a range of cut-off values and thus represents the trade-off between sensitivity and specificity.
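The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The Python sketch below computes AUC through this equivalence; it is an illustration with hypothetical names and toy data, not the routine used in the study.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels : list of 1 (positive, e.g. discharge failure) / 0 (negative)
    scores : predicted scores, higher meaning more likely positive
    Both classes must be present in labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 if tied, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of cases, which is acceptable for a data set of this size; rank-based formulas give the same value more efficiently.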

The AUC represents accuracy, while the Kappa represents reliability, which is used to assess the consistency between the model results and the actual results. The Kappa was calculated as shown in formula (2), where $p_{o}$ and $p_{e}$ are the observed proportion of agreement and the proportion of agreement expected by chance alone, respectively.

$$\mathrm{Kappa} = \frac{p_{o} - p_{e}}{1 - p_{e}}. \tag{2}$$
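Formulas (1) and (2) can both be evaluated from the four cells of the binary confusion matrix. The Python sketch below does this for a toy example; the function name and the data are hypothetical, and the chance agreement $p_{e}$ is computed in the usual way from the marginal totals of the confusion matrix.

```python
def sensitivity_and_kappa(y_true, y_pred):
    """Return (sensitivity, kappa) for binary labels coded 1 / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = tp + fn + fp + tn

    sensitivity = tp / (tp + fn)                      # formula (1)

    p_o = (tp + tn) / n                               # observed agreement
    # Chance agreement: product of the marginals for each class, summed.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)                   # formula (2)
    return sensitivity, kappa

# Toy example with TP = 2, FN = 1, FP = 1, TN = 4.
sens, kappa = sensitivity_and_kappa(
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1],
)
```

For the toy data, sensitivity is 2/3, $p_{o} = 0.75$, $p_{e} = 34/64$ and Kappa is 7/15.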

This study compared the performance of the imputation techniques from two aspects: the processing effects of different methods within each missing scenario, and of each method across different missing scenarios.

#### Statistical test

To evaluate whether the observed performance differences between methods under the different missing scenarios were statistically significant, the Wilcoxon signed-rank test was adopted. Because multiple methods were compared with one another, the false discovery rate (FDR) method was used to adjust the P values. The significance level was set at 0.05.

In this study, R 4.0.1 software was used for data analysis. The packages used for traditional imputation included DMwR2 and mice, while the packages used for machine learning imputation are listed in Table 1.
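The paper does not state which FDR procedure was applied. The Python sketch below assumes the Benjamini-Hochberg procedure, which is what R's `p.adjust` computes for `method = "fdr"`. The function name and the example P values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    # Visit the P values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position                  # 1-based rank in ascending order
        # Raw adjustment p * m / rank, made monotone by a running minimum.
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P values from four pairwise comparisons.
adj = bh_adjust([0.01, 0.04, 0.03, 0.20])
significant = [p < 0.05 for p in adj]
```

With these inputs only the first comparison remains significant at the 0.05 level after adjustment, although three of the four unadjusted P values are below 0.05.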