Comparison and analysis of I-score and MGR
Suppose that there is a design matrix X of dimension \(n\times p\), where n is the number of samples and p is the number of predictor variables \(\left\{ X_1,X_2,\ldots ,X_p\right\}\). In addition, every sample has an outcome variable Y. It is therefore of interest to identify the main predictor variables, as well as their interactions, based on the design matrix and the outcome variable. Here we use two influence measures, the existing I-score and our proposed MGR, separately to assess the importance of predictor variables and their interactions. Without loss of generality, we assume that the outcome variable Y is binary with values 0 and 1, and that all predictor variables are discrete. Consider the partition \(P_k\) generated by a subset of k predictor variables \(S_b=\left\{ X_{b_1},\ldots ,X_{b_k} \right\}\), \(1\le b_1<b_2<\cdots <b_k\le p\). If all predictor variables in the subset are binary, then there are \(2^k\) partition elements in \(P_k\). Let \(n_1(j)\) be the number of observations with \(Y=1\) in partition element j, where \(j\in P_k\). Let \(\bar{n}_1(j)=n_j\times \pi_1\) be the expected number of \(Y=1\) observations in element j under the null hypothesis that the subset of predictor variables has no association with Y, where \(n_j\) is the total number of observations in element j and \(\pi_1\) is the proportion of \(Y=1\) observations in the sample. The I-score of \(\left\{ X_{b_1},\ldots ,X_{b_k}\right\}\) is defined as [23]
$$I(S_b)=\sum_{j\in P_k}\left[ n_1(j)-\bar{n}_1(j)\right]^{2}.$$
(1)
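As an illustration (our own sketch, not code from the paper; the function name and interface are our assumptions), eq. (1) can be computed for discrete predictors with a few lines of NumPy:

```python
import numpy as np

def i_score(X_sub, y):
    """I-score of a subset of discrete predictors (eq. 1).
    X_sub: (n, k) array of discrete predictor values; y: (n,) binary 0/1."""
    pi1 = y.mean()                        # proportion of Y = 1 in the sample
    # each distinct row of X_sub is one partition element j of P_k
    _, inverse = np.unique(X_sub, axis=0, return_inverse=True)
    inverse = inverse.ravel()             # partition element index per sample
    score = 0.0
    for j in np.unique(inverse):
        mask = inverse == j
        n_j = mask.sum()                  # observations in partition element j
        n1_j = y[mask].sum()              # observed number of Y = 1 in element j
        score += (n1_j - n_j * pi1) ** 2  # squared deviation from n_j * pi1
    return score
```

Partition elements where the observed count of \(Y=1\) deviates from its null expectation drive the score up; a subset unrelated to Y keeps every deviation near zero.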
The Gain Ratio for a one-dimensional predictor variable X is defined as [24]:
$$\mathrm{GR}(X)=\frac{Gain(X)}{SplitInfo(X)}=\frac{Info(Y)-Info_X(Y)}{SplitInfo(X)},$$
(2)
where Gain(X) is the information gain provided by the predictor variable X in determining the outcome variable Y, and SplitInfo(X) is the self-information of the predictor variable X. Because the number of classes of a predictor variable affects the results, Quinlan divides the information gain by the self-information of the predictor variable to correct this bias [24]. In the hypothetical scenario above, we extend the one-dimensional Gain Ratio to the multivariate case, yielding the Multivariate Gain Ratio:
$$\begin{aligned} \mathrm{MGR}(S_b)&=\frac{Gain(S_b)}{SplitInfo(S_b)}\\&=\frac{Info(Y)-Info_{S_b}(Y)}{SplitInfo(S_b)}, \end{aligned}$$
(3)
where \(Info(Y)=-\pi_1\log(\pi_1)-(1-\pi_1)\log(1-\pi_1)\),
\(Info_{S_b}(Y)=-\sum_{j\in P_k}\left(\frac{n_1(j)}{n}\log\left(\frac{n_1(j)}{n_j}\right)+\frac{n_j-n_1(j)}{n}\log\left(\frac{n_j-n_1(j)}{n_j}\right)\right)\),
\(SplitInfo(S_b)=-\sum_{j\in P_k}\frac{n_j}{n}\log\left(\frac{n_j}{n}\right)\).
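The three quantities above combine into eq. (3) directly. A minimal NumPy sketch (our own illustration, not code from the paper; note that the logarithm base cancels in the ratio, and the sketch assumes the subset actually splits the sample so that SplitInfo is positive):

```python
import numpy as np

def mgr(X_sub, y):
    """Multivariate Gain Ratio of a predictor subset S_b (eq. 3).
    X_sub: (n, k) discrete predictors; y: (n,) binary 0/1 outcome."""
    n = len(y)

    def h(p):  # binary entropy, with 0 * log(0) taken as 0
        return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    info_y = h(y.mean())   # Info(Y)
    info_s = 0.0           # Info_{S_b}(Y), entropy of Y within the partition
    split = 0.0            # SplitInfo(S_b), self-information of the partition
    _, inverse = np.unique(X_sub, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    for j in np.unique(inverse):
        mask = inverse == j
        n_j = mask.sum()
        info_s += (n_j / n) * h(y[mask].mean())
        split -= (n_j / n) * np.log2(n_j / n)
    return (info_y - info_s) / split
```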
Info(Y) measures the average amount of information needed to identify the outcome variable of a case. \(Info_{S_b}(Y)\) is the average information needed to identify the outcome variable of a case after the sample is partitioned according to the subset \(S_b\). Then \(Gain(S_b)\) measures the information gained by partitioning the outcome variable Y according to \(S_b\). \(SplitInfo(S_b)\) is the self-information of the predictor variables \(S_b\), which represents the potential information generated by dividing the samples into \(2^k\) subsets. Because \(Gain(S_b)\) has a strong bias in favor of subsets \(S_b\) with many classes, we use \(SplitInfo(S_b)\) as a normalization to rectify \(Gain(S_b)\). Then \(\mathrm{MGR}(S_b)\) expresses the proportion of information generated by the split that is useful for classification. GR is an influence measure that can be used to detect an important single biomarker variable. However, a biomarker that plays an important role in disease detection only when combined with other biomarkers will be missed by the GR method. The following simulated example illustrates the advantages of MGR in detecting biomarker interactions. We generate a predictor variable matrix \(X_{200\times 10}\) by sampling uniformly at random from \(\left\{ 0,1\right\}\), where 200 is the number of samples and 10 is the number of predictor variables. The outcome variable Y is related to \(X_{200\times 10}\) via the model
$$Y=\left\{\begin{aligned} X_1\ (\mathrm{mod}\ 2)\quad &\text{with probability } 0.5\\ X_2+X_3\ (\mathrm{mod}\ 2)\quad &\text{with probability } 0.5. \end{aligned}\right.$$
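The simulated setting can be reproduced with a short script (our own sketch; the seed and the 0-based column indexing, under which \(X_1\) is column 0, are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)         # fixed seed, our own choice
n, p = 200, 10
X = rng.integers(0, 2, size=(n, p))    # uniform binary predictors

# with probability 0.5, Y = X1 (mod 2); otherwise Y = X2 + X3 (mod 2)
use_x1 = rng.random(n) < 0.5
Y = np.where(use_x1, X[:, 0] % 2, (X[:, 1] + X[:, 2]) % 2)
```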
Table 1 lists the GR values of the ten predictor variables from high to low. GR cannot select important single variables such as \(X_2\). However, under the simulated setting, \(X_2\) is an important variable in determining Y when combined with \(X_3\). The MGR of the interaction \(\left\{ X_2,X_3 \right\}\) is \(84.4\times 10^{-3}\), which shows that \(\left\{ X_2,X_3 \right\}\) can be detected by MGR. MGR does not require one to specify a model for the joint effect of \(\left\{ X_{b_1},\ldots ,X_{b_k} \right\}\) on Y; it is designed to capture the information provided by \(\left\{ X_{b_1},\ldots ,X_{b_k} \right\}\) for determining Y. This property makes MGR a useful measure. We carry out two experiments to explore the nature of I-score and MGR. In the first experiment, we use the Leukemia Dataset [20] to explore what the two measures have in common. We repeatedly sample p variable(s) from the 3571 predictor variables 500 times (\(p=1,2,\ldots ,9\)). For example, when \(p=1\), we sample one variable as an interaction from the 3571 predictor variables 500 times. We record the I-score and MGR of each interaction as a numerical pair, which gives the scatter plot of values for clusters with one variable shown in Fig. 1a.
According to the experimental results, I-score and MGR are consistent in their growth behavior. Next, we explore the differences between the two influence measures. As Fig. 1a–i shows, the correlation between I-score and MGR varies with the number of variables in an interaction. If we want to compare the importance of interactions with different numbers of variables, the influence measure must be unbiased with respect to the number of variables, so that its value reflects the importance of the interaction itself rather than interference from the number of variables. We therefore characterize how I-score and MGR each vary with the number of variables in an interaction. Each time, we randomly generate an interaction containing p predictor variable(s) and an outcome variable Y, \(p=1,2,\ldots ,9\). Under this setting, the outcome variable is independent of every predictor variable interaction. For a fixed variable number p, we generate a predictor variable matrix \(X_{10{,}000\times p}\) of dimension \(10{,}000\times p\) and an outcome variable of dimension \(10{,}000\times 1\), both by sampling uniformly at random from \(\left\{ 0,1\right\}\). We then record the corresponding I-score and MGR of the interaction. We repeat the whole process 20,000 times and draw the boxplot over the 20,000 interactions for every fixed p. The resulting trends of the two influence measures with the number of variables are shown in Fig. 2.
We can read the trends of I-score and MGR with the number of variables in an interaction from the boxplots in Fig. 2. Because every predictor variable and the outcome variable are generated randomly and independently, interactions with different predictor variables should be independent of the outcome variable. Therefore, the two influence measures should not vary significantly with the number of predictor variables in an interaction; this is the basis for using them to evaluate the importance of interactions with different numbers of variables. However, Fig. 2 shows that both influence measures increase with the number of variables, although their specific growth patterns differ. When the number of variables is between 1 and 9, Fig. 2a shows that I-score varies with the number of variables in a logarithmic form, while MGR follows an exponential form, as shown in Fig. 2b. As the number of variables increases, I-score first increases rapidly, then its rate of increase slows down, and finally its values stabilize; the growth behavior of MGR is the opposite. When a key interaction in the dataset involves a relatively small number of predictor variables, the key interaction selected based on I-score tends to contain additional unimportant variables when carrying out BDA. We therefore conclude that MGR performs better for interactions with a relatively small number of variables, while I-score is more suitable for interactions with more variables. Based on the diagram, the results of MGR are more credible than those of I-score when the number of predictor variables in an interaction is less than five.
Algorithms
In the first part of this section, we introduce BDA, a repeated-sampling algorithm used to find the interactions. BDA is a greedy algorithm that iterates over all the variables. Each time, an initial subset of predictor variables is selected; variables are then deleted one by one, and the subset with the maximum influence measure is found. The details are as follows.

(1) Training set: Consider a training set \(\{(y_1,x_1),\ldots ,(y_n,x_n)\}\), where \(x_i=(x_{1i},\ldots ,x_{pi})\) has dimension p and n is the number of samples. Typically p is very large. The outcome variable and all predictor variables are discrete.

(2) Sampling for the initial subset: Select k predictor variables \(S_b=\left\{ X_{b_1},\ldots ,X_{b_k} \right\}\), \(b=1,\ldots ,B\), as the initial subset. B is the number of repetitions, which will be introduced in step 3.

(3) Compute the influence measure of the initial subset.

(4) Drop variables: Tentatively drop each variable in \(S_b\) in turn and recalculate the influence measure of the subset with one variable fewer. Then permanently drop the variable whose removal gives the highest influence measure. Keep the resulting retained pending combination as \(S'_b\).

(5) Return set: Continue the next round of dropping on \(S'_b\) until only one variable is left. We now have k retained pending combinations, with variable numbers ranging from k down to one. Keep the subset that yields the highest influence measure among the retained pending combinations. Refer to this subset as the return set \(R_b\) and keep it for future use.
For ease of understanding, we construct an example, shown in Fig. 3, to illustrate how BDA is conducted in detail. In this example, we tentatively drop one biomarker from \(\left\{ 1,2,3,4,5\right\}\) in turn and record the influence measure of the interaction formed by the four remaining biomarkers. The interaction with the largest influence measure is put into the retained pending combination pool, and we then drop each of its biomarkers in turn. The above process is repeated until only one biomarker is left. Finally, we choose the retained pending combination with the largest influence measure as the return set. When the number of predictor variables is small enough, we can use BDA directly with an influence measure such as I-score or MGR to find interactions. If there are tens of thousands of predictor variables, some preparation is needed before applying BDA. We propose an algorithm to select important interactions and to construct the classifier for disease detection. The algorithm consists of five steps, as shown in Fig. 4.
Step 1: Preprocessing
As the amount of biological data increases, searching for more efficient computing algorithms becomes an indispensable goal. We therefore apply a preprocessing algorithm to preliminarily select valuable predictor variables, and we propose a preprocessing method in this paper that saves the time of finding an appropriate one. Preprocessing mainly includes four sub-steps: data cleaning, preliminary dimension reduction, data discretization, and judgement of the candidate method.

(1) Data cleaning: The original dataset is transformed to dimension \(n\times p\), in which n stands for the number of samples and p is the number of predictor variables. The sample status, represented by the variable Y, also needs to be prepared.

(2) Preliminary dimension reduction: In this sub-step, we need to determine whether every predictor variable is meaningful in determining the variable Y. There are many methods, such as the t-test and Fold-change [25], for selecting variables preliminarily; we choose the appropriate method for the situation at hand. Filtering the biomarkers preliminarily is beneficial for the later stages of our procedure.

(3) Data discretization: There are many discretization methods, such as the equal-width method, the equal-frequency method, and clustering methods [26]. The most appropriate discretization method varies with the background of the data.

(4) Judgement of the candidate method: We consult the literature, based on the background of the dataset, to identify candidate methods for preliminary dimension reduction and data discretization, and then run pre-experiments to determine the best candidate. For example, after dimension reduction and discretization, we carry out important-interaction extraction by BDA based on one hundred randomly chosen predictor variables with discrete values. If the accuracy of the classifier constructed by steps 4 and 5 based on these selected important interactions is reasonable, we carry out experiments on the whole dataset with the candidate preprocessing method.
Our proposed preprocessing method saves computation time and improves efficiency, which benefits steps 2 and 3. If the number of predictor variables in the dataset is small enough, step 2 can be omitted and the important interactions obtained directly with BDA.
Step 2: Dimension reduction by interacted triples
When the number of predictor variables is in the thousands or even larger after preprocessing, we must screen the predictor variables before applying BDA, taking interaction effects into consideration during the screening. We realize the dimension reduction by 3-way interactions (triples) [23]. This step consists of two sub-steps: obtaining the top 3-way interactions and obtaining the top variables.

(1) Determination of top 3-way interactions: First, sort the triples from high to low according to their influence measure; then pick out every thousandth triple from the beginning of the ordered sequence. Typically, the second difference of the sequence formed by these thousandth triples' scores declines sharply in the beginning and stabilizes around zero later. We choose the cutoff for the top 3-way interactions at the point where this second difference first comes close to zero, and the top thousands of triples are retained for further analysis.

(2) Determination of top variables: After the above sub-step, each variable in the high-scored triples may occur in multiple triples. Because high-frequency predictor variables are more likely to form influential predictor variable interactions, we select the important predictor variables by their retention frequency in the top triples. We order the retention frequency of every predictor variable in the high-scored triples from high to low. The first difference of the retention frequency sequence usually shows big drops in the beginning and stabilizes around zero later. We choose the cutoff value at the point where the first difference of the retention frequency sequence stabilizes around zero.
We thus obtain key predictor variables that have high potential to form important interactions when combined with each other, and the dimension of the predictor variables decreases significantly from thousands to hundreds or dozens, which is beneficial for the next step.
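The frequency-based cutoff of sub-step (2) can be sketched as follows (our own illustration; the tolerance `freq_tol` and the exact cutoff convention are assumptions, since the paper selects the cutoff by inspecting where the first difference stabilizes):

```python
import numpy as np
from collections import Counter

def top_variables(top_triples, freq_tol=0):
    """Select top variables from high-scored triples by retention frequency.
    top_triples: iterable of (v1, v2, v3) variable-index triples, already
    restricted to the retained top 3-way interactions."""
    counts = Counter(v for triple in top_triples for v in triple)
    vars_sorted = [v for v, _ in counts.most_common()]   # frequency high to low
    freqs = np.array([counts[v] for v in vars_sorted])
    diffs = np.diff(freqs)                # first differences (non-positive)
    flat = np.where(np.abs(diffs) <= freq_tol)[0]
    # cut where the frequency curve first flattens; keep the variables before it
    cut = int(flat[0]) if len(flat) else len(vars_sorted)
    return vars_sorted[:cut]
```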
Step 3: Generation of interactions based on BDA
In this step, we apply BDA to the variables selected in step 2. Two quantities must be determined before applying BDA: the size of the initial subset and the number of repetitions. The size of the initial subset can be calculated by Poisson approximation [23]. The minimum requirement is met if the initial size k satisfies
$$\frac{n^2}{2m_{k-1}}\ge 1,$$
(4)
where n is the number of training samples, and \(m_{k-1}\) is the number of partition elements induced by a subset with \(k-1\) variables.
The approximation is adequate if the average number of observations per partition element is at least 4 [27], so we can obtain the lower bound of the initial size; in practice, any initial size between the upper and lower bounds can be used. The number of repetitions in BDA is the number of variable subsets subjected to backward dropping. Suppose p is the number of predictor variables and k is the initial size. If we want to cover all interactions with z variables (\(z\le k\)), the expected number of repetitions is
$$\hat{B}\approx \left[\binom{p}{z}\Big/\binom{k}{z}\right]\log \binom{p}{z}.$$
(5)
Finally, \(2\hat{B}\) can be used as an upper bound that covers all key predictor variables [23]. We then obtain return sets by applying BDA combined with each of the two influence measures separately, and the return sets undergo two filtering procedures. The first procedure filters out return sets with overlapping variables, keeping only one of the return sets that share common variables: sort the return sets in decreasing order of influence measure and remove those having variables in common with a higher-scored one. The return sets remaining after removing the overlapping ones are subjected to a forward-adding algorithm to remove false positives [23]. To use the interactions selected from the above steps to determine the status of Y, we apply the following steps to construct the final classifier.
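Under the binary-predictor assumption (\(m_{k-1}=2^{k-1}\)), eqs. (4) and (5) translate into two small helpers (our own sketch; the function names and the `levels` generalization are assumptions):

```python
import math

def max_k_poisson(n, levels=2):
    """Largest initial size k still satisfying n^2 / (2 * m_{k-1}) >= 1
    (eq. 4), where m_{k-1} = levels**(k-1) partition elements."""
    k = 1
    while n ** 2 / (2 * levels ** k) >= 1:  # condition for size k + 1
        k += 1
    return k

def expected_repetitions(p, k, z):
    """B-hat of eq. (5): expected number of repetitions so that every
    z-variable interaction is covered by some initial subset of size k."""
    return (math.comb(p, z) / math.comb(k, z)) * math.log(math.comb(p, z))
```

As in the text, doubling the value returned by `expected_repetitions` gives a practical upper bound on the number of BDA runs.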
Step 4: Construction of the subclassifier
We choose the Ridge Regression method with cross-validation to construct a sub-classifier based on each interaction generated above. We include all product terms between variables from one interaction as the joint action of the variables to construct each Ridge Regression sub-classifier. Ridge Regression is appropriate because it alleviates the impact of collinearity between predictor variables in one interaction.
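A minimal version of one sub-classifier (our own sketch: a closed-form ridge fit with a fixed penalty `alpha` and a 0.5 decision threshold, whereas the paper selects the penalty by cross-validation):

```python
import numpy as np
from itertools import combinations

def interaction_features(X, interaction):
    """Columns of one interaction plus all products of 2..k of its variables,
    with an intercept column prepended."""
    cols = [X[:, v].astype(float) for v in interaction]
    feats = list(cols)
    for r in range(2, len(interaction) + 1):
        for combo in combinations(cols, r):
            feats.append(np.prod(combo, axis=0))  # joint-action product term
    return np.column_stack([np.ones(X.shape[0])] + feats)

def fit_ridge_subclassifier(X, y, interaction, alpha=1.0):
    """Closed-form ridge fit; returns a predictor thresholding at 0.5."""
    Z = interaction_features(X, interaction)
    w = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ y)
    return lambda Xnew: (interaction_features(Xnew, interaction) @ w >= 0.5).astype(int)
```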
Step 5: Construction of the final classifier based on Boosting
The sub-classifiers are combined to construct the final classifier by a Boosting algorithm [23], which can greatly improve the performance of the final classifier.