# Biomarker interaction selection and disease detection based on multivariate gain ratio – BMC Bioinformatics

May 12, 2022

### Comparison and analysis of I-score and MGR

Suppose there is a design matrix $X$ of dimension $n\times p$, where $n$ is the number of samples and $p$ is the number of predictor variables $\{X_1,X_2,\ldots,X_p\}$. In addition, every sample has an outcome variable $Y$. It is of interest to identify the main predictor variables, as well as their interactions, based on the design matrix and the outcome variable. Here we use two influence measures, the existing I-score and our proposed MGR, to assess the importance of predictor variables and their interactions. Without loss of generality, we assume that the outcome variable $Y$ is binary with values 0 and 1, and that all predictor variables are discrete. Consider the partition $P_k$ generated by a subset of $k$ predictor variables $S_b=\{X_{b_1},\ldots,X_{b_k}\}$, $1\le b_1\le b_2\le\cdots\le b_k\le p$. If all predictor variables in the subset are binary, there are $2^k$ partition elements in $P_k$. Let $n_1(j)$ be the number of observations with $Y=1$ in partition element $j$, where $j\in P_k$. Let $\bar{n}_1(j)=n_j\times\pi_1$ be the expected number of $Y=1$ observations in element $j$ under the null hypothesis that the subset of predictor variables has no association with $Y$, where $n_j$ is the total number of observations in element $j$ and $\pi_1$ is the proportion of $Y=1$ observations in the sample. The I-score of $\{X_{b_1},\ldots,X_{b_k}\}$ is defined as [23]

$$I(S_b)=\sum_{j\in P_k}\left[n_1(j)-\bar{n}_1(j)\right]^{2}.\tag{1}$$
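As a concrete illustration, Eq. (1) can be computed for a discrete predictor subset as follows (a minimal sketch; the function name is our own):

```python
import numpy as np

def i_score(X_sub, y):
    """I-score of a subset of discrete predictors (Eq. 1).

    X_sub : (n, k) array; each distinct row is one partition element of P_k.
    y     : (n,) binary outcome with values 0/1.
    """
    X_sub, y = np.asarray(X_sub), np.asarray(y)
    pi_1 = y.mean()                       # proportion of Y = 1 in the sample
    # Group observations by their (X_{b_1}, ..., X_{b_k}) pattern.
    _, inverse = np.unique(X_sub, axis=0, return_inverse=True)
    score = 0.0
    for j in range(inverse.max() + 1):
        in_j = inverse == j
        n1_j = y[in_j].sum()              # observed count of Y = 1 in element j
        n1_bar = in_j.sum() * pi_1        # expected count under the null
        score += (n1_j - n1_bar) ** 2
    return float(score)
```

A subset that separates the classes well yields a large score, while a subset independent of $Y$ stays near zero.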

The Gain Ratio of a one-dimensional predictor variable $X$ is defined as [24]:

$$\mathrm{GR}(X)=\frac{Gain(X)}{SplitInfo(X)}=\frac{Info(Y)-Info_X(Y)}{SplitInfo(X)},\tag{2}$$

where $Gain(X)$ is the information gain provided by the predictor variable $X$ in determining the outcome variable $Y$, and $SplitInfo(X)$ is the self-information of the predictor variable $X$. Because the number of categories of a predictor variable affects the result, Quinlan divides the information gain by the self-information of the predictor variable to correct this bias [24]. In the setting described above, we extend the one-dimensional Gain Ratio to the multi-dimensional case, the Multivariate Gain Ratio:

$$\begin{aligned} \mathrm{MGR}(S_b)&=\frac{Gain(S_b)}{SplitInfo(S_b)}\\ &=\frac{Info(Y)-Info_{S_b}(Y)}{SplitInfo(S_b)}, \end{aligned}\tag{3}$$

where $Info(Y)=-\pi_1\log(\pi_1)-(1-\pi_1)\log(1-\pi_1)$,

$$Info_{S_b}(Y)=-\sum_{j\in P_k}\left(\frac{n_1(j)}{n}\log\frac{n_1(j)}{n_j}+\frac{n_j-n_1(j)}{n}\log\frac{n_j-n_1(j)}{n_j}\right),$$

$$SplitInfo(S_b)=-\sum_{j\in P_k}\frac{n_j}{n}\log\frac{n_j}{n}.$$

$Info(Y)$ measures the average amount of information needed to identify the outcome variable of a case. $Info_{S_b}(Y)$ is the average information needed to identify the outcome variable of a case when the sample is partitioned according to the subset $S_b$. Then $Gain(S_b)$ measures the information gained by partitioning according to $S_b$. $SplitInfo(S_b)$ is the self-information of the predictor subset $S_b$, which represents the potential information generated by dividing the samples into $2^k$ subsets. Because $Gain(S_b)$ has a strong bias in favor of subsets with many categories, we use $SplitInfo(S_b)$ as a normalization to rectify $Gain(S_b)$. $\mathrm{MGR}(S_b)$ thus expresses the proportion of the information generated by the split that is useful for classification. GR is an influence measure that can detect an important single biomarker variable. However, a biomarker that plays an important role in disease detection only when combined with other biomarkers will be missed by the GR method.

The following simulated example illustrates the advantage of MGR in detecting biomarker interactions. We generate a predictor variable matrix $X_{200\times 10}$ by sampling uniformly at random from $\{0,1\}$, where 200 is the number of samples and 10 is the number of predictor variables. The outcome variable $Y$ is related to $X_{200\times 10}$ via the model

$$Y=\left\{\begin{array}{ll} X_1\ (\mathrm{mod}\ 2) & \text{with probability } 0.5,\\ X_2+X_3\ (\mathrm{mod}\ 2) & \text{with probability } 0.5. \end{array}\right.$$
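A minimal sketch of the MGR of Eq. (3) applied to this simulated model (base-2 logarithms, the random seed, and the helper names are our own choices):

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a vector of counts, ignoring zeros."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def mgr(X_sub, y):
    """Multivariate Gain Ratio of a discrete predictor subset (Eq. 3)."""
    X_sub, y = np.asarray(X_sub), np.asarray(y)
    n = len(y)
    info_y = entropy(np.bincount(y))                       # Info(Y)
    _, inverse, n_j = np.unique(X_sub, axis=0,
                                return_inverse=True, return_counts=True)
    # Info_{S_b}(Y): class entropy inside each partition element, weighted by n_j / n
    info_sb = sum(n_j[j] / n * entropy(np.bincount(y[inverse == j]))
                  for j in range(len(n_j)))
    split_info = entropy(n_j)                              # SplitInfo(S_b)
    return (info_y - info_sb) / split_info

# The simulated model: Y = X_1 (mod 2) or X_2 + X_3 (mod 2), each w.p. 0.5.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
coin = rng.random(200) < 0.5
Y = np.where(coin, X[:, 0], (X[:, 1] + X[:, 2]) % 2)

print(mgr(X[:, [1]], Y))     # X_2 alone: close to zero
print(mgr(X[:, [1, 2]], Y))  # the interaction {X_2, X_3}: clearly larger
```

In such a run the pair scores well above the single variable, consistent with the value reported below for $\{X_2,X_3\}$; the exact number depends on the realized sample.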

Table 1 lists the GR values of the ten predictor variables in descending order. GR fails to select individually important variables such as $X_2$, even though, by the simulation setting, $X_2$ is important in determining $Y$ when combined with $X_3$. The MGR of the interaction $\{X_2,X_3\}$ is $84.4\times 10^{-3}$, which shows that $\{X_2,X_3\}$ can be detected by MGR. MGR does not require one to specify a model for the joint effect of $\{X_{b_1},\ldots,X_{b_k}\}$ on $Y$; it is designed to capture the information that $\{X_{b_1},\ldots,X_{b_k}\}$ provides for determining $Y$. This property makes MGR a useful measure.

We carry out two experiments to explore the nature of I-score and MGR. In the first experiment, we use the Leukemia Dataset [20] to explore whether I-score and MGR have something in common. We repeatedly sample $p$ variable(s) from the 3571 predictor variables 500 times ($p=1,2,\ldots,9$). For example, when $p=1$, we sample one variable as an interaction from the 3571 predictor variables 500 times. We record the I-score and MGR of each sampled interaction as a numerical pair. The resulting scatter plot for interactions with one variable is shown in Fig. 1a.

According to the experimental results, I-score and MGR are consistent in their growth behavior. Next, we explore the difference between the two influence measures. According to Fig. 1a–i, the correlation between I-score and MGR varies with the number of variables in the interaction. If we want to compare the importance of interactions with different numbers of variables, the influence measure must be unbiased with respect to the number of variables; only then does its value reflect the importance of an interaction rather than interference from the number of variables. We therefore examine how I-score and MGR each vary with the number of variables in the interaction. Each time, we randomly generate an interaction containing $p$ predictor variable(s) and an outcome variable $Y$, $p=1,2,\ldots,9$; under this setting, the outcome variable is independent of every predictor variable interaction. For a fixed number of variables $p$, we generate a predictor variable matrix $X_{10{,}000\times p}$ of dimension $10{,}000\times p$ and an outcome variable of dimension $10{,}000\times 1$, both by sampling uniformly at random from $\{0,1\}$, and record the corresponding I-score and MGR of the interaction. We repeat the whole process 20,000 times and draw a boxplot over the 20,000 interactions for each fixed $p$. The results, showing how the two influence measures vary with the number of variables, are given in Fig. 2.

The boxplots in Fig. 2 show the trend of I-score and MGR as the number of variables in the interaction grows. Because every predictor variable and the outcome variable are generated randomly and independently, interactions with different predictor variables are all independent of the outcome variable. Therefore, an influence measure should not vary significantly with the number of predictor variables in the interaction; this is the basis for using it to compare the importance of interactions with different numbers of variables. However, Fig. 2 shows that both influence measures increase with the number of variables, although their growth patterns differ. For between 1 and 9 variables, Fig. 2a shows that I-score grows roughly logarithmically in the number of variables, while Fig. 2b shows that MGR grows roughly exponentially. As the number of variables increases, I-score at first increases rapidly, then its rate of increase slows, and finally its values stabilize; the behavior of MGR is the opposite. When a dataset contains a key interaction with a relatively small number of predictor variables, the key interaction selected by I-score during BDA tends to contain additional unimportant variables. We therefore conclude that MGR performs better for interactions with relatively few variables, while I-score is more suitable for interactions with more variables. Based on the diagram, the results of MGR are more credible than those of I-score when the number of predictor variables in the interaction is less than five.

### Algorithms

In the first part of this section, we introduce BDA, a repeated-sampling algorithm used to find interactions. BDA is a greedy algorithm that iterates over the variables: each time, an initial predictor variable subset is selected, variables are then deleted one by one, and finally the subset with the maximum influence measure is returned. The details are as follows.

1. Training set: Consider a training set $\{(y_1,x_1),\ldots,(y_n,x_n)\}$, where $x_i=(x_{1i},\ldots,x_{pi})$ has dimension $p$ and $n$ is the number of samples. Typically $p$ is very large. The outcome variable and all predictor variables are discrete.

2. Sampling for the initial subset: Select $k$ predictor variables $S_b=\{X_{b_1},\ldots,X_{b_k}\}$, $b=1,\ldots,B$, as the initial subset. $B$ is the number of repetitions, which will be introduced in step 3.

3. Compute the influence measure of the initial subset.

4. Drop variables: Tentatively drop each variable in $S_b$ in turn and recalculate the influence measure of the subset with one variable fewer. Then permanently drop the variable whose removal gives the highest influence measure, and keep the resulting retained pending combination as $S'_b$.

5. Return set: Continue the next round of dropping on $S'_b$ until only one variable is left. We then have $k$ retained pending combinations, with the number of variables ranging from $k$ down to one. Keep the subset that yields the highest influence measure among the retained pending combinations; refer to it as the return set $R_b$ and keep it for future use.
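The dropping procedure above can be sketched as follows, using the I-score of Eq. (1) as the influence measure (the function names are our own):

```python
import numpy as np

def i_score(X_sub, y):
    """I-score of a discrete predictor subset (Eq. 1)."""
    X_sub, y = np.asarray(X_sub), np.asarray(y)
    pi_1 = y.mean()
    _, inverse = np.unique(X_sub, axis=0, return_inverse=True)
    return float(sum((y[inverse == j].sum() - (inverse == j).sum() * pi_1) ** 2
                     for j in range(inverse.max() + 1)))

def backward_dropping(X, y, init_vars, influence=i_score):
    """BDA on one initial subset: drop variables one by one and keep the
    retained pending combination with the highest influence measure."""
    current = list(init_vars)
    best_set, best_score = current[:], influence(X[:, current], y)
    while len(current) > 1:
        # Tentatively drop each variable; commit the drop with the best score.
        score, drop = max((influence(X[:, [v for v in current if v != d]], y), d)
                          for d in current)
        current = [v for v in current if v != drop]
        if score > best_score:
            best_set, best_score = current[:], score
    return best_set, best_score
```

For example, on data where $Y$ equals $X_1+X_2 \pmod 2$ exactly, starting from an initial subset that also contains a noise variable, the return set is $\{X_1,X_2\}$.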

For ease of understanding, we construct an example, shown in Fig. 3, to illustrate how BDA is conducted. In this example, we tentatively drop one biomarker from $\{1,2,3,4,5\}$ in turn and record the influence measure of the interaction formed by the four remaining biomarkers. The interaction with the largest influence measure is put into the retained pending combination pool, and we then drop each of its biomarkers in turn. This process is repeated until only one biomarker is left. Finally, we choose the retained pending combination with the largest influence measure as the return set.

When the number of predictor variables is small enough, we can apply BDA directly with an influence measure such as I-score or MGR to find interactions. When the predictor variables number in the tens of thousands, however, some preparation is needed before BDA can be used. We propose an algorithm that selects important interactions and constructs a classifier for disease detection; it consists of five steps, as shown in Fig. 4.

#### Step 1: Preprocessing

As the amount of biological data grows, more efficient computing algorithms become an indispensable goal. We therefore apply a preprocessing algorithm to preliminarily select valuable predictor variables, and we propose a preprocessing method in this paper that saves the time of finding an appropriate one. Preprocessing consists of four sub-steps: data cleaning, preliminary dimension reduction, data discretization, and judgement of candidate methods.

1. Data cleaning: The original dataset is transformed to $n\times p$ dimension, where $n$ stands for the number of samples and $p$ is the number of predictor variables. The sample status, represented by the variable $Y$, also needs to be prepared.

2. Preliminary dimension reduction: In this sub-step, we check whether every predictor variable is meaningful for determining $Y$. There are many methods, such as the t-test and fold-change [25], for selecting variables preliminarily; we choose the method appropriate to the situation. Filtering the biomarkers preliminarily is beneficial for the later stages of our program.

3. Data discretization: There are many discretization methods, such as the equal-width method, the equal-frequency method, and clustering methods [26]. The most appropriate discretization method varies with the background of the dataset.

4. Judgement of candidate methods: Based on the background of the dataset, we consult the literature to identify candidate methods for preliminary dimension reduction and data discretization, and then run pre-experiments to determine the best candidate. For example, after dimension reduction and discretization, we extract important interactions with BDA from one hundred randomly chosen predictor variables with discrete values. If the accuracy of the classifier constructed by steps 4 and 5 from these selected interactions is reasonable, we run the experiments on the whole dataset with the candidate preprocessing method.

Our proposed preprocessing method saves computation time and improves efficiency, which benefits steps 2 and 3. If the number of predictor variables in the dataset is small enough, step 2 can be omitted and the important interactions obtained directly with BDA.

#### Step 2: Dimension reduction by interacted triples

When the number of predictor variables is in the thousands or even larger after preprocessing, we must screen the predictor variables before applying BDA, taking interaction effects into consideration during screening. We realize the dimension reduction through 3-way interactions (triples) [23]. This step consists of two sub-steps: obtaining the top 3-way interactions and then the top variables.

1. Determination of top 3-way interactions: First, sort the triples from high to low according to their influence measure. Then pick out every 1000th triple from the beginning of the ordered sequence. Typically, the second difference of the sequence of every-1000th-triple scores declines sharply at first and stabilizes around zero later. We choose the cut-off for the top 3-way interactions at the point where the second difference first comes near zero, and the top thousands of triples are retained for further analysis.

2. Determination of top variables: After the above sub-step, each variable in the high-scored triples may occur in multiple triples. Because high-frequency predictor variables are more likely to form influential interactions, we select important predictor variables by their retention frequency in the top triples. We order the retention frequencies of the predictor variables in the high-scored triples from high to low; the first difference of this frequency sequence usually drops sharply at first and stabilizes around zero later. We choose the cut-off value where the first difference of the retention frequency sequence stabilizes around zero.

We thereby obtain key predictor variables with high potential to form important interactions when combined with each other, and the dimension of the predictor variables decreases significantly, from thousands to hundreds or dozens, which is beneficial for the next step.
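Both cut-offs above (second differences of the every-1000th triple scores, first differences of the retention frequencies) follow the same pattern and can be sketched generically; the tolerance `tol` is our own assumption:

```python
import numpy as np

def first_stable_index(scores, order=1, tol=1e-3):
    """First position where the order-th difference of a descending score
    sequence comes within `tol` of zero; items before it are retained.
    """
    d = np.diff(np.asarray(scores, dtype=float), n=order)
    stable = np.flatnonzero(np.abs(d) <= tol)
    return int(stable[0]) + order if stable.size else len(scores)
```

One would apply this with `order=2` to the thinned triple scores in sub-step 1 and with `order=1` to the retention frequencies in sub-step 2.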

#### Step 3: Generation of interactions based on BDA

In this step, we run BDA on the variables selected in step 2. Two quantities must be determined before applying BDA: the size of the initial subset and the number of repetitions. The size of the initial subset can be obtained by a Poisson approximation [23]: the minimum requirement is met if the initial size $k$ satisfies

$$\frac{n^2}{2m_{k-1}}\ge 1,\tag{4}$$

where $n$ is the number of training samples, and $m_{k-1}$ is the number of partition elements induced by a subset with $k-1$ variables.
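For binary predictors, $m_{k-1}=2^{k-1}$, so the condition in Eq. (4) is easy to check (a sketch under that assumption; the function names are our own):

```python
def meets_poisson_bound(n, k, levels=2):
    """Eq. (4): n^2 / (2 * m_{k-1}) >= 1, where m_{k-1} = levels**(k - 1)
    is the number of partition elements induced by k - 1 predictors that
    each take `levels` values."""
    return n * n / (2 * levels ** (k - 1)) >= 1

def largest_initial_size(n, levels=2):
    """Largest initial size k that still satisfies Eq. (4)."""
    k = 1
    while meets_poisson_bound(n, k + 1, levels):
        k += 1
    return k
```

For instance, with $n=200$ binary training samples, Eq. (4) admits initial sizes up to $k=15$.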

The approximation is adequate if the average number of observations per partition element is at least 4 [27], which gives the lower bound of the initial size; in practice, any initial size between the upper and lower bounds can be used. The number of repetitions in BDA is the number of variable subsets subjected to backward dropping. Let $p$ be the number of predictor variables and $k$ the initial size. To cover all interactions with $z$ variables ($z\le k$), it is shown that we are expected to need

$$\hat{B}\approx\left[\binom{p}{z}\Big/\binom{k}{z}\right]\log\binom{p}{z}.\tag{5}$$

Finally, $2\hat{B}$ can be used as an upper bound that covers all key predictor variables [23]. We then obtain return sets by applying BDA with the two influence measures separately, and the return sets undergo two filtering procedures. The first filters out return sets with overlapping variables, keeping only one of the return sets that share common variables: sort the return sets in decreasing order of influence measure and remove any set having variables in common with a higher-scored one. The return sets that remain after removing overlaps are then subjected to a forward adding algorithm to remove false positives [23]. To use the interactions selected in the steps above to determine the status of $Y$, we apply the following steps to construct the final classifier.
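The repetition count of Eq. (5) and the $2\hat{B}$ upper bound can be sketched as follows (natural logarithm assumed; the function names are our own):

```python
from math import comb, log, ceil

def num_repetitions(p, k, z):
    """B-hat of Eq. (5): expected number of initial subsets of size k needed
    so that the sampled subsets cover all z-variable interactions among p."""
    c_pz = comb(p, z)
    return c_pz / comb(k, z) * log(c_pz)

def repetition_upper_bound(p, k, z):
    """2 * B-hat, used as an upper bound covering all key variables [23]."""
    return ceil(2 * num_repetitions(p, k, z))
```

For example, with $p=100$ variables, initial size $k=8$, and target interactions of $z=3$ variables, $\hat{B}$ is on the order of a few tens of thousands.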

#### Step 4: Construction of the sub-classifier

We choose the Ridge Regression method with cross-validation to construct a sub-classifier from each interaction generated above. We include all product terms between the variables of one interaction, representing their joint action, in each Ridge Regression sub-classifier. Ridge Regression is appropriate because it alleviates the impact of collinearity between the predictor variables within an interaction.
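One way to sketch such a sub-classifier in plain NumPy (closed-form ridge; the candidate alphas, fold count, and 0.5 decision threshold are our own choices):

```python
import numpy as np
from itertools import combinations

def interaction_features(X):
    """The variables of one interaction plus all product terms among them."""
    X = np.asarray(X, dtype=float)
    cols = [X[:, j] for j in range(X.shape[1])]
    for r in range(2, X.shape[1] + 1):
        for idx in combinations(range(X.shape[1]), r):
            cols.append(X[:, list(idx)].prod(axis=1))
    return np.column_stack(cols)

def ridge_fit(F, y, alpha):
    """Closed-form ridge coefficients; the intercept is left unpenalized."""
    F1 = np.hstack([np.ones((len(F), 1)), F])
    P = np.eye(F1.shape[1]); P[0, 0] = 0.0
    return np.linalg.solve(F1.T @ F1 + alpha * P, F1.T @ y)

def ridge_predict(F, w):
    F1 = np.hstack([np.ones((len(F), 1)), F])
    return (F1 @ w >= 0.5).astype(int)   # threshold the 0/1 regression at 0.5

def sub_classifier(X, y, alphas=(0.01, 0.1, 1.0, 10.0), folds=5, seed=0):
    """Fit one ridge sub-classifier, choosing alpha by K-fold cross-validation."""
    F, y = interaction_features(X), np.asarray(y)
    chunks = np.array_split(np.random.default_rng(seed).permutation(len(y)), folds)
    def cv_acc(alpha):
        accs = []
        for i in range(folds):
            train = np.concatenate([c for j, c in enumerate(chunks) if j != i])
            w = ridge_fit(F[train], y[train], alpha)
            accs.append((ridge_predict(F[chunks[i]], w) == y[chunks[i]]).mean())
        return np.mean(accs)
    best = max(alphas, key=cv_acc)
    return ridge_fit(F, y, best), best
```

Because all product terms are included, even a purely interactive effect such as $Y=X_1+X_2 \pmod 2$ becomes linear in the feature space and is captured by the sub-classifier.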

#### Step 5: Construction of the final classifier based on Boosting

The sub-classifiers are combined into the final classifier by a Boosting algorithm [23], which can greatly improve the performance of the final classifier.
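A generic AdaBoost-style sketch of this combination step (the reweighting details in [23] may differ; all names here are our own):

```python
import numpy as np

def boost_weights(sub_preds, y):
    """AdaBoost-style weights for pre-trained sub-classifiers.

    sub_preds : (m, n) 0/1 predictions of the m sub-classifiers.
    y         : (n,) true 0/1 labels.
    """
    sub_preds, y = np.asarray(sub_preds), np.asarray(y)
    m, n = sub_preds.shape
    w = np.full(n, 1.0 / n)                 # sample weights
    alphas = np.zeros(m)
    used = []
    for _ in range(m):
        errs = np.array([(w * (p != y)).sum() for p in sub_preds])
        errs[used] = np.inf                 # pick each sub-classifier at most once
        t = int(np.argmin(errs))
        e = min(max(float(errs[t]), 1e-10), 1 - 1e-10)
        if e >= 0.5:
            break                           # no remaining sub-classifier helps
        alphas[t] = 0.5 * np.log((1 - e) / e)
        used.append(t)
        w *= np.exp(np.where(sub_preds[t] == y, -alphas[t], alphas[t]))
        w /= w.sum()
    return alphas

def boosted_predict(sub_preds, alphas):
    """Weighted vote of the sub-classifiers (0/1 mapped to -1/+1)."""
    votes = np.asarray(alphas) @ (2 * np.asarray(sub_preds) - 1)
    return (votes >= 0).astype(int)
```

A more accurate sub-classifier receives a larger weight in the final vote, so uninformative interactions contribute little to the final decision.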