# Improved cost-sensitive representation of data for solving the imbalanced big data classification problem – Journal of Big Data

#### ByMahboubeh Fattahi, Mohammad Hossein Moattar and Yahya Forghani

May 3, 2022

Assume that (X in {mathbb{R}}^{d times n}) is a data matrix that represents n data d dimensional in which (x_{i}) is equal to i-th data point. The above data label is represented by the vector (Y = left{ {y_{1} ldots y_{n} } right}) where (y_{i} in left{ { – 1. + 1} right}). To clarify the rest of the proposed approach, Table 1 is included which summarizes the frequently used notations. Matrix (Z in {mathbb{R}}^{m times n}) is a reduced latent representation of Z where (m ll d). In other words, the Z matrix is the result of the feature extraction operation on X, which can be generated by the following mapping:

$$X = QZ + varepsilon ,$$

(1)

where (Q in {mathbb{R}}^{d times m}) is the mapping matrix and (varepsilon in {mathbb{R}}^{d times n}) is the reconstruction error. In order to minimize the reconstruction error, a simple solution can be the soft minimization of the following equation:

$$mathop {min }limits_{z.Q} frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} – Qz_{i} } right|^{2} } .$$

(2)

Suppose that the feature selection is done on input space through a diagonal matrix (sigma) such that (diagleft( sigma right)) consists of 0 and 1 where (sigma_{j} = 1), denotes that the (j{text{th}}) feature is selected from a dataset and vice versa. Assuming feature selection by (sigma), Eq. (2) will be written as follows:

$$mathop {min }limits_{z.Q} frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} sigma – Qz_{i} } right|^{2} } .$$

(3)

Adding the regularization term to the above optimization problem, the following equation is obtained:

$$mathop {min }limits_{z.Q} frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} sigma – Qz_{i} } right|^{2} } + Csumlimits_{i = 1}^{n} {left| {z_{i} } right|^{2} } ,$$

(4)

where the non-negative parameter (C) indicated the penalty for the regularization term. In addition, to control the (Q) matrix and prevent large values, (Q^{T} Q = I) constraint is added to the objective function. Therefore, the following optimization problem is obtained:

begin{aligned} & mathop {min }limits_{Z.Q} frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} sigma – Qz_{i} } right|^{2} } + Csumlimits_{i = 1}^{n} {left| z right|^{2} } \ & s;tquad Q^{T} Q = I. \ end{aligned}

(5)

In order to increase the discriminability of classes and preserving the primitive geometry structure in the reduced space (Z), another term is also added to objective function named as locally alignment. The following problem is attained by adding this term to an objective function. (LAleft( Z right)) will be discussed later.

begin{aligned} & mathop {min }limits_{Z.Q} frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} sigma – Qz_{i} } right|^{2} } + Csumlimits_{i = 1}^{n} {left| {z_{i} } right|^{2} } + {varvec{LA}}left( Z right) \ & s.;t.quad Q^{T} Q = I. \ end{aligned}

(6)

where (LA)((Z)) would be as:

$${varvec{LA}}left( Z right) = frac{1}{{nk_{1} }}sumlimits_{i = 1}^{n} {sumlimits_{j = 1}^{{k_{1} }} {left| {z_{i} – z_{ij} } right|^{2} } } – frac{beta }{{nk_{2} }}sumlimits_{i = 1}^{n} {sumlimits_{p = 1}^{{k_{2} }} {left| {z_{i} – z^{ip} } right|^{2} } } .$$

(7)

Minimization of Eq. (7) means that distance of reduced data (z_{i}) from (k_{1}) intra-class nearest neighbors should be decreased and the distance from (k_{2}) farther neighbors from the opposite class should be increased. In Eq. (7), (z_{ij}) denotes the (j{text{th}}) nearest neighbor for (i{text{th}}) data point with the similar label; and (z^{ip}) is the (p{text{th}}) nearest neighbor for (i{text{th}}) data from the opposite class. Also, the parameter (beta) indicates the importance level of the second term in the equation. It must be noted that (LA) makes the algorithm supervised because the class labels must be known to compute intra-class or inter-class neighbors.

The optimization problem is similar to the support vector machine (SVM) optimization [10] and is formulated as follows which performs simultaneous feature selection and classification.

begin{aligned} & mathop {min }limits_{omega .b.sigma .xi } {mathbf{1}}^{T} xi \ & s.;t. \ & y_{i} left( {mathop sum limits_{j = 1}^{m} z_{ij} omega_{j} sigma_{j} + b} right) ge 1 – xi_{i} quad i = 1 ldots n \ & omega^{T} omega le 1 \ & R_{ + } ge mathop sum limits_{j = 1}^{m} nu_{j}^{ + } sigma_{j}^{2} \ & R_{ – } ge mathop sum limits_{j = 1}^{m} nu_{j}^{ – } sigma_{j}^{2} \ & xi .sigma succcurlyeq 0. \ end{aligned}

(8)

In (8), (xi) is the slack variable and 1 is a vector of ones having the length equal to (xi), the matrix multiplication of ({mathbf{1}}^{T} xi) denotes the summation of the (xi) values. Also, (omega) denotes the coefficients of separating hyper-plane, (b) is the bias, and (z_{ij}) is the (j{text{th}}) feature of the (i{text{th}}) data. Also,

$$v_{j}^{ – } = frac{1}{{n_{ – } }}mathop sum limits_{{i in I_{ – } }} left( {z_{ij} } right)^{2} quad andquad v_{j}^{ + } = frac{1}{{n_{ + } }}mathop sum limits_{{i in I_{ + } }} left( {z_{ij} } right)^{2} .$$

(9)

Which means the empirical mean of the second order moment of the (j{text{th}}) feature in the negative and positive classes respectively. (n_{ – }) and (n_{ + }) are the number of negative and positive class data, respectively.

One of the constraints in (8), is to prevent from exorbitant growth of (mathop sum nolimits_{j = 1}^{m} nu_{j}^{ + } sigma_{j}^{2} ;{text{and}};mathop sum nolimits_{j = 1}^{m} nu_{j}^{ – } sigma_{j}^{2}) which have the upper bound (R_{ + }) and (R_{ – }) respectively. It is evident that by choosing a low value for (R_{ + }) and (R_{ – }), the optimum solution for (sigma) would include more zero values and consequently it results in a higher feature reduction rate.

Another constraint in the problem (8), i.e. (omega^{T} omega le 1), is the bounding over (l_{2})-norm of (omega) which means a regularization role. Now the final optimization problem resulting from combining the two relations (6) and (8) is as follows:

begin{aligned} & mathop {min }limits_{Q.Z.omega .b.sigma .xi } 1^{T} xi + frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} sigma – Qz_{i} } right|^{2} } + frac{1}{{nk_{1} }}sumlimits_{i = 1}^{n} {sumlimits_{j = 1}^{{k_{1} }} {left| {z_{i} – z_{ij} } right|^{2} } } – frac{beta }{{nk_{2} }}sumlimits_{i = 1}^{n} {sumlimits_{p = 1}^{{k_{2} }} {left| {z_{i} – z^{ip} } right|^{2} } } + Csumlimits_{i = 1}^{n} {left| {z_{i} } right|^{2} } \ & s.;t. \ & Q^{T} Q = I \ & y_{i} left( {mathop sum limits_{j = 1}^{m} z_{ij} omega_{j} sigma_{j} + b} right) ge 1 – xi_{i} quad andquad i = 1 ldots n \ & omega^{T} omega le 1 \ & R_{ + } ge mathop sum limits_{j = 1}^{m} nu_{j}^{ + } sigma_{j}^{2} \ & R_{ – } ge mathop sum limits_{j = 1}^{m} nu_{j}^{ – } sigma_{j}^{2} \ & xi .sigma succcurlyeq 0. \ end{aligned}

(10)

In the above optimization problem, the input data is assumed to be balanced. This means that if there are two classes of data, approximately equal amounts of data are available from each of them. Since this assumption is not always true in real-world datasets, a cost-sensitive function is added to reinforce the above optimization problem, such as [21]. Therefore, the following optimization problem will be resulted with the same constraints in Eq. (8):

$$mathop {min }limits_{omega .b.sigma .xi } C_{ + } mathop sum limits_{{i in I_{ + } }} xi_{i} + C_{ – } mathop sum limits_{{j in I_{ – } }} xi_{j} .$$

(11)

In problem (11), the slack values for the positive and negative class data are added with different coefficients. Suppose the positive class is minor; therefore, to prevent the model from deviating towards that class, the C+ coefficient can be selected as a larger number than C. This means that since the slack of the positive samples will be more fined, the model will not deviate towards them.

The cost-sensitive optimization problem now becomes:

begin{aligned} & mathop {min }limits_{Q.z.omega .b.sigma .xi } C_{ + } sumlimits_{{i in I_{ + } }} {xi_{i} } + C_{ – } sumlimits_{{j in I_{ – } }} {xi_{j} } + frac{1}{n}sumlimits_{i = 1}^{n} {left| {x_{i} sigma – Qz_{i} } right|^{2} } + frac{1}{{nk_{1} }}sumlimits_{i = 1}^{n} {sumlimits_{j = 1}^{{k_{1} }} {left| {z_{i} – z_{ij} } right|^{2} – frac{beta }{{nk_{2} }}sumlimits_{i = 1}^{n} {sumlimits_{p = 1}^{{k_{2} }} {left| {z_{i} – z^{ip} } right|^{2} } } } } + Csumlimits_{i = 1}^{n} {left| {z_{i} } right|^{2} } \ & s.;t. \ & Q^{T} Q = I \ & y_{i} left( {mathop sum limits_{j = 1}^{d} x_{ij} omega_{j} sigma_{j} + b} right) ge 1 – xi_{i} quad andquad i = 1 ldots n \ & omega^{T} omega le 1 \ & R_{ + } ge mathop sum limits_{j = 1}^{d} nu_{j}^{ + } sigma_{j}^{2} \ & R_{ – } ge mathop sum limits_{j = 1}^{d} nu_{j}^{ – } sigma_{j}^{2} \ & xi .sigma succcurlyeq 0. \ end{aligned}

(12)

Solving the proposed cost-sensitive optimization problem, Eq. (12), leads to the fact that in addition to feature extraction, feature weighting is performed simultaneously. The object of the problem is to find (z_{i}), which is the result of extracting the property on (x_{i} sigma). Also, since the optimization problem is cost sensitive due to the lack of proper classification of positive and negative classes, this leads the feature reduction process to the output properties that are not only suitable for the majority class but also for the minority class.

The proposed problem, causes more separation in the reduced space in two ways. One for the existence of the expression LA and the other for the existence of the constraint (y_{i} left( {mathop sum limits_{j = 1}^{d} x_{ij} omega_{j} sigma_{jj} + b} right) ge 1 – xi_{i} .)

Feature extraction as a solution to a reduced manifold learning optimization problem is based on error reduction and maintaining geometric relationships between data. Also, in order to select the features, optimization problems based on the minimization of the above the generalization error have been adopted. Finally, the optimization problem combined from the above two problems is solved by adding a cost-sensitive expression to create a balance without manipulating the data in the imbalanced data. The flowchart of the proposed approach in illustrated in Fig. 1.

### The pseudo code

The steps of the optimization algorithm are denoted in Alg. 1.

### Algorithm I

An iterative solution to optimize the proposed problem

• Input: data matrix (X = left{ {x_{1} ldots x_{n} } right}) where (x_{i} in {mathbb{R}}^{d}) (i = 1 ldots n) and labels (y_{i} in left{ { – 1. + 1} right}) Output: dimensionality reduced data (z_{i}) (forall i = 1 ldots n), (omega). and (b).

1. 1:

Initialize (Q) and (sigma) with random values.

2. 2:

For (t = 1 ldots max_iteration) do

3. 3:

Solve the problem associated with step 1 to oain (z_{i}) at the (left( {t + 1} right){text{th}}) iteration by:

$$z_{i}^{(t + 1)} = left[ {k_{1} k_{2} Q^{(t)T} Q^{(t)} + left( {k_{1} k_{2} – beta k_{1} k_{2} + k_{1} k_{2} C} right)I} right]^{ – 1} left[ {k_{1} k_{2} Q^{(t)T} left( {x_{i} sigma^{(t)} } right) + k_{2} mathop sum limits_{j = 1}^{{k_{1} }} left( {z_{ij} } right)^{(t)} – k_{1} beta mathop sum limits_{p = 1}^{{k_{2} }} left( {z^{ip} } right)^{(t)} } right],$$

4. 4:

Solve the problem associated with step 2 to obtain (Q) by:

$$Q^{(t + 1)} = left( {mathop sum limits_{i = 1}^{n} left( {xsigma^{(t)} } right)z_{i}^{(t + 1)T} } right)left( {mathop sum limits_{i = 1}^{n} z_{i}^{(t + 1)} z_{i}^{(t + 1)T} } right)^{ – 1} ,$$

5. 5:

Solve the dual problem (12) using (z_{i}^{{left( {t + 1} right)}}) , (Q^{{left( {t + 1} right)}})

6. 6:

Exit, if convergence criterion meets.

7. 7:

end for

As it may be inferred from the above algorithm, there is a loop in line 2 which is iterated max_iteration times. In the loop, (z_{i}) is first computed which as seen, has a constant time with respect to the number of samples and features. Therefore, calculating the whole Z has a time complexity of O(n). Calculating Q is the most complex stage of the approach. The first parenthesis on Q is calculated in O(nm) while the second parenthesis has the same complexity. Therefore, the complexity of finding Q is O(2nm). For solving Eq. (12), we have six summations which are computed over n. Therefore, considering approximately constant operations in each summation, the computational complexity of the last stage in O(cn).

Having all these complexities for max_iteration times, the total time complexity of the algorithm will be O(max_iteration * (n + 2 nm + cn)) in which n is the number of samples and m is the number of final extracted features. Therefore, as seen, the time complexity of the algorithm is a linear function of n and m which is not much as compared to similar approaches and implementing the approach in parallel will decrease the run time of the algorithm.