# A deep learning approach for orphan gene identification in moso bamboo (Phyllostachys edulis) based on the CNN + Transformer model – BMC Bioinformatics

May 5, 2022

### Dataset

BLAST (2.11.0+) [26] was used to identify OG based on previous studies [6, 27,28,29]. Moso bamboo protein sequences were downloaded from Bamboo GDB [30]. First, we used BLASTp to search for homologs of all 31,987 proteins annotated in moso bamboo in each of the other 136 plant species released in Phytozome v12.1 [31] with an e-value cutoff of 1e−5. A total of 30,443 moso bamboo genes showed significant similarity to at least one sequence, which were defined as Evolutionarily Conserved genes (ECs) [32] and removed from further analysis (Fig. 1). Second, the remaining 1936 moso bamboo proteins for which no homologs could be found in any of the genomes were used for the next step of searches, which was performed by tBLASTn analysis. In this step, 392 moso bamboo genes were classified as ECs. The final sets of ECs and OG contained 30,443 and 1,544 genes (Additional file 1: Table S1), respectively (Fig. 1).

To adequately train the deep learning model, the 1544 obtained OG were assigned label 1, and the 30,443 ECs were assigned label 0. These genes were combined to form the moso bamboo orphan gene dataset.

### Protein embedding

Protein sequences are composed of 20 possible amino acids, each of which is represented by a capital letter. To make a protein sequence recognizable by a computer, the first step is to encode each amino acid according to Table 1, mapping each amino acid to a specific integer: values of 1–20 represent the amino acid types, and unspecified or unknown amino acids are denoted as 21 [33, 34]. The order in which amino acids are assigned numbers in the table does not affect the experimental results. Each encoded sequence was then processed by truncating long sequences to a fixed length (L) and zero-padding short sequences, a method that is widely used for data preprocessing and effective training [35]. As a result, we obtained a one-dimensional vector for each protein:

$$s = (s_{1}, s_{2}, \ldots, s_{L}), \quad s_{i} \in \{0, 1, \ldots, 21\}$$

(1)
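The encoding step of Eq. 1 can be sketched as follows, assuming a fixed alphabetical residue order (the text notes that the particular letter-to-number assignment does not affect results):

```python
# Sketch of Eq. 1: amino acids map to the integers 1-20, unknown residues
# to 21, and sequences are truncated or zero-padded to a fixed length L.
# The residue order below is illustrative, not the article's Table 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode_protein(seq: str, max_len: int) -> list[int]:
    """Map a protein string to a fixed-length integer vector."""
    codes = [AA_TO_INT.get(aa, 21) for aa in seq[:max_len]]  # unknown -> 21
    codes += [0] * (max_len - len(codes))                    # zero padding
    return codes

print(encode_protein("MKXA", 6))  # -> [11, 9, 21, 1, 0, 0]
```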

For protein sequences, a distributed representation of the amino acids can be obtained by learning a dense continuous feature vector for each amino acid in the sequence. When these embedding vectors are projected into 2D, amino acids with similarities in hydrophobicity, polarity and net charge, factors important for covalent chemical bonding, form visually distinguishable groups [36]. This validates the use of distributed representations as an effective method for encoding amino acids that also helps to preserve important physicochemical properties.

Hence, the sparse feature vector (S) of a given protein sequence is transformed into a dense continuous feature representation using a word embedding transformation: \(F_{e}^{1} \in \mathbb{R}^{L \times e}\), where e is the embedding dimension. The self-attention mechanism in the transformer cannot distinguish between words at different positions, so the input sequence must be position-encoded to incorporate positional information; the positional encoding is a second two-dimensional matrix, \(F_{e}^{2} \in \mathbb{R}^{L \times e}\). \(F_{e}^{1}\) and \(F_{e}^{2}\) are added together as the input of CNN + Transformer, as follows:

$$F_{e} = F_{e}^{1} + F_{e}^{2} = \left[ \begin{array}{ccccc} s_{1,1} & \cdots & s_{1,j} & \cdots & s_{1,e} \\ \vdots & & \vdots & & \vdots \\ s_{i,1} & \cdots & s_{i,j} & \cdots & s_{i,e} \\ \vdots & & \vdots & & \vdots \\ s_{L,1} & \cdots & s_{L,j} & \cdots & s_{L,e} \end{array} \right]$$

(2)

Here, \(s_{i,j}\) corresponds to the \(j\)-th word-embedding component of the \(i\)-th amino acid of the protein sequence after preprocessing.
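The embedding step of Eq. 2 can be sketched as follows. The article does not specify the positional-encoding scheme, so the sinusoidal encoding from the original transformer is assumed, and a randomly initialized embedding table stands in for a learned one:

```python
import numpy as np

def positional_encoding(L: int, e: int) -> np.ndarray:
    """Sinusoidal positional encoding F_e^2 of shape (L, e) -- an assumption,
    since the article only says the input is position-encoded."""
    pos = np.arange(L)[:, None]                       # (L, 1)
    i = np.arange(e)[None, :]                         # (1, e)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / e)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
L, e = 500, 50                                        # fixed length, embedding dim
embedding_table = rng.normal(size=(22, e))            # 22 symbols: 0 pad, 1-20 aa, 21 unknown
tokens = rng.integers(0, 22, size=L)                  # an encoded protein (Eq. 1)
F_e1 = embedding_table[tokens]                        # token embeddings F_e^1, (L, e)
F_e = F_e1 + positional_encoding(L, e)                # Eq. 2: elementwise sum
print(F_e.shape)                                      # -> (500, 50)
```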

### Transformer model

The model used in this work was based on a transformer architecture consisting of two separate stacks of layers for the encoder and decoder, respectively [37]. The structure of the encoder and decoder, as shown in Fig. 2, mainly includes a multi-head attention mechanism layer, a feed-forward fully connected layer, and normalization and residual connections [37, 38].

The multi-head attention mechanism incorporates some portion of knowledge written in its internal memory (V) with indexed access by keys (K). When new data arrive (Q), the layer calculates attention and modifies the input accordingly, thus generating the output of the self-attention layer and weighting the parts that carry the essential information. The formulas of Q, K, and V are as follows:

$$Q = F_{c} W_{i}^{Q}$$

(3)

$$K = F_{c} W_{i}^{K}$$

(4)

$$V = F_{c} W_{i}^{V}$$

(5)

\(F_{c}\) is the input matrix of the transformer model, \(W_{i}^{Q}\) is the query transformation weight matrix, \(W_{i}^{K}\) is the key transformation weight matrix, and \(W_{i}^{V}\) is the value transformation weight matrix.

Q and K perform a dot-product operation, and the result is divided by the scaling factor \(\sqrt d\). The softmax function is then applied, and the result is multiplied by V to obtain the self-attention output. The output (H) of multi-head self-attention is obtained by concatenating the outputs of the N-head self-attention heads, and the formulas for multi-head self-attention are as follows.

$$head_{i} = {\text{Attention}}(F_{c} W_{i}^{Q}, F_{c} W_{i}^{K}, F_{c} W_{i}^{V})$$

(6)

$${\text{Attention}}(Q,K,V) = {\text{softmax}}\left( \frac{QK^{T}}{\sqrt d} \right)V$$

(7)

$$H = {\text{Concat}}(head_{1}, head_{2}, \ldots, head_{N\text{-}head})$$

(8)

where \(head_{i}\) denotes the output of the i-th self-attention head (\(1 \le i \le N\text{-}head\)). LayerNorm [39] indicates layer normalization, which mainly serves to speed up the convergence of the model, while the residual network structure is used to reduce the learning load of the model, according to the following equation.

$$H^{\prime} = {\text{LayerNorm}}(H + {\text{Dropout}}(H))$$

(9)

Dropout [40] is a stochastic deactivation strategy to prevent overfitting in models with a large number of parameters. The feed-forward layer enhances the nonlinear capability of the model with two layers of neural networks, and the transformer structure continues after the feed-forward layer with a normalization and residual layer, according to the following equation.

$$O = {\text{LayerNorm}}(H^{\prime} + {\text{Dropout}}({\text{ReLU}}(H^{\prime} W^{(1)} + b^{(1)})W^{(2)} + b^{(2)}))$$

(10)

where O is the output vector of the transformer layer in the model, \(H^{\prime}\) is the normalized vector, and \(W^{(1)}\), \(W^{(2)}\), \(b^{(1)}\), and \(b^{(2)}\) are the weight coefficients and biases of the two-layer neural network, respectively.
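A minimal numpy sketch of one encoder layer (Eqs. 3–10) is shown below. It assumes inference mode, where Dropout is the identity, and takes the residual from the layer input, the standard transformer form; all weight shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Eq. 7: scaled dot-product attention with scaling factor sqrt(d)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(F_c, head_weights, W1, b1, W2, b2):
    # Eqs. 3-6: per-head Q, K, V projections; Eq. 8: concatenation of heads
    heads = [attention(F_c @ Wq, F_c @ Wk, F_c @ Wv) for Wq, Wk, Wv in head_weights]
    H = np.concatenate(heads, axis=-1)
    H1 = layer_norm(F_c + H)            # residual + LayerNorm (Eq. 9, Dropout = identity)
    ffn = np.maximum(H1 @ W1 + b1, 0.0) @ W2 + b2   # two-layer feed-forward with ReLU
    return layer_norm(H1 + ffn)         # residual + LayerNorm (Eq. 10)

rng = np.random.default_rng(1)
L, e, n_head = 8, 16, 2
d = e // n_head                          # per-head dimension so heads concat back to e
hw = [(rng.normal(size=(e, d)), rng.normal(size=(e, d)), rng.normal(size=(e, d)))
      for _ in range(n_head)]
W1, b1 = rng.normal(size=(e, 4 * e)), np.zeros(4 * e)
W2, b2 = rng.normal(size=(4 * e, e)), np.zeros(e)
O = encoder_layer(rng.normal(size=(L, e)), hw, W1, b1, W2, b2)
print(O.shape)                           # -> (8, 16)
```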

### CNN + Transformer model

An important disadvantage of the transformer model is its inefficiency in processing long sequences, mainly due to the computation and memory complexity of the self-attention module [41]. A CNN can extract local features in the sequence and thereby shorten its length [42, 43]. Therefore, this research proposes the CNN + Transformer model structure, which combines a CNN and a transformer model. As shown in Fig. 3, the proposed CNN + Transformer model structure is composed of two multi-kernel convolution layers, a transformer layer, and a fully connected layer.

After protein embedding, the protein sequences are encoded as dense, continuous vectors (\(F_{e}\)) that serve as the input of the CNN layer. The CNN layer in the model consists of two convolutional layers, with the first convolutional layer containing six convolution kernels, as shown in Fig. 3. The variable filter sizes in the convolution are designed to capture k-mer amino acid fragments, where k ranges from 2 (dipeptides) to 7 (heptapeptides). The second layer consists of three convolution kernels, denoted \(\{k_{j}^{n}\}_{n=1,2,3;\, j=3,6,9}\), where n indexes the kernel and j denotes its size. The kernel size is equal to the size of a convolutional window spanning j characters, and the parameters are tuned during the training and validation step. The intermediate feature map of the i-th CNN layer is extracted as \(F_{m}^{i} = {\text{Conv}}(F_{e}, K^{i})\).

After obtaining the intermediate convolutional feature map (\(F_{m}^{i}\)), downsampling is performed using AvgPooling by taking the average of the output subregions of the CNN layer, which helps maintain the integrity of the information and facilitates subsequent global feature extraction. After average pooling, the output from all kernels is concatenated for another average pooling operation, which is used to generate the next layer of feature maps (\(F_{c}\)), with the following equation:

$$F_{c} = {\text{AvgPooling}}({\text{Concat}}({\text{AvgPooling}}(F_{m}^{i})))$$

(11)

where AvgPooling and Concat are the average pooling and concatenation operations, respectively.
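The convolution-and-pooling stage of Eq. 11 can be sketched as follows; the single-channel kernels, the pooling window of 2, and the other shapes are illustrative assumptions, since the article specifies only the kernel widths (2–7 in the first layer; 3, 6, 9 in the second):

```python
import numpy as np

def conv1d(F, K):
    """Valid 1-D convolution of F (L, e) with one kernel K (k, e) -> (L-k+1,)."""
    k = K.shape[0]
    return np.array([(F[i:i + k] * K).sum() for i in range(F.shape[0] - k + 1)])

def avg_pool(x, w):
    """Non-overlapping average pooling with window w."""
    n = len(x) // w
    return x[:n * w].reshape(n, w).mean(axis=1)

rng = np.random.default_rng(2)
L, e = 60, 8
F_e = rng.normal(size=(L, e))                                     # embedded sequence
kernels = [rng.normal(size=(k, e)) for k in (2, 3, 4, 5, 6, 7)]   # first-layer k-mer filters
pooled = [avg_pool(conv1d(F_e, K), 2) for K in kernels]           # AvgPooling per kernel
F_c = avg_pool(np.concatenate(pooled), 2)                         # Concat, then AvgPooling (Eq. 11)
print(F_c.shape)                                                  # -> (84,)
```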

The transformer layer in the CNN + Transformer model is composed of three transformer encoder-decoder layers. In the transformer layer, a feature representation of long-range interaction information between amino acids is obtained by introducing a multi-head self-attention mechanism; the values of the two hyperparameters of this layer, the number of self-attention heads (N-head) and the number of transformer layers (Num-layer), are discussed in the “Results” section.

The output of the transformer layer is flattened to one dimension. The number of output hidden units in the fully connected layer is 2, corresponding to the binary classification predicted by the model, and the binary classification output vector is transformed into the probability \(p\) by the activation function, as follows.

$$T^{(l)} = \sigma (W^{l} T^{(l - 1)} + b^{l} )$$

(12)

$$p = \sigma (W^{s} T^{(l)} + b^{s} )$$

(13)

where \(\sigma\) is the activation function, \(l \ge 1\) is the number of layers of the fully connected neural network, and \(W^{l}\) and \(b^{l}\) are the connection weights and biases of the hidden nodes from layer \(l-1\) to layer \(l\) in the fully connected layer, respectively. \(W^{l} \in \mathbb{R}^{n_{l-1} \times n_{l}}\), where \(n_{l-1}\) and \(n_{l}\) are the numbers of hidden nodes in layers \(l-1\) and \(l\), respectively. \(T^{(l)}\) is the output hidden vector of layer \(l\). \(W^{s}\) and \(b^{s}\) are the weights and biases, respectively, of the penultimate layer of the fully connected neural network, and \(W^{s} \in \mathbb{R}^{n_{l} \times 2}\).
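Eqs. 12 and 13 can be illustrated with a small numpy sketch; the layer sizes, the ReLU hidden activation, and the softmax used for the final probability are illustrative assumptions (the article writes a generic activation \(\sigma\)):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
n_in, n_hidden = 512, 256                        # illustrative dimensions
T0 = rng.normal(size=n_in)                       # flattened transformer output
W1, b1 = rng.normal(size=(n_hidden, n_in)) * 0.01, np.zeros(n_hidden)
Ws, bs = rng.normal(size=(2, n_hidden)) * 0.01, np.zeros(2)
T1 = np.maximum(W1 @ T0 + b1, 0.0)               # Eq. 12 with sigma = ReLU (assumed)
p = softmax(Ws @ T1 + bs)                        # Eq. 13: two-class probabilities
print(p.shape, p.sum())                          # probabilities sum to 1
```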

### Implementation details

The loss function was the cross-entropy loss, and the Adam optimizer [44] was used. In the training set, there were approximately 20 times more moso bamboo non-OG than OG, which led to a class imbalance problem during training. We explored a cost-sensitive technique to address this imbalance when training the classifier: weights were assigned to the different categories of data according to their number ratio, and these category weights were used to train the model. CNN + Transformer involves multiple hyperparameters, which were tuned on the validation set using a grid search procedure. Their optimal values are given below:

1. Embedding dimension: we tested \(e \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}\) and found that the optimal model performance was obtained at e = 50.

2. Convolution filters: at the first convolution layer, we chose six convolution filters such that \(f_{k}^{1} \in \{2, 3, 4, 5, 6, 7\}\). This allowed us to capture amino acid k-mer frequencies for k-mers of lengths k = 2 to k = 7; these k-mers represent local contextual ‘biological’ words. For the second convolution layer, the optimal filter sizes were \(f_{k}^{2} \in \{3, 6, 9\}\). This allowed inference of interactions between amino acid k-mers, i.e., detection of the frequencies of local contextual biological phrases consisting of two k-mers with the same or different k. For example, the second convolution layer could capture interactions between two different dipeptides as well as estimate the frequency of a biological phrase comprising a dipeptide and a tripeptide.

3. Transformer encoder-decoder layer number L and self-attention head number H: we took {2, 3, 4} for L and {5, 10} for H, forming six combinations of L and H: {(2, 5), (2, 10), (3, 5), (3, 10), (4, 5), (4, 10)}. The optimal model performance was attained for L = 3 and H = 10.

4. Fully connected layer dimension: we tested \(f_{c} \in \{64, 128, 256, 512\}\); the optimal value of \(f_{c}\) was 256.

5. Learning rate: the learning rate for the Adam optimizer was 0.001.

6. Number of epochs: the maximum number of epochs was set to 100, but we enforced early stopping if the validation loss stopped improving for two consecutive epochs.

7. Batch size: we tested batch sizes {64, 128, 256}. The optimal model performance was attained for batch size = 128.
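The cost-sensitive weighting described at the start of this section can be sketched with inverse-class-frequency weights. The article does not state the exact weighting formula, so the scheme below is an assumption; in PyTorch, the resulting vector would be passed as `torch.nn.CrossEntropyLoss(weight=...)`:

```python
# Inverse-frequency class weights (an assumed scheme): each class is weighted
# so that the rare orphan-gene class (label 1) contributes to the loss
# comparably to the abundant EC class (label 0).
n_ec, n_og = 30443, 1544                 # class counts from the Dataset section
total = n_ec + n_og
weights = [total / (2 * n_ec), total / (2 * n_og)]   # [w_class0, w_class1]
print(weights)                            # w_class1 is ~20x larger than w_class0
```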

All deep learning models were implemented in PyTorch (1.7.1) [45]. To speed up the training process, the GPU version of PyTorch on an NVIDIA Tesla P100 PCIe 16 GB system was used for the experiments.

### Evaluation strategies

The identification of OG is an imbalanced classification problem: the number of non-OG in the moso bamboo orphan gene dataset is approximately 20 times the number of OG. Although accuracy and F1 scores are very popular classification evaluation metrics, they can produce misleading results on unbalanced datasets because they do not take into account the ratio between positive and negative samples; a classifier can achieve good specificity while still producing a large number of false positives. An effective solution for overcoming the class imbalance issue comes from the MCC and BA [46]. When dealing with an unbalanced dataset, GM and BM are better performance metrics if the classification success rate is of concern [47]. Therefore, BA, BM, GM, and MCC were selected as the evaluation metrics in the experiments. These metrics are defined as follows:

$${\text{BA}} = \frac{1}{2} \times \left( \frac{{\text{TP}}}{{\text{TP + FN}}} + \frac{{\text{TN}}}{{\text{TN + FP}}} \right)$$

(14)

$${\text{GM}} = \sqrt{\frac{{\text{TP}}}{{\text{TP + FN}}} \times \frac{{\text{TN}}}{{\text{TN + FP}}}}$$

(15)

$${\text{BM}} = \frac{{\text{TP}}}{{\text{TP + FN}}} + \frac{{\text{TN}}}{{\text{TN + FP}}} - 1$$

(16)

$${\text{MCC}} = \frac{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}{\sqrt{({\text{TP + FP}}) \times ({\text{TP + FN}}) \times ({\text{TN + FP}}) \times ({\text{TN + FN}})}}$$

(17)

Here, TP is the number of OG identified as OG, FN is the number of OG identified as non-OG, FP is the number of non-OG identified as OG, and TN is the number of non-OG identified as non-OG.
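The four metrics (Eqs. 14–17) can be computed directly from these confusion-matrix counts; the counts in the example below are invented for illustration, not results from the article:

```python
import math

def metrics(TP, FN, FP, TN):
    sens = TP / (TP + FN)                      # recall on OG (sensitivity)
    spec = TN / (TN + FP)                      # recall on non-OG (specificity)
    BA = (sens + spec) / 2                     # Eq. 14: balanced accuracy
    GM = math.sqrt(sens * spec)                # Eq. 15: geometric mean
    BM = sens + spec - 1                       # Eq. 16: bookmaker informedness
    MCC = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # Eq. 17
    return BA, GM, BM, MCC

# Illustrative counts only.
BA, GM, BM, MCC = metrics(TP=120, FN=34, FP=290, TN=2754)
print(round(BA, 3), round(GM, 3), round(BM, 3), round(MCC, 3))
```

Note that BM is simply 2·BA − 1, so the two metrics rank classifiers identically; MCC additionally accounts for the absolute numbers of false positives and false negatives.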