# An effective drug-disease associations prediction model based on graphic representation learning over multi-biomolecular network – BMC Bioinformatics

Jan 4, 2022

### Multi-biomolecular associations data

In this work, the SCMFDD-S data set collected by Zhang et al. [29] is used for training; it includes 269 drugs, 598 diseases, and 18,416 DDAs. DrugBank [30] is a comprehensive database of extensive drug information that provides SMILES strings for drugs, which we convert to Morgan fingerprints with Python packages. In addition, as shown in Table 1, we downloaded eight types of heterogeneous associations from nine other databases: 8374 pairs of miRNA-lncRNA associations provided by the lncRNASNP2 database, 16,427 pairs of miRNA-disease associations provided by the HMDD database [31], 4944 pairs of miRNA-protein associations provided by the miRTarBase database [32], and 1264 pairs of lncRNA-disease associations provided by the LncRNADisease [33] and lncRNASNP2 [34] databases. LncRNA2Target [35], DisGeNET [36], DrugBank, and STRING [37] provided 690 pairs of lncRNA-protein associations, 25,087 pairs of protein-disease associations, 11,107 pairs of drug-protein associations, and 19,237 pairs of protein–protein interactions [38,39,40]. After unifying identifiers, eliminating redundancy, simplifying, and deleting irrelevant items, the downloaded experimental data were sorted into the form shown in Table 2.

### Disease descriptors

In order to represent the similarity between diseases, we calculated disease semantic similarity by referring to the MeSH database [41], which was developed by the National Library of Medicine (NLM). The MeSH database categorizes diseases strictly and accurately. Each disease downloaded from https://www.nlm.nih.gov/ has a descriptor from which a directed acyclic graph (DAG) describing the disease can be constructed. Specifically, the DAG of a disease $e$ can be described as $DAG_{e}=(e, N_{e}, D_{e})$, where $N_{e}$ represents the set of diseases associated with disease $e$, and $D_{e}$ represents the set of edges between them. The contribution of a disease $d$ in the set $N_{e}$ to the semantic value of disease $e$ is:

$$C_{e}(d)=\begin{cases}1, & \text{if } d = e,\\ \max\left\{\varepsilon \cdot C_{e}(d') \mid d' \in \text{children of } d\right\}, & \text{if } d \ne e,\end{cases}$$

(1)

where $\varepsilon$ is a contribution parameter. The semantic value $DV(e)$ is obtained by adding up the contribution values of all diseases in the set $N_{e}$, and its formula is as follows [42]:

$$DV(e)=\sum_{d\in N_{e}} C_{e}(d)$$

(2)

We assume that the more of their DAGs two diseases share, the more similar they are. Based on this assumption, disease semantic similarity is calculated according to the relative positions of diseases $e(i)$ and $e(j)$ in their DAGs:

$$SV_{1}\left(e(i), e(j)\right)=\frac{\sum_{d\in N_{e(i)}\cap N_{e(j)}}\left(C_{e(i)}(d)+C_{e(j)}(d)\right)}{DV(e(i))+DV(e(j))}$$

(3)
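The three equations above can be sketched in code. This is a minimal illustration, not the authors' implementation: each disease's DAG is assumed to be given as a dict mapping every term in the DAG to its list of child terms, and the toy disease names and the value of the contribution parameter $\varepsilon$ are made up for the example.

```python
EPSILON = 0.5  # contribution parameter (epsilon in Eq. 1); the value is an assumption

def semantic_contributions(dag, disease):
    """Compute C_e(d) for every term d in the DAG of `disease` (Eq. 1)."""
    contrib = {disease: 1.0}
    # A term's contribution depends on its children, so iterate
    # until values stabilise (the DAGs here are small).
    changed = True
    while changed:
        changed = False
        for term, children in dag.items():
            vals = [EPSILON * contrib[c] for c in children if c in contrib]
            if vals and contrib.get(term, -1.0) < max(vals):
                contrib[term] = max(vals)
                changed = True
    return contrib

def semantic_value(contrib):
    """DV(e): sum of the contributions of all terms in N_e (Eq. 2)."""
    return sum(contrib.values())

def semantic_similarity(dag_i, dag_j, e_i, e_j):
    """SV1 between diseases e_i and e_j (Eq. 3)."""
    c_i = semantic_contributions(dag_i, e_i)
    c_j = semantic_contributions(dag_j, e_j)
    shared = set(c_i) & set(c_j)  # terms in both DAGs
    num = sum(c_i[d] + c_j[d] for d in shared)
    return num / (semantic_value(c_i) + semantic_value(c_j))
```

For two toy diseases that share only a common parent term, the shared parent contributes $\varepsilon$ to each side, so the similarity is $2\varepsilon/(2(1+\varepsilon))$; a disease compared with itself yields 1.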

### NcRNA and protein sequence descriptors

In order to standardize and characterize ncRNA transcripts and protein sequences, we use the 3-mer method to analyze each sequence. As shown in Fig. 2, to facilitate the coding of proteins and ncRNAs, we divided the 20 amino acids into 4 groups: [Ala, Val, Leu, Ile, Met, Phe, Trp, and Pro], [Gly, Ser, Thr, Cys, Asn, Gln, and Tyr], [Arg, Lys, and His], and [Asp and Glu] [43]; the four nucleotides of ncRNA, adenine (A), cytosine (C), guanine (G), and uracil (U), each form their own group. As shown in Fig. 2c, we calculate the frequency of each combination of amino-acid groups or nucleotides with a sliding window of length 3. In this way, each sequence can be expressed as a 64 ($4^3$)-dimensional vector through 3-mer.
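The 3-mer descriptor above can be sketched as follows. This is an illustrative implementation assuming one-letter residue codes; the amino-acid group assignment follows the four groups listed in the text.

```python
from itertools import product

# Reduced 4-letter alphabets: amino acids mapped to the four groups from the
# text (one-letter codes), and RNA nucleotides each in their own group.
AA_GROUPS = {
    "A": "0", "V": "0", "L": "0", "I": "0", "M": "0", "F": "0", "W": "0", "P": "0",
    "G": "1", "S": "1", "T": "1", "C": "1", "N": "1", "Q": "1", "Y": "1",
    "R": "2", "K": "2", "H": "2",
    "D": "3", "E": "3",
}
RNA_GROUPS = {"A": "0", "C": "1", "G": "2", "U": "3"}

def kmer_frequencies(seq, groups, k=3):
    """Normalised k-mer frequencies over the reduced 4-letter alphabet.

    Returns a 4**k-dimensional vector (64-dimensional for k = 3)."""
    reduced = "".join(groups[c] for c in seq)
    kmers = ["".join(p) for p in product("0123", repeat=k)]
    counts = {km: 0 for km in kmers}
    n = max(len(reduced) - k + 1, 1)  # number of sliding-window positions
    for i in range(len(reduced) - k + 1):  # sliding window of length k
        counts[reduced[i:i + k]] += 1
    return [counts[km] / n for km in kmers]
```

The same function handles both sequence types; only the grouping table differs, which is what makes a shared 64-dimensional descriptor possible.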

### Stacked auto-encoder

As shown in Fig. 2b, the SMILES (simplified molecular-input line-entry system) string of each drug can be found in the DrugBank database. The RDKit Python package can convert SMILES into Morgan fingerprints [44, 45]. In this work, a stacked auto-encoder (SAE) is introduced to extract features from the constructed Morgan fingerprints. As shown in Fig. 3a, an auto-encoder is a kind of symmetric neural network belonging to semi-supervised learning; its learning function is $x' = f_{W,b}(x)\approx x$, where $x$ is the input vector, and $W=(W_{1},W_{2})$ and $b=(b_{1},b_{2})$ represent the weights and biases.

Figure 3b shows the structure of a stacked auto-encoder built from $h$ auto-encoder layers. The vector output by the first auto-encoder layer is used as the input of the second layer, and so on, until the output vector of the top layer is obtained. Stochastic gradient descent is used for training. In this way, the stacked auto-encoder maps each drug's molecular fingerprint to a vector characterizing its molecular structure.
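The layer structure described above can be sketched as follows. This is a structural illustration only, assuming a sigmoid activation and randomly initialised weights; the actual training by stochastic gradient descent on the reconstruction error is omitted, and all dimensions are examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class AutoEncoderLayer:
    """One symmetric auto-encoder layer: x' = f_{W,b}(x) with W = (W1, W2), b = (b1, b2)."""

    def __init__(self, n_in, n_hidden, rng):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.b2 = np.zeros(n_in)

    def encode(self, x):
        return sigmoid(x @ self.W1 + self.b1)

    def reconstruct(self, x):
        return sigmoid(self.encode(x) @ self.W2 + self.b2)

def stacked_encode(layers, x):
    """Stacking: the code of one layer becomes the input of the next."""
    for layer in layers:
        x = layer.encode(x)
    return x

rng = np.random.default_rng(0)
# e.g. a 1024-bit Morgan fingerprint compressed to 64 dimensions in two stages
layers = [AutoEncoderLayer(1024, 256, rng), AutoEncoderLayer(256, 64, rng)]
code = stacked_encode(layers, rng.random(1024))
```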

### Node representation

In the MAN, each node is characterized by two parts: the attributes of the node itself, and its associations with other nodes. The attributes of the node itself include ncRNA sequences, protein sequences, disease semantic information, and drug fingerprints. Network representation learning is used to calculate the associations between a node and the other nodes, which globally represent the information flow across the entire network. Because the MAN is sparse and discrete, a simple and efficient low-dimensional representation method is needed, and graph embedding is such a method. As a mainstream network embedding algorithm, LINE [46] can embed large-scale information networks into low-dimensional vector spaces and is suitable for any type of information network.

LINE is based on the assumption of neighborhood similarity and can be seen as an algorithm that uses Breadth-First Search (BFS) to construct neighborhoods. A major feature of LINE is that its objective preserves both the local and the global network structure: it combines first-order and second-order similarity in the graph to obtain richer graph representations. Figure 2d illustrates first-order and second-order similarity, where the thickness of an edge represents its weight. Because node 6 and node 7 are directly connected with a large weight, their first-order similarity is high. (In the MAN, the weights of the edges are all equal.) Node 5 and node 6 are not directly connected, but they share a common adjacent node, so their embeddings should be close and their second-order similarity is high.

First-order similarity models each undirected edge. For each edge $(a,b)$, we define the joint probability between vertices $v_{a}$ and $v_{b}$ as:

$$p_{1}(v_{a}, v_{b})=\frac{1}{1+\exp(-u_{b}^{T}\cdot u_{a})},$$

(4)

where $u_{a}$ and $u_{b}$ are the embedding vector representations of node $a$ and node $b$, respectively. According to the weights of the edges, the empirical distribution can also be obtained:

$$\hat{p}_{1}(a,b)=\frac{\omega_{ab}}{W},\quad W=\sum_{(i,j)\in E}\omega_{ij},$$

(5)

where $W$ is the sum of the weights of the edges in the graph. In order to keep the empirical distribution similar to the probability distribution, we use KL divergence to measure the similarity of the two distributions. After removing the constant term, the loss function obtained is as follows:

$$L_{1}= -\sum_{(a,b)\in E}\omega_{ab}\log\left(p_{1}(v_{a},v_{b})\right),$$

(6)

therefore, as long as $L_{1}$ is minimized, the first-order similarity of the node embeddings in the graph is preserved.
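The first-order objective can be illustrated with a toy example. This is not LINE's optimized implementation, just a direct transcription of Eqs. (4) and (6); the 2-dimensional embedding values are made up.

```python
import math

def p1(u_a, u_b):
    """First-order probability between two embedded nodes (Eq. 4)."""
    dot = sum(x * y for x, y in zip(u_a, u_b))
    return 1.0 / (1.0 + math.exp(-dot))

def first_order_loss(edges, weights, emb):
    """L1 = -sum over edges (a,b) of w_ab * log p1(u_a, u_b)  (Eq. 6)."""
    return -sum(w * math.log(p1(emb[a], emb[b]))
                for (a, b), w in zip(edges, weights))

# Toy embeddings: nodes 0 and 1 point the same way, node 2 the opposite way.
emb = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [-1.0, 0.0]}
```

Minimizing $L_1$ pushes $p_1$ toward 1 for heavily weighted edges, i.e. it pulls the embeddings of connected nodes together.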

Second-order applies to both directed and undirected graphs. We first define the probability distribution of node transition:

$$p_{2}(v_{b}|v_{a})=\frac{\exp({u'_{b}}^{T}\cdot u_{a})}{\sum_{k=1}^{|V|}\exp({u'_{k}}^{T}\cdot u_{a})}$$

(7)

where $|V|$ is the number of vertices, $u_{a}$ is the representation of $v_{a}$ when it is regarded as a vertex, and $u'_{a}$ is the representation of $v_{a}$ when it is treated as a specific “context”. At the same time, the second-order empirical distribution is defined as follows:

$$\hat{p}_{2}(v_{b}|v_{a})=\frac{\omega_{ab}}{d_{a}},\quad d_{a}=\sum_{k\in N(a)}\omega_{ak},$$

(8)

where $d_{a}$ is the out-degree of node $a$ and $N(a)$ is the set of nodes adjacent to node $a$.

To keep the probability distribution close to the empirical distribution, we again use KL divergence to measure the difference between the two distributions. After removing the constant term and performing a series of approximations, we obtain the loss function:

$$L_{2}= -\sum_{(a,b)\in E}\omega_{ab}\log\left(p_{2}(v_{b}|v_{a})\right).$$

(9)
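The second-order probability of Eq. (7) is simply a softmax of the vertex vector against all context vectors. The sketch below, with made-up 2-dimensional embeddings, shows that the probabilities over all contexts sum to 1 and favour contexts aligned with the vertex vector.

```python
import math

def p2(a, b, vertex_emb, context_emb):
    """p2(v_b | v_a): softmax over all context vectors u'_k (Eq. 7)."""
    u_a = vertex_emb[a]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    scores = {k: math.exp(dot(u_k, u_a)) for k, u_k in context_emb.items()}
    return scores[b] / sum(scores.values())

# Toy example: node 0 as vertex; three nodes acting as contexts.
vertex = {0: [1.0, 0.0]}
context = {0: [0.0, 0.0], 1: [2.0, 0.0], 2: [-2.0, 0.0]}
```

Minimizing $L_2$ makes nodes that share many neighbours receive similar context distributions, which is exactly the second-order similarity illustrated by nodes 5 and 6 in Fig. 2d.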

### Random forest

Ensemble learning has been widely used in bioinformatics; its idea is to combine multiple single classifiers into a new classifier to obtain a better classification effect. Among ensemble learning algorithms, we choose the random forest classifier to classify and predict drug-disease associations [47]. Random forest avoids the overfitting problem of a single decision tree and, compared with other single classifiers, usually offers more stable prediction performance [48]. Since stability and accuracy are very important for large-scale prediction of drug-disease associations, random forest was selected in this work as the classifier to process the extracted features.
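The classification step can be sketched with scikit-learn's random forest. This is a minimal stand-in, assuming each candidate drug-disease pair has already been encoded as a fixed-length feature vector; the random features and toy labels below replace the real concatenated node representations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((200, 64))            # 200 candidate drug-disease pairs (toy features)
y = (X[:, 0] > 0.5).astype(int)      # toy labels: associated (1) / not associated (0)

# An ensemble of decision trees; bagging over trees gives the stability
# that motivates choosing random forest here.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # predicted association probability per pair
```

In practice the scores would be computed for held-out pairs rather than the training set, and ranked to propose new drug-disease associations.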