# Prediction of lncRNA–Disease Associations via Closest Node Weight Graphs of the Spatial Neighborhood Based on the Edge Attention Graph Convolutional Network

Jianwei Li, et al.

Jan 4, 2022

## Introduction

Long non-coding RNAs (lncRNAs) are a large and important class of non-coding RNAs longer than 200 nucleotides (Ponting et al., 2009). In recent years, a growing number of biological experiments and clinical studies have demonstrated that lncRNAs participate in almost all stages of an organism's life, from regulating the life span of single cells to maintaining the homeostasis of the whole organism, and that they are closely implicated in the occurrence and development of various complex human diseases. Many human diseases are caused by the dysfunction or abnormal expression of lncRNAs, which is reflected in the associations between lncRNAs and diseases (Kapranov et al., 2007; Mercer et al., 2009; Guttman et al., 2013). Therefore, studying lncRNA–disease associations helps to understand the pathogenesis of complex human diseases at the molecular level and is increasingly used to aid the prevention, diagnosis, and treatment of diseases (Wang and Chang, 2011). Because traditional biological experiments for identifying lncRNAs are costly, only a relatively limited number of lncRNA–disease associations have been confirmed; thus, identifying potential lncRNA–disease associations with computational models has become a hot topic in research on complex human diseases.

Many computational models that integrate large amounts of heterogeneous biological data have been proposed to predict novel lncRNA–disease associations. Broadly, they can be categorized into two types. The models in the first category are based on homogeneous or heterogeneous biological information networks. For example, Liao et al. (2011) constructed a coding–non-coding gene co-expression network to predict probable functions for 340 lncRNAs based on topological and other network characteristics. Yang et al. (2014) developed a coding–non-coding gene–disease bipartite network based on the known associations between diseases and disease-causing genes, and applied a propagation algorithm to mine 768 potential lncRNA–disease associations in the constructed network. Sun et al. (2014) proposed a global network–based model, RWRlncD, which inferred lncRNA–disease associations by applying the random walk with restart algorithm to the lncRNA functional similarity network. However, RWRlncD cannot be applied to diseases that have no verified association with any lncRNA. Chen et al. (2016) reported an improved random walk with restart model, IRWRLDA, which can be applied to diseases without any known related lncRNAs by setting the initial probability vector. Fu et al. (2018) predicted lncRNA–disease associations by decomposing the data matrices of heterogeneous data sources into low-rank matrices with matrix tri-factorization to capture their intrinsic and shared structure. Ding et al. (2018) integrated lncRNA–disease–gene information and lncRNA–disease associations to describe the heterogeneity of coding–non-coding gene–disease associations, and proposed an lncRNA–disease–gene tripartite graph to predict potential lncRNA–disease associations. Wang et al. (2019) proposed a new prediction model based on the internal inclined random walk with restart algorithm. Xie et al. (2019) proposed a method called network consistency projection, which integrates a known lncRNA–disease association network, a lncRNA–disease cosine similarity network, and a lncRNA expression similarity network, and exhibits good predictive performance. Xie et al. (2020) developed a method based on linear neighborhood similarity and unbalanced bi-random walk for lncRNA–disease association prediction. After preprocessing the sparse lncRNA–disease association matrix, an lncRNA–disease network was reconstructed according to linear neighborhood similarities, and the unbalanced bi-random walk algorithm was then used to calculate the prediction score. However, it is still challenging to predict potential lncRNA–disease associations accurately in the absence of known lncRNA–disease association information.

The second major type of computational model is based on machine learning; its main characteristic is to train a classifier on the biological features of lncRNAs and diseases. Chen and Yan (2013) reported LRLSLDA, a computational method based on Laplacian regularized least squares for predicting lncRNA–disease associations in a semi-supervised learning framework. Zhao et al. (2015) proposed a naive Bayesian classifier–based model to predict potential lncRNA–disease associations. Chen et al. (2015) proposed two novel lncRNA functional similarity calculation models (LNCSIM), which were evaluated by introducing the similarity scores into the LRLSLDA model. Lan et al. (2017) integrated a variety of gene data and trained a bagged support vector machine classifier for their lncRNA–disease association prediction model. Lu et al. (2018) developed a model called SIMCLDA to predict potential lncRNA–disease associations based on inductive matrix completion. Guo et al. (2019) proposed the LDASR model based on collaborative filtering and machine learning. Xuan et al. (2019) developed a dual convolutional neural network with attention mechanisms for predicting disease-related lncRNAs. Zeng et al. (2020) designed a hybrid computing framework called SDLDA based on linear and non-linear features of lncRNAs and diseases, and fed the fused features into fully connected layers for prediction. Sheng et al. (2021) constructed a deep learning prediction model, VADLP, which applied autoencoders for representation learning of lncRNA and disease features. Wu et al. (2020) adopted a graph autoencoder to predict lncRNA–disease associations on the lncRNA–disease bipartite graph. One of the main limitations of these machine learning models is the lack of negative samples for classifier training. To give readers a clear overview, Supplementary File S1 summarizes the aforementioned models in tabular form.

Inspired by the great success of the EAGCN method on the chemical molecule property recognition problem, the prediction of lncRNA–disease associations can be regarded as a component recognition problem in the lncRNA–disease characteristic graph. To fully mine the core features of lncRNA–disease associations in a graph with minimal redundant features, this study developed the closest node weight graph of the spatial neighborhood (CNWGSN) of lncRNA–disease associations, which is combined with the biological features of lncRNAs and diseases. It considers not only the features of disease–disease, lncRNA–lncRNA, and lncRNA–disease relations but also the lncRNA–disease features in a multidimensional space. Moreover, CNWGSN provides logical and mathematical support for the edge attention graph convolutional network (EAGCN) (Shang et al., 2018) to summarize and extract the internal features between lncRNAs and diseases. On this basis, an lncRNA–disease association prediction model based on the edge attention graph convolutional network (LDA-EAGCN) is proposed, in which the multiple edge relations in multiple graphs of lncRNAs and diseases are used to train EAGCN. Additionally, to address the lack of negative samples for training the classifier, the network-based random walk with restart algorithm was adopted: low-scoring lncRNA–disease pairs were selected randomly as negative samples. Ten-fold cross-validation and numerical experiments show that LDA-EAGCN outperformed five state-of-the-art models, with an AUC of 0.9853. Moreover, case studies of renal cell carcinoma, laryngeal cancer, and liver cancer indicated that LDA-EAGCN is capable of detecting potential lncRNA–disease associations: 24 of the 30 top-ranked lncRNAs (the top ten for each disease) have been confirmed by recently published experimental literature.

## Materials and Methods

### LncRNA–Disease Associations

One dataset used in this study was downloaded from the Lnc2Cancer 3.0 database (Ning et al., 2016); it contains 3919 lncRNA–disease associations involving 198 diseases and 639 lncRNAs. The other dataset was downloaded from the LncRNADisease v2.0 database (Chen et al., 2013); it includes 2453 lncRNA–disease associations among 378 diseases and 472 lncRNAs. All these associations have been verified by biological experiments. In addition, a controlled and hierarchical medical vocabulary was collected from the MeSH vocabulary database (Nelson et al., 2001) to standardize the disease names. MeSH is an authoritative biomedical subject vocabulary. After standardizing all the datasets and removing duplicates, 4715 lncRNA–disease associations between 786 lncRNAs and 292 diseases were obtained.

### LncRNA–Disease Correlation Matrix

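The numbers of obtained lncRNAs and diseases are denoted by $nl$ and $nd$, respectively; the lncRNA–disease correlation matrix (LDCM) is then constructed as $LDCM \in \mathbb{R}^{nl \times nd}$. The value of $LDCM(i,j)$ is calculated as Eq. 1:

$LDCM(i,j)=\begin{cases}1, & \text{if lncRNA } i \text{ is associated with disease } j,\\ 0, & \text{otherwise}.\end{cases}\quad(1)$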

In this way, the abstract correlations between lncRNAs and diseases are represented by a two-dimensional matrix which is intuitive, concise, and convenient for subsequent calculations.
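As a minimal sketch of this construction, assuming that the matrix is binary (1 for a known association, 0 otherwise), the LDCM can be built from a list of association pairs. The function name `build_ldcm` and the placeholder entity names are hypothetical and only serve the illustration.

```python
def build_ldcm(lncrnas, diseases, associations):
    """Return the binary lncRNA-disease correlation matrix as a list of rows.

    Rows follow the order of `lncrnas`, columns the order of `diseases`.
    Duplicate pairs are harmless because the entry is simply set to 1.
    """
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    ldcm = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in associations:
        ldcm[row[lnc]][col[dis]] = 1
    return ldcm


# Placeholder data (hypothetical, not taken from the databases in the text).
lncrnas = ["lnc1", "lnc2", "lnc3"]
diseases = ["disA", "disB"]
associations = [("lnc1", "disA"), ("lnc3", "disB"), ("lnc1", "disA")]

ldcm = build_ldcm(lncrnas, diseases, associations)
```

Here `ldcm` has $nl = 3$ rows and $nd = 2$ columns, and the duplicated pair contributes a single entry.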

### Disease Semantic Correlation

To calculate the semantic similarity of diseases, each disease name is represented by its MeSH descriptor, and a directed acyclic graph (DAG) is constructed. In the DAG, nodes are connected by directed edges from a more general term to a more specific term. A semantic similarity algorithm based on the hierarchical structure of disease terms was proposed by Wang et al. (2010). It makes full use of the internal branch structure of diseases, so the calculated disease similarity has sound theoretical support. The algorithm consists of three main steps.

Step 1: The relationships between the disease node $d$ and the diseases in the branches involving $d$ are extracted as a subgraph, referred to here as $DAG_d$. Using $DAG_d$, the semantic contribution value $D_d$ is calculated according to the disease branch structure. A shorter path from a node in $T_d$ (the set of all ancestor nodes of $d$, including $d$ itself) to disease $d$ contains fewer branches and fewer disease nodes, which means a stronger correlation between that node and disease $d$. Accordingly, the semantic contribution value is reduced at each intermediate node on the path to disease $d$, which has been repeatedly verified by previous studies. The semantic contribution value $D_d(t)$ of a term $t \in T_d$ to disease $d$ is calculated on $DAG_d$ as Eq. 2:

$D_d(t)=\begin{cases}1, & \text{if } t=d,\\ \max\{\Delta \cdot D_d(t') \mid t' \in \text{children of } t\}, & \text{if } t \neq d.\end{cases}\quad(2)$

The semantic contribution factor for edges linking disease $t$ with its child disease $t'$ is defined as $\Delta$, which is set to 0.5 in this study. When there are multiple paths in $DAG_d$ between a node in $T_d$ and disease $d$, the contribution along the shortest path, which is the maximum semantic contribution value, is used.

Step 2: Based on Eq. 2, the semantic value $DV(d)$ of disease $d$ is calculated as Eq. 3:

$DV(d)=\sum_{t \in T_d} D_d(t).\quad(3)$

Step 3: According to the semantic values of diseases $A$ and $B$, the semantic similarity value $DS(A,B)$ of diseases $A$ and $B$ is calculated as Eq. 4. The diseases common to $T_A$ and $T_B$ are screened out, and their semantic contributions to diseases $A$ and $B$ are summed. The ratio of this sum to the total semantic value of diseases $A$ and $B$ is taken as the similarity value of diseases $A$ and $B$.

$DS(A,B)=\frac{\sum_{t \in T_A \cap T_B}\left(D_A(t)+D_B(t)\right)}{DV(A)+DV(B)}.\quad(4)$

Finally, the semantic similarity matrix of diseases is obtained, from which the semantic similarity between any two diseases can be looked up directly.
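The three steps can be sketched in a few lines of Python. This is an illustrative sketch and not the authors' implementation: it assumes the standard formulation of Wang et al. (2010), in which a term's contribution is the maximum of $\Delta$ times the contributions of its children, and it represents the DAG by a hypothetical `parents` dictionary that maps each term to its more general terms. All term names are placeholders.

```python
DELTA = 0.5  # semantic contribution factor used in the text


def semantic_contributions(disease, parents, delta=DELTA):
    """Step 1: return {t: D_d(t)} for every t in T_d (ancestors plus d itself).

    Contributions are propagated upward from `disease`; when several paths
    reach the same ancestor, the largest value (shortest path) is kept.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for term in frontier:
            for parent in parents.get(term, []):
                value = delta * contrib[term]
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib


def semantic_value(contrib):
    """Step 2: DV(d), the sum of all contributions to d."""
    return sum(contrib.values())


def semantic_similarity(a, b, parents, delta=DELTA):
    """Step 3: DS(A, B) from the shared ancestors of A and B."""
    da = semantic_contributions(a, parents, delta)
    db = semantic_contributions(b, parents, delta)
    shared = set(da) & set(db)
    numerator = sum(da[t] + db[t] for t in shared)
    return numerator / (semantic_value(da) + semantic_value(db))


# Placeholder hierarchy (hypothetical): A and B share the parent P.
parents = {"A": ["P"], "B": ["P"], "P": ["root"]}
ds_ab = semantic_similarity("A", "B", parents)

# A term reachable by two paths keeps the contribution of the shorter one.
two_paths = {"D": ["X", "root"], "X": ["root"]}
contrib_d = semantic_contributions("D", two_paths)
```

In the toy hierarchy, each of A and B has contributions 1, 0.5, and 0.25, so $DV = 1.75$ and $DS(A,B) = 1.5/3.5$. A disease compared with itself gives a similarity of 1.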

### LncRNA Function Correlation

Based on the assumption that lncRNAs with similar functions are likely to be associated with similar diseases, the functional similarity of lncRNAs can be calculated from the similarities of the diseases associated with them. Chen et al. (2015) developed novel lncRNA functional similarity calculation models for lncRNA–disease association prediction, and these models are adopted in this study.

Let $L_m$ and $L_n$ denote two lncRNAs. The diseases associated with $L_m$ are denoted by $d_{mi}$ and form the set $DT_m=\{d_{m1},d_{m2},\ldots,d_{mm}\}$, and the diseases related to $L_n$ form the set $DT_n=\{d_{n1},d_{n2},\ldots,d_{nn}\}$; here $m$ and $n$ also denote the numbers of diseases in $DT_m$ and $DT_n$. The core idea is to calculate the functional similarity between $L_m$ and $L_n$ from the similarity values of the diseases in $DT_m$ and $DT_n$. First, the similarity values between a disease $d_{ml}$ in $DT_m$ and all diseases in $DT_n$ are calculated in turn, and the maximum similarity value, which corresponds to the minimum distance between $d_{ml}$ and the disease set $DT_n$, is taken as the maximum disease score $S(d_{ml},DT_n)$, as shown in Eq. 5. Second, the maximum disease score between each disease in $DT_n$ and the disease set $DT_m$ is obtained in the same way. Finally, the maximum disease scores of all diseases in $DT_m$ and $DT_n$ are summed and divided by the total number of elements in $DT_m$ and $DT_n$, which gives the functional similarity score $SCORE(L_m,L_n)$ shown in Eq. 6:

$S(d_{ml},DT_n)=\max_{1\le i\le n} S(d_{ml},d_{ni}),\quad(5)$

$SCORE(L_m,L_n)=\frac{\sum_{1\le i\le m} S(d_{mi},DT_n)+\sum_{1\le j\le n} S(d_{nj},DT_m)}{m+n}.\quad(6)$

The specific values of disease semantic similarity matrices and lncRNA similarity matrices are offered in Supplementary Files S2, S3, respectively.
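Eqs. 5 and 6 translate almost directly into code. The sketch below is illustrative only: the disease similarity is passed in as a function `sim`, and the toy similarity table, the disease names, and the helper `toy_sim` are hypothetical placeholders rather than values from the supplementary files.

```python
def best_match(disease, disease_set, sim):
    """Eq. 5: highest similarity between `disease` and any member of the set."""
    return max(sim(disease, other) for other in disease_set)


def functional_similarity(dt_m, dt_n, sim):
    """Eq. 6: average of the best-match scores in both directions."""
    total = sum(best_match(d, dt_n, sim) for d in dt_m)
    total += sum(best_match(d, dt_m, sim) for d in dt_n)
    return total / (len(dt_m) + len(dt_n))


# Placeholder semantic similarities between three diseases (hypothetical).
_table = {("d1", "d2"): 0.4, ("d1", "d3"): 0.2, ("d2", "d3"): 0.6}


def toy_sim(a, b):
    """Symmetric lookup; a disease is fully similar to itself."""
    if a == b:
        return 1.0
    return _table.get((a, b), _table.get((b, a), 0.0))


dt_m = ["d1", "d2"]  # diseases associated with lncRNA L_m
dt_n = ["d2", "d3"]  # diseases associated with lncRNA L_n
score = functional_similarity(dt_m, dt_n, toy_sim)
```

For this toy input the four best-match scores are 0.4, 1.0, 1.0, and 0.6, so the functional similarity is $3.0/4 = 0.75$.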

### Negative Samples

To better train the LDA-EAGCN model, the random walk with restart on heterogeneous network (RWRH) algorithm (Li and Patra, 2010) was used to generate negative samples. This algorithm ranks all candidate associations according to the network structure, and lncRNA–disease pairs with low correlation scores are screened as negative samples.

The RWRH algorithm consists of three main steps. First, the lncRNA nodes, the disease nodes, and the heterogeneous network of their associations and similarities are generated. Second, a seed node is selected as the starting node of the walk. Third, the transition matrix that governs every jump of the walk is constructed. Finally, negative samples in proportion to the positive samples are randomly drawn from the lncRNA–disease pairs with low association probabilities; the detailed prediction results are provided in Supplementary File S4.
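The idea of scoring pairs by a random walk and drawing negatives from the low-scoring end can be sketched as follows. This is a simplified illustration and not the full RWRH procedure: it runs a plain random walk with restart on one merged weighted graph, without the separate inter-network jump probability of RWRH. The restart probability of 0.7, the choice of the lowest-scoring half as the sampling pool, and the toy network are assumptions made for the example.

```python
import random


def random_walk_with_restart(adj, seed, restart=0.7, tol=1e-12, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at `seed`.

    `adj` is a symmetric weight matrix given as a list of rows.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            # Probability mass flowing into node i from every neighbour j.
            inflow = sum(adj[i][j] / col_sums[j] * p[j]
                         for j in range(n) if col_sums[j] > 0)
            nxt.append((1 - restart) * inflow + restart * p0[i])
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


def sample_negatives(scores, known, n_neg, rng):
    """Draw n_neg unknown pairs at random from the lowest-scoring half."""
    unknown = sorted((pair for pair in scores if pair not in known),
                     key=lambda pair: scores[pair])
    pool = unknown[:max(n_neg, len(unknown) // 2)]
    return rng.sample(pool, n_neg)


# Toy network (hypothetical): lncA-lncB similarity, lncA-disX known
# association, disX-disY similarity.
nodes = ["lncA", "lncB", "disX", "disY"]
adj = [
    [0.0, 0.5, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]
known = {("lncA", "disX")}

scores = {}
for lnc in (0, 1):  # each lncRNA is used as the seed once
    probs = random_walk_with_restart(adj, seed=lnc)
    for dis in (2, 3):
        scores[(nodes[lnc], nodes[dis])] = probs[dis]

negatives = sample_negatives(scores, known, n_neg=1, rng=random.Random(0))
```

In this toy network the pair farthest from any known association, (lncB, disY), receives the lowest score and is therefore the one selected as a negative sample.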

### Edge Attention Graph Convolution Networks

A convolutional neural network (CNN) is a kind of deep neural network that is widely used in biomedical relation detection. A graph convolutional network (GCN) is a generalization of the CNN to arbitrarily structured graphs. The edge attention–based multi-relational graph convolutional network (EAGCN) (Shang et al., 2018) is a model that learns multiple edge relations and extracts node features in multiple graphs.

The flowchart of EAGCN is shown in Figure 1. It consists of four layers and three fully connected layers; each layer contains five blocks, and each block contains a Conv2d convolution and a GraphCov_base graph convolution. EAGCN was originally applied to deep learning in chemistry, where it learned the molecular properties of compounds directly from molecular graphs.

FIGURE 1. Flowchart of EAGCN method.

In this study, the prediction of lncRNA–disease associations is treated as a binary classification problem of component recognition on the lncRNA–disease characteristic graph. The structural information of lncRNA–disease associations is fed into the network to train the classifier of the prediction model.

### LDA-EAGCN Model

Although the high-dimensional features of lncRNA–disease associations extracted by multilayered deep learning methods cannot be clearly interpreted or directly observed, their internal logic and rules can be used to predict unknown relationships between lncRNAs and diseases. To introduce the EAGCN algorithm into lncRNA–disease association prediction, the graphs of lncRNA–disease association pairs were first constructed. To fully excavate the internal logic features and reduce the functional redundancy of lncRNA–disease associations, the closest node weight graph of the spatial neighborhood of lncRNA–disease (CNWGSN) was then proposed. It is combined with the biological features of lncRNAs and diseases, and provides logical and mathematical support for EAGCN to learn and summarize the internal relationship between lncRNAs and diseases. CNWGSN takes into account not only the features of the disease–disease relationship, the lncRNA–lncRNA relationship, and the known lncRNA–disease associations but also the known features of lncRNAs and diseases in a multidimensional feature space.

Based on the above, a novel model, LDA-EAGCN, which comprises the following three main steps was proposed.

Step 1: Construct the adjacency matrix of lncRNA–disease associations, and calculate the disease–disease semantic correlation matrix $DDSCM \in \mathbb{R}^{nd \times nd}$ and the lncRNA–lncRNA functional correlation matrix $LLCM \in \mathbb{R}^{nl \times nl}$.

Step 2: Construct the closest node weight graph of the spatial neighborhood (CNWGSN) of each lncRNA–disease association. It contains two classes of nodes, lncRNA $l_i$ and disease $d_i$, which come from the lncRNA–disease correlations (LDC). The $M$ top-ranking disease nodes, $D_i=\{d_{i1},\ldots,d_{ij},\ldots,d_{iM}\}$, are those most closely related to disease $d_i$ in the disease–disease semantic correlation matrix (DDSCM), and the $N$ top-ranking lncRNA nodes, $L_i=\{l_{i1},\ldots,l_{ij},\ldots,l_{iN}\}$, are those most closely related to lncRNA $l_i$ in the LLCM. In the topological sense, the closest lncRNA node weight graph (CLNWG) of lncRNA $l_i$ is constructed from the LLCM: the $N$ top-ranking lncRNA nodes closely related to lncRNA $l_i$ are screened out as nodes, and the edge weights of the CLNWG are taken from the correlation values of the LLCM. In the same way, the $M$ top-ranking disease nodes closely related to disease $d_i$ are screened out as nodes, and the weights of the closest node weight graph of disease $d_i$ are taken from the correlation values of the DDSCM. Then the closest node weight graphs of $L_i$ and $D_i$ and the spatial neighborhood features are integrated into the CNWGSN features of lncRNA $l_i$ and disease $d_i$.
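The node-selection part of this step can be illustrated with a short sketch. It only shows how the top-ranking neighbors of one node might be picked from a similarity matrix and how the weights among them are read off. The function names, the exclusion of self-loops, and the toy matrix are assumptions for illustration; the full CNWGSN additionally combines the lncRNA graph, the disease graph, and the known associations.

```python
def closest_nodes(sim_row, index, k):
    """Indices of the k nodes most similar to `index`, excluding itself."""
    candidates = [j for j in range(len(sim_row)) if j != index]
    candidates.sort(key=lambda j: sim_row[j], reverse=True)
    return candidates[:k]


def closest_node_weight_graph(sim, index, k):
    """Return (nodes, weights) for `index` and its k closest neighbours.

    weights[a][b] is the similarity between nodes[a] and nodes[b];
    the diagonal is set to 0 (no self-loops, an assumption of this sketch).
    """
    nodes = [index] + closest_nodes(sim[index], index, k)
    weights = [[sim[a][b] if a != b else 0.0 for b in nodes] for a in nodes]
    return nodes, weights


# Toy lncRNA functional similarity matrix (hypothetical values).
llcm = [
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.4],
    [0.5, 0.2, 0.4, 1.0],
]

nodes, weights = closest_node_weight_graph(llcm, index=0, k=2)
```

For lncRNA 0 and two neighbors, the selected nodes are 0, 1, and 3, and the weight matrix contains the pairwise similarities among them. The same function applied to the disease similarity matrix would give the disease side of the graph.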

The edges of the CNWGSN feature graph are divided into four categories: the predicted edges, which are the edges to be predicted between the input lncRNA and the input disease; the spatial neighborhood edges, which are the known associations between diseases and lncRNAs; the lncRNA edges, which carry the lncRNA function correlation, $ll\_att \in \mathbb{R}^{(N+M)\times(N+M)\times 2}$; and the disease edges, which carry the disease semantic correlation, $dd\_att \in \mathbb{R}^{(N+M)\times(N+M)\times 2}$. The formulas for calculating the four kinds of edges are as follows.

Step 3: The features extracted from lncRNA–disease associations with CNWGSN are used as the training samples of EAGCN. In parallel, the constructed negative samples of lncRNA–disease associations are introduced into the training, which helps to improve the prediction accuracy of the correlation scores. The flowchart of LDA-EAGCN is shown in Figure 2.

FIGURE 2. Flowchart of LDA-EAGCN. (A) Construction and calculation of lncRNA–disease correlation matrix (LDCM), disease–disease semantic correlation matrix (DDSCM), and lncRNA–lncRNA function correlation matrix; (B) constructing the closest lncRNA node weight graph (CLNWG) of the lncRNA and disease in lncRNA–disease correlations (LDCs); (C) training edge attention graph convolution networks (EAGCN); (D) predicting correlation scores of input data with EAGCN.

## Results

### Implementation Details of LDA-EAGCN

After name standardization and redundancy removal, all 4715 known lncRNA–disease associations were labeled as positive samples, and an equal number of negative samples was constructed with the RWRH method. These samples form the dataset for assessing the prediction performance of the LDA-EAGCN model. During training, optimized parameters of the EAGCN model were adopted to avoid overfitting and poor generalization, such as the dropout rate $dr = 0.3$ and the learning rate $\alpha$.

### Evaluation Methods and Metrics

To ensure the reliability of the predictive results, 10-fold cross-validation is employed to evaluate the LDA-EAGCN model. The data are divided equally into 10 parts, and the procedure is repeated 10 times so that each part is used as the validation set exactly once. The average performance over the 10 training sessions is taken as the final result. The receiver operating characteristic (ROC) curve, which describes the relationship between the true positive rate (TPR) and the false positive rate (FPR) under different thresholds, is used to evaluate the performance of the LDA-EAGCN model. The larger the area under the ROC curve (AUC), the better the prediction performance. In the 10-fold cross-validation of the LDA-EAGCN model, the average AUC value reached 0.9854 (Figure 3). In a 5-fold cross-validation experiment, the average AUC value reached 0.9885 (Figure 4).

FIGURE 3. ROC curves of LDA-EAGCN in different situations of 10-fold cross-validations.

FIGURE 4. ROC curves of LDA-EAGCN in different situations of 5-fold cross-validation.
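The evaluation protocol can be sketched without any external library. The sketch below is illustrative and makes two assumptions that the text does not state: the folds are stratified so that every fold contains both classes, and the AUC is computed through its rank interpretation (the probability that a positive sample is scored above a negative one). The callback `train_and_score` is a hypothetical stand-in for training the model on the training indices and scoring the held-out indices.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties count as half a win. The pairwise loop is quadratic, which is
    acceptable for a sketch but slow for very large folds.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with both classes in every fold."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    return [pos[f::k] + neg[f::k] for f in range(k)]


def cross_validated_auc(labels, train_and_score, k=10, seed=0):
    """Average AUC over k folds; each fold is held out exactly once."""
    aucs = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores = train_and_score(train_idx, test_idx)
        aucs.append(roc_auc([labels[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)


# Toy run with a "perfect" scorer that simply returns the true labels.
labels = [1, 0] * 10


def perfect(train_idx, test_idx):
    return [float(labels[i]) for i in test_idx]


mean_auc = cross_validated_auc(labels, perfect, k=5)
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the small example, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75, while the perfect scorer reaches a mean AUC of 1.0.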

To check whether the results of LDA-EAGCN are overfitted, one-tenth of the samples were further separated as an independent dataset, and the remaining samples were used to train the classifier in LDA-EAGCN. The ROC curves of the training set, the testing set, and the validation set are shown in Figure 5. The AUC value of LDA-EAGCN reached 0.9843 on the validation set, which demonstrates that the excellent performance in 10-fold cross-validation was not a result of overfitting.

FIGURE 5. ROC curves of independent testing.

In addition, to evaluate LDA-EAGCN comprehensively, further metrics were calculated: accuracy (ACC), sensitivity (SEN), specificity (SPEC), precision (PREC), and the Matthews correlation coefficient (MCC). More details of these metrics can be seen in Tables 1–3.

TABLE 1. Results of 10-fold cross-validation.

TABLE 2. Results of 5-fold cross-validation.

TABLE 3. Results of independent testing.
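For reference, the five metrics named above can be computed from the counts of a confusion matrix using their standard definitions. The sketch below is illustrative; the function name and the toy counts are hypothetical and are not values from Tables 1–3.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SEN": tp / (tp + fn),   # sensitivity, the true positive rate
        "SPEC": tn / (tn + fp),  # specificity, the true negative rate
        "PREC": tp / (tp + fp),
        # MCC is conventionally reported as 0 when the denominator vanishes.
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Toy counts (hypothetical): 50 positive and 50 negative samples.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

With these counts, ACC, SEN, SPEC, and PREC are all 0.8, and MCC is $1500/2500 = 0.6$.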

To show that each association network contributes to the performance of the model, each association network was removed in turn to build subgraphs, and the performance of the model was recalculated. The results demonstrated that the model achieved the best performance when all association networks were used. The detailed results can be seen in Supplementary File S6.

### Comparison With Other Models

In this study, the LDA-EAGCN model was compared with five other state-of-the-art models for lncRNA–disease association prediction: LDA-LNSUBRW (Xie et al., 2020), LDASR (Guo et al., 2019), NCPHLDA (Xie et al., 2019), SDLDA (Zeng et al., 2020), and TPGLDA (Ding et al., 2018). LDA-LNSUBRW is an lncRNA–disease association prediction method based on linear neighborhood similarity and unbalanced bi-random walk. LDASR obtains feature vectors by integrating lncRNA Gaussian interaction profile kernel similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity, and then uses the rotation forest algorithm to predict lncRNA–disease associations. NCPHLDA integrates the lncRNA cosine similarity network, the disease cosine similarity network, and the known lncRNA–disease association network, and predicts by network consistency projection. SDLDA is a hybrid computing framework that uses singular value decomposition and deep learning to extract linear and non-linear features of lncRNAs and diseases, respectively, and then combines the linear and non-linear features for training. TPGLDA is an lncRNA–disease association prediction method based on the lncRNA–disease–gene tripartite graph, which combines gene–disease associations and lncRNA–disease associations. Each model in the comparison was trained with the same training set and tested with the same test set in the cross-validation.

The ROC and precision–recall (PR) curves of all the models in the comparison are given in Figures 6, 7. The AUC of the LDA-EAGCN model reaches 0.9853, which is 0.1141, 0.0317, 0.0966, 0.0468, and 0.0815 higher than those of the SDLDA, LDASR, LDA-LNSUBRW, TPGLDA, and NCPHLDA models, respectively. The area under the PR curve (AUPR) of the LDA-EAGCN model reaches 0.9820, which is 0.5047, 0.0407, 0.641, 0.3813, and 0.6618 higher than those of the SDLDA, LDASR, LDA-LNSUBRW, TPGLDA, and NCPHLDA models, respectively. An overview of the data involved in each comparison model is given in Supplementary File S7.

FIGURE 6. ROC curves of all the models in comparison.

FIGURE 7. PR curves of all the models in comparison.

### Negative Sample Comparison

To examine the reliability of the negative samples used in the experiments, the RWRH negative samples, that is, the associations with low scores in the RWRH algorithm, were compared with randomly selected unknown lncRNA–disease associations. In 10-fold cross-validation, the AUC values obtained with RWRH negative samples and with randomly selected negative samples are 0.9853 and 0.9632, respectively (Figure 8). These experiments support the reliability of the method for generating negative samples in LDA-EAGCN.

FIGURE 8. ROC curves of Negative sample comparison.

### Case Studies

To further demonstrate the predictive ability of the LDA-EAGCN model, case studies were performed on kidney cancer, laryngeal cancer, and liver cancer. First, the 4715 known lncRNA–disease associations and an equal number of generated negative samples were used for model training. Then the closest node weight graphs of the spatial neighborhood were generated for each of the three diseases paired with the lncRNAs that have no known association with it, and these graphs were used as the input of LDA-EAGCN to obtain predicted correlation scores for the unknown lncRNA–disease pairs. Finally, the predicted correlation scores were sorted in descending order, and the top 10 lncRNAs for each of the three diseases were checked against the literature. Among the top ten lncRNAs predicted for renal cell carcinoma, laryngeal cancer, and liver cancer, eight lncRNAs for each disease are supported by recent experimental literature, which indicates that the LDA-EAGCN model performs well in predicting unknown relationships. The scores of each lncRNA–disease pair in the experimental data are available in Supplementary File S8.

Kidney neoplasms are cancers that originate from kidney tissues and are among the ten most common cancers; renal cell carcinoma accounts for the vast majority of kidney cancer cases (Linehan and Rathmell, 2012). Despite considerable research effort on the genetics of kidney neoplasms, how they arise is still poorly understood. To confirm the validity of the model, LDA-EAGCN was used to predict potential kidney neoplasm–related lncRNAs. As a result, eight of the top ten potential lncRNAs related to kidney neoplasms have been validated by recent experimental literature (Table 4); they ranked 1st, 2nd, 3rd, 4th, 6th, 7th, 9th, and 10th in the prediction results. For example, recent studies have found that CDKN2B-AS1 can be used as a biomarker for poor prognosis of kidney neoplasms (Angenard et al., 2019), DUXAP8 enhances the progression of kidney neoplasms by downregulating miR-126 (Huang et al., 2018), and HOTAIRM1 is downregulated in kidney neoplasms and inhibits hypoxia (Hamilton et al., 2020).

TABLE 4. Case study results of kidney neoplasms.

Laryngeal neoplasm is a common malignant tumor that accounts for 4.5% of systemic malignancies, and it is the second most common malignant tumor of the head and neck (Obid et al., 2019). The loss of laryngeal function greatly affects speech, swallowing, and some special senses. Therefore, it is imperative to identify novel lncRNAs for the early diagnosis, prognosis, and treatment of laryngeal neoplasms. Accumulating evidence has demonstrated that lncRNAs play critical roles in the development and progression of laryngeal neoplasms (Xiang et al., 2019; Zhang G et al., 2019; Li et al., 2020). LDA-EAGCN was further used to identify lncRNAs associated with laryngeal neoplasms. As a result, eight of the top ten potential lncRNAs related to laryngeal neoplasms have also been validated by recent experimental literature (Table 5); they ranked 1st, 2nd, 3rd, 4th, 5th, 7th, 8th, and 9th in the prediction results. For example, CDKN2B-AS1 regulates the cell cycle of laryngeal neoplasms (F. Liu et al., 2020), PVT1 regulates miR-519d-3p to promote the development of laryngeal neoplasms (Zheng et al., 2019), and CCAT1 regulates the progression of laryngeal neoplasms (Zhang and Hu, 2017). Notably, lncRNA GAS5, which ranked second in the prediction, was recently reported to inhibit the proliferation and metastasis of laryngeal neoplasms by regulating the PI3K/AKT/mTOR signaling pathway (Liu et al., 2021).

TABLE 5. Case study results of laryngeal cancer.

Liver neoplasm is a common malignant cancer globally and the second leading cause of cancer death worldwide (Yamashita and Kaneko, 2016). The occurrence and rate of development of liver neoplasms often depend on host, disease, and environmental factors and their complex interactions. Numerous experimental results show that the development and progression of liver neoplasms are closely related to the mutation and dysregulation of some lncRNAs (Wang et al., 2017; Zhang Z et al., 2019; Zhang et al., 2020). LDA-EAGCN was applied to predict lncRNAs potentially related to liver neoplasms. By mining recent experimental literature, eight of the top ten potential lncRNAs related to liver neoplasms were validated (Table 6); they ranked 1st, 2nd, 3rd, 5th, 6th, 8th, 9th, and 10th in the prediction results. For example, BANCR can be used as a potential therapeutic target for liver neoplasms (Zhou and Gao, 2016), NEAT1 is necessary for the expression of the liver neoplasm marker CD44 (Koyama et al., 2020), and LINC00473 promotes the progression of liver cancer by acting as a microRNA-195 ceRNA and increasing HMGA2 expression (Mo et al., 2019).

TABLE 6. Case study results of liver cancer.

## Discussion

In this study, a model based on the closest node weight graph of the spatial neighborhood and edge attention graph convolutional networks was proposed to predict disease-related lncRNAs from multisource data. Inspired by the great success of the EAGCN method on the chemical molecule property recognition problem, the prediction of lncRNA–disease associations can be regarded as a component recognition problem on the lncRNA–disease characteristic graph. The CNWGSN features of lncRNA–disease associations, combined with known lncRNA–disease associations, were used to train EAGCN, and the correlation scores of the input data were predicted with EAGCN to judge whether the input lncRNAs are associated with the input diseases.

To extract the core features of the lncRNA–disease relationship in a graph and remove redundancy, the closest node weight graph of the spatial neighborhood (CNWGSN) of lncRNA–disease associations was constructed. It considers not only the features of disease–disease relationships, lncRNA–lncRNA relationships, and the associations between diseases and lncRNAs but also the features of lncRNAs and diseases in a multidimensional space. In addition, CNWGSN provides a logical and mathematical basis for EAGCN to learn and summarize the internal relationship between lncRNAs and diseases. The lncRNA–disease features are then fed into the edge attention-based multi-relational graph convolutional network (EAGCN), which learns multiple edge relations across multiple graphs. To address the lack of negative samples, lncRNA–disease pairs with low correlation scores from the RWRH algorithm are randomly selected as negative samples.
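The negative-sampling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_negatives`, the `low_fraction` cutoff that defines a "low" score, and the toy scores are all assumptions made for illustration, and the score dictionary simply stands in for the output of RWRH.

```python
import random


def sample_negatives(scores, known_pairs, n_samples, low_fraction=0.3, seed=0):
    """Randomly pick negatives from the lowest-scoring unlabeled pairs.

    scores:       dict mapping (lncRNA, disease) -> correlation score
                  (a stand-in for RWRH output; values here are made up)
    known_pairs:  set of (lncRNA, disease) pairs with confirmed associations
    low_fraction: hypothetical cutoff, the share of unlabeled pairs that
                  counts as "low correlation"
    """
    # Only pairs without a known association can serve as negatives.
    unlabeled = [pair for pair in scores if pair not in known_pairs]
    # Ascending order, so the least-correlated pairs come first
    # (the pair itself breaks ties to keep the order deterministic).
    unlabeled.sort(key=lambda pair: (scores[pair], pair))
    # Candidate pool: the lowest-scoring share, widened if it is too small.
    pool_size = max(n_samples, int(len(unlabeled) * low_fraction))
    pool = unlabeled[:pool_size]
    if n_samples > len(pool):
        raise ValueError("not enough unlabeled pairs to sample from")
    # Random selection within the low-score pool.
    return random.Random(seed).sample(pool, n_samples)


# Toy example with invented lncRNA/disease identifiers and scores.
toy_scores = {
    ("L1", "D1"): 0.90, ("L1", "D2"): 0.05,
    ("L2", "D1"): 0.10, ("L2", "D2"): 0.80,
    ("L3", "D1"): 0.02, ("L3", "D2"): 0.60,
}
toy_known = {("L1", "D1"), ("L2", "D2")}
negatives = sample_negatives(toy_scores, toy_known, n_samples=2, low_fraction=0.5)
```

Restricting the random draw to a low-score pool, rather than to all unlabeled pairs, reduces the chance that an unconfirmed but true association is labeled as a negative.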

LDA-EAGCN performed well in 10-fold cross-validation, reaching a mean AUC of 0.9853, which is higher than that of five other state-of-the-art models. In the case studies, 24 of the 30 lncRNAs in the top ten predictions for kidney cancer, laryngeal cancer, and liver cancer were verified to be associated with the respective diseases.
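For readers unfamiliar with how a mean AUC over 10 folds is obtained, the following Python sketch shows one common way to compute it. It is an illustration under stated assumptions, not the evaluation code used in this study: the stratified split, the helper names (`auc`, `stratified_folds`, `mean_cv_auc`), and the `score_fn` callback that stands in for training and scoring a model are all hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def mean_cv_auc(labels, score_fn, k=10, seed=0):
    """Average the per-fold AUC over k folds.

    score_fn(train_idx, test_idx) must return one score per test index;
    it is a placeholder for fitting a model on the training folds and
    scoring the held-out fold.
    """
    folds = stratified_folds(labels, k, seed)
    fold_aucs = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        test_scores = score_fn(train_idx, test_idx)
        fold_aucs.append(auc([labels[j] for j in test_idx], test_scores))
    return sum(fold_aucs) / len(fold_aucs)


# Toy check: 20 positives, 20 negatives, and an oracle scorer that
# returns the true label, so every fold is ranked perfectly.
toy_labels = [1] * 20 + [0] * 20
oracle = lambda train_idx, test_idx: [float(toy_labels[j]) for j in test_idx]
toy_mean_auc = mean_cv_auc(toy_labels, oracle, k=10)
```

The stratified split is a design choice made here so that every fold contains both classes; without it, a fold with only one class would have no defined AUC.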

Although the model achieves good results, there is still room for improvement. At present, it uses only lncRNA–disease data; more types of biological data and more elaborately designed fusion methods could be incorporated in the future.

## Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

## Author Contributions

JL conceived and designed the study; MK and DW developed the algorithm and performed the statistical analysis; MK, ZY, and XH wrote the code; MK drafted the original manuscript; JL and XH revised the manuscript. All authors read and approved the final manuscript.

## Funding

This work was supported by the National Natural Science Foundation of China (grant Nos. 81672113 and 62072154) and the Natural Science Foundation of Hebei Province (grant No. C2018202083).

## Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

## Acknowledgments

We thank the members of our groups for their valuable discussions.

## Supplementary Material

Supplementary File 1 | Summarization of lncRNA–disease prediction models.

Supplementary File 2 | Disease semantic similarity scores.

Supplementary File 3 | LncRNA functional similarity scores.

Supplementary File 4 | LncRNA–disease correlation scores of RWRH.

Supplementary File 5 | Experimental details.

Supplementary File 6 | Details of deleted associated networks.

Supplementary File 7 | Details of data involved in each comparison model.

Supplementary File 8 | LncRNA–disease correlation scores of LDA-EAGCN predictions.