# Document-level medical relation extraction via edge-oriented graph neural network based on document structure and external knowledge – BMC Medical Informatics and Decision Making

#### By Tao Li, Ying Xiong, Xiaolong Wang, Qingcai Chen and Buzhou Tang

Dec 30, 2021

In this section, we first introduce RE based on document structure, and then RE based on external knowledge from two aspects: knowledge graph and entity description.

### Relation extraction based on document structure

A document usually has a hierarchical structure, as shown in Fig. 2, where a document \(d_1\) consists of two chapters \(c_1\) and \(c_2\), and each chapter contains several sentences with many entity mentions. Suppose that a sentence is \(s = w_1 w_2 \ldots w_{|s|}\); it can be represented as \(H_s^{local} = [h_1^{local}, h_2^{local}, \ldots, h_{|s|}^{local}]\) via an encoding layer.

In a document \(d = s_1, s_2, \ldots, s_{|d|}\) with \(|d|\) sentences, there are five kinds of nodes corresponding to the document structure:

• Mention Node (M). Each mention node \(m\) is represented as \(n_m = [\mathrm{avg}_{w_i \in m}(h_i^{local}); t_m]\), where ';' denotes the concatenation operation and \(t_m\) is an embedding representing the mention node type.

• Entity Node (E). An entity \(e\) is represented as \(n_e = [\mathrm{avg}_{m \in e}(n_m); t_e]\), where \(\mathrm{avg}_{m \in e}(n_m)\) is the average representation of all mentions corresponding to \(e\), and \(t_e\) is an embedding representing the entity node type.

• Sentence Node (S). Each sentence node \(s\) is represented as \(n_s = [\mathrm{avg}_{w_i \in s}(h_i^{local}); t_s]\), where \(t_s\) is an embedding representing the sentence node type.

• Chapter Node (C). A chapter node \(c\) is represented by the average representation of all sentence nodes it contains together with the embedding of the chapter node type, that is, \(n_c = [\mathrm{avg}_{s \in c}(n_s); t_c]\);

• Document Node (D). A document node \(d\) is represented by the average representation of all chapter nodes together with the embedding of the document node type: \(n_d = [\mathrm{avg}_{c \in d}(n_c); t_d]\).
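The node definitions above can be sketched with toy NumPy vectors; the dimensions, the random stand-in encoder outputs, and the helper name `node_rep` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_t = 8, 4  # hidden size of word states and node-type embedding size (assumed)

def node_rep(vectors, type_emb):
    """A node is [average of its constituent representations ; node-type embedding]."""
    return np.concatenate([vectors.mean(axis=0), type_emb])

# toy sentence of 5 words "encoded" by a stand-in local encoder
H_local = rng.normal(size=(5, d_h))
t_m, t_s = rng.normal(size=d_t), rng.normal(size=d_t)

n_m = node_rep(H_local[1:3], t_m)  # mention node over words w_2..w_3
n_s = node_rep(H_local, t_s)       # sentence node over the whole sentence
```

Entity, chapter and document nodes follow the same pattern, each averaging the representations one level lower in the hierarchy.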

Given the five kinds of nodes above, we connect them with the following eight kinds of edges, as shown in Fig. 3:

• Mention-Sentence (MS). When an entity mention \(m\) appears in a sentence \(s\), there is an edge between the corresponding mention node and sentence node, represented as \(e_{MS} = [n_m; n_s]\);

• Mention-Mention (MM). When two entity mentions \(m_1\) and \(m_2\) appear in the same sentence \(s\), there is an edge between the two corresponding mention nodes \(n_{m_1}\) and \(n_{m_2}\). The edge is represented as \(e_{MM} = [n_{m_1}; n_{m_2}; c_{m_1 m_2}; d(m_1, m_2)]\), where \(d(m_1, m_2)\) is the representation of the relative distance between the two entity mentions in the sentence, and \(c_{m_1 m_2}\) is the attention vector between the two entity mentions calculated by the following equations:

$$\alpha_{k,i} = n_{m_k}^{T} w_i, \tag{1}$$

$$a_{k,i} = \frac{\exp\left(\alpha_{k,i}\right)}{\sum_{j \in [1,n],\, j \notin m_k} \exp\left(\alpha_{k,j}\right)}, \tag{2}$$

$$a_i = \frac{a_{1,i} + a_{2,i}}{2}, \tag{3}$$

$$c_{m_1, m_2} = H^{T} a, \tag{4}$$

where \(k \in \{1, 2\}\), \(a_i\) is the attention weight of the \(i\)th word with respect to the entity mention pair \(\langle m_1, m_2 \rangle\), and \(H \in \mathbb{R}^{d_{hidden} \times |s|}\) is the representation of sentence \(s\);

• Entity-Mention (ME). There is an edge between a mention node \(m\) and the corresponding entity node \(e\), that is, \(e_{ME} = [n_m; n_e]\);

• Sentence-Sentence (SS). There are edges between any two sentence nodes in a document. An SS edge is represented as \(e_{SS} = [n_{s_i}; n_{s_j}; d(s_i, s_j); |n_{s_i} - n_{s_j}|]\) \((i \ne j)\), where \(n_{s_i}\) and \(n_{s_j}\) are the representations of \(s_i\) and \(s_j\), and \(d(s_i, s_j)\) is the representation of the relative distance between \(s_i\) and \(s_j\), measured by the number of sentences between them;

• Entity-Sentence (ES). When a mention node \(m\) corresponding to an entity node \(e\) appears in a sentence \(s\), there is an edge between \(e\) and \(s\), represented as \(e_{ES} = [n_e; n_s]\);

• Sentence-Chapter (SC). There is an edge between a sentence node \(s\) and a chapter node \(c\), represented as \(e_{SC} = [n_s; n_c]\);

• Chapter-Chapter (CC). There is an edge between two chapter nodes \(c_1\) and \(c_2\) in a document, represented as \(e_{CC} = [n_{c_1}; n_{c_2}]\);

• Chapter-Document (CD). There is an edge between a chapter node \(c\) and a document node \(d\), represented as \(e_{CD} = [n_d; n_c]\).
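The mention-pair attention of Eqs. (1)-(4) used in the MM edge can be sketched as follows; the sizes, the toy inputs, and the choice to zero the weights on a mention's own words (implied by the softmax denominator in Eq. (2)) are assumptions, and \(H\) is stored with one column per word so that Eq. (4) becomes a plain matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n = 8, 6                    # hidden size and sentence length (assumed)
H = rng.normal(size=(hidden, n))    # sentence representation, one column per word
n_m = rng.normal(size=(2, hidden))  # mention-node representations n_{m_1}, n_{m_2}
mentions = ({1}, {4})               # word indices covered by each mention (hypothetical)

def mention_pair_attention(H, n_m, mentions):
    alpha = n_m @ H  # Eq. (1): raw score of mention k against each word, shape (2, n)
    a = np.zeros_like(alpha)
    for k, mention in enumerate(mentions):
        outside = np.array([i not in mention for i in range(H.shape[1])])
        e = np.exp(alpha[k])
        # Eq. (2): softmax normalized over words outside mention k
        a[k] = np.where(outside, e / e[outside].sum(), 0.0)
    a_bar = a.mean(axis=0)  # Eq. (3): average the two mentions' weight vectors
    return H @ a_bar        # Eq. (4): context vector c_{m1,m2}

c = mention_pair_attention(H, n_m, mentions)
```

The resulting `c` is the context vector \(c_{m_1 m_2}\) concatenated into the MM edge representation.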

We further apply a linear transformation to all edge representations using the following equation:

$$v_z^{(1)} = \boldsymbol{W}_z e_z, \tag{5}$$

where \(z \in \{MS, MM, ME, SS, ES, SC, CC, CD\}\) and \(\boldsymbol{W}_z\) is a learnable parameter matrix.
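Eq. (5) amounts to one learnable projection per edge type, mapping raw edge features to a common size; the dimensions and dictionary layout below are illustrative assumptions (in the paper, MM and SS edges carry extra distance and attention features and are therefore wider than plain concatenations):

```python
import numpy as np

rng = np.random.default_rng(2)
d_node, d_z = 6, 10  # node-representation size and transformed edge size (assumed)
edge_types = ["MS", "MM", "ME", "SS", "ES", "SC", "CC", "CD"]

# one learnable matrix W_z per edge type; here every raw edge is two concatenated nodes
raw_dim = {z: 2 * d_node for z in edge_types}
W = {z: rng.normal(size=(d_z, raw_dim[z])) for z in edge_types}

def transform_edge(z, e_z):
    """Eq. (5): v_z^(1) = W_z e_z, mapping every edge type to a common size d_z."""
    return W[z] @ e_z

n_m, n_s = rng.normal(size=d_node), rng.normal(size=d_node)
v_ms = transform_edge("MS", np.concatenate([n_m, n_s]))  # an MS edge, Eq. (5) applied
```

Projecting all edge types to the same width is what lets the later walk-aggregation layer combine heterogeneous edges uniformly.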

### Relation extraction based on external knowledge

To utilize external knowledge, we regard any entity in the external knowledge source that also appears in the text as an additional node and connect it to the corresponding entity node in the text. In this paper, we introduce two kinds of knowledge nodes according to the form of external knowledge available for an entity: (1) entity description and (2) knowledge graph.

Suppose that \(e_1\), \(e_2\) and \(e_3\) have external descriptions, and that \(e_1\) and \(e_3\) exist in an external knowledge graph. The graph based on document structure shown in Fig. 3 can then be extended to the graph shown in Fig. 4 by adding knowledge nodes, where \(kd_i\) and \(ks_j\) denote a knowledge node based on entity description and a knowledge node based on the knowledge graph, respectively. In this way, we obtain a graph that makes full use of the available external knowledge.

### Knowledge node representation based on knowledge graph

We deploy a translational distance model, a semantic matching model and a graph model, that is, TransE [12], RESCAL [13] and GAT [14] respectively, to represent knowledge nodes based on the knowledge graph.

TransE assumes that any triple \(\langle h, r, t \rangle\), where \(h\) is a head entity, \(r\) is a relation, and \(t\) is a tail entity, satisfies the hypothesis \(h + r \approx t\), so that the distance between the two entity representations is close to the representation of the relation between them. In this way, a multi-hop relation between two entities can be represented by additive transitivity: if there is a relation \(r_1\) between \(h_1\) and \(t_1\), a relation \(r_2\) between \(t_1\) and \(t_2\), …, and a relation \(r_K\) between \(t_{K-1}\) and \(t_K\), then there is an implicit relation between \(h_1\) and \(t_K\) as follows:

$$h_1 + r_1 + r_2 + \cdots + r_K \approx t_K. \tag{6}$$

The max-margin function with negative sampling is used as the objective function of TransE:

$$L = \sum_{(h, r, t) \in \Delta}\; \sum_{(h', r', t') \in \Delta'} \max\left(f_r(h, t) + \gamma - f_{r'}(h', t'),\; 0\right), \tag{7}$$

where \((h, r, t) \in \Delta\) is a true triplet, \((h', r', t') \in \Delta'\) is a negative triplet obtained by sampling, \(f_r(h, t)\) is the score of \((h, r, t)\), and \(\gamma > 0\) denotes the margin, usually set to 1. Finally, the learned \(h\) is regarded as \(h_{ks}\), the knowledge node representation corresponding to node \(ks\), without considering its type.
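A minimal sketch of the TransE score and the max-margin objective of Eq. (7) for a single positive/negative pair; the embedding size, the choice of L2 norm (L1 is also common), and the toy triple are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, gamma = 4, 1.0  # embedding size (assumed) and margin (gamma = 1 as in the text)

def f_r(h, r, t):
    """TransE score: distance between h + r and t; lower means more plausible."""
    return np.linalg.norm(h + r - t)

def margin_loss(pos, neg, gamma=1.0):
    """Eq. (7) for one pair: max(f_r(h, t) + gamma - f_r'(h', t'), 0)."""
    return max(f_r(*pos) + gamma - f_r(*neg), 0.0)

h, r = rng.normal(size=dim), rng.normal(size=dim)
t = h + r + 0.01 * rng.normal(size=dim)  # a triple that nearly satisfies h + r ≈ t
t_neg = rng.normal(size=dim)             # corrupted tail as the negative sample
loss = margin_loss((h, r, t), (h, r, t_neg), gamma)
```

Minimizing this loss over all sampled pairs pushes true triples closer to satisfying \(h + r \approx t\) than corrupted ones by at least the margin.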

RESCAL captures the latent semantics between two entities through the following bilinear function:

$$f_r(h, t) = h^{T} M_r t. \tag{8}$$

As shown in Fig. 5, RESCAL represents relation triples as a three-dimensional tensor \(\mathcal{X}\), where \(\mathcal{X}_{ijk} = 1\) indicates that \(\langle e_i, r_k, e_j \rangle\) is a true triplet. A tensor decomposition model is used to model the relations implicitly:

$$\mathcal{X}_k \approx A R_k A^{T}, \quad \text{for } k = 1, \ldots, m, \tag{9}$$

where \(\mathcal{X}_k\) is the \(k\)th slice of \(\mathcal{X}\), \(A \in \mathbb{R}^{n \times r}\) contains the latent representations of the entities, and \(R_k \in \mathbb{R}^{r \times r}\) is an asymmetric matrix used to model the latent interactions in the \(k\)th relation:

$$f(A, R_k) = \frac{1}{2} \sum_{i,j,k} \left( \mathcal{X}_{ijk} - \boldsymbol{a}_i^{T} R_k \boldsymbol{a}_j \right)^2, \tag{10}$$

where \(h_{ks}\) is the component of \(A\) corresponding to node \(ks\).
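The RESCAL scoring function (Eq. (8)) and reconstruction objective (Eq. (10)) can be sketched directly; the toy sizes and the random relation tensor are assumptions, and a real implementation would learn \(A\) and \(R_k\) by alternating least squares or gradient descent rather than use random values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, m = 5, 3, 2  # numbers of entities, latent dimensions, and relations (toy sizes)

A = rng.normal(size=(n, r))     # latent entity representations, one row per entity
R = rng.normal(size=(m, r, r))  # one interaction matrix R_k per relation

def score(i, k, j):
    """Eq. (8): bilinear score f_r(h, t) = h^T M_r t for triple <e_i, r_k, e_j>."""
    return A[i] @ R[k] @ A[j]

def reconstruction_loss(X):
    """Eq. (10): squared error between the tensor X and its factorization A R_k A^T."""
    return sum(0.5 * np.sum((X[k] - A @ R[k] @ A.T) ** 2) for k in range(m))

X = rng.integers(0, 2, size=(m, n, n)).astype(float)  # toy 0/1 relation tensor
loss = reconstruction_loss(X)
```

After factorization, the row of `A` for an entity serves as \(h_{ks}\) for the corresponding knowledge node.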

In addition, we also represent the knowledge node \(ks\) by encoding the subgraph centered on that node with GAT.

Based on the knowledge graph, a node \(ks\) is represented as \(n_{ks} = [h_{ks}; t_{ks}]\), where \(h_{ks}\) is the representation obtained from TransE, RESCAL or GAT, and \(t_{ks}\) is the embedding of the knowledge graph node type. The edge between an entity node \(e\) and the corresponding knowledge node \(ks\) is represented as \(e_{EKS} = [n_e; n_{ks}]\), and it is further transformed into \(v_{EKS}^{(1)}\) via a linear transformation:

$$v_{EKS}^{(1)} = \boldsymbol{W}_{EKS} e_{EKS}, \tag{11}$$

where \(\boldsymbol{W}_{EKS}\) is a learnable parameter matrix.

### Knowledge node representation based on description

In this paper, we use the following two methods to obtain knowledge node representations based on entity descriptions:

1. Doc2vec [15] (also called paragraph2vec), inspired by word2vec [16] proposed by Tomas Mikolov, which transforms a sentence or short text into a fixed-length low-dimensional vector representation.

2. An end-to-end neural network, as shown in Fig. 6, which is used to encode the description text of a given knowledge node, called EMB.

Similar to knowledge node \(ks\), a knowledge node \(kd\) based on description is represented as \(n_{kd} = [h_{kd}; t_{kd}]\). The edge between \(kd\) and the corresponding entity node \(e\) is represented as \(e_{EKD} = [n_e; n_{kd}]\) and is further transformed by

$$v_{EKD}^{(1)} = \boldsymbol{W}_{EKD} e_{EKD}, \tag{12}$$

where \(\boldsymbol{W}_{EKD}\) is a learnable parameter matrix.

### Inference

Following KEoG, with the help of the walk aggregation layer [17], a path of length \(2l\) between two entity nodes \(i\) and \(j\) can be represented as

$$f\left(v_{ik}^{(l)}, v_{kj}^{(l)}\right) = \sigma\left(v_{ik}^{(l)} \odot \left(\boldsymbol{W} v_{kj}^{(l)}\right)\right), \tag{13}$$

where \(\sigma\) is the sigmoid activation function, \(\odot\) is element-wise multiplication, and \(\boldsymbol{W} \in \mathbb{R}^{d_z \times d_z}\) is a learnable parameter matrix used to combine two short paths of length \(l\) (the path between \(i\) and \(k\), and the path between \(k\) and \(j\)) into one long path of length \(2l\).

All paths of length \(2l\) from node \(i\) to node \(j\) through intermediate nodes \(k\) are aggregated to form the representation of the edge from node \(i\) to node \(j\) as follows:

$$v_{ij}^{(2l)} = \alpha v_{ij}^{(l)} + (1 - \alpha) \sum_{k \ne i, j} f\left(v_{ik}^{(l)}, v_{kj}^{(l)}\right), \tag{14}$$

where \(\alpha \in [0, 1]\) is a linear interpolation scalar controlling the contribution of edges of length \(l\).
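A minimal sketch of one walk-aggregation step (Eqs. (13) and (14)); the number of nodes, the edge size \(d_z\), \(\alpha\), and the random inputs are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d_z, alpha = 4, 6, 0.7  # number of nodes, edge size, interpolation scalar (assumed)

V = rng.normal(size=(N, N, d_z))  # v_ij^(l): length-l path representations
W = rng.normal(size=(d_z, d_z))   # learnable combination matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def walk_aggregate(V, W, alpha):
    """Eqs. (13)-(14): fuse length-l paths i->k and k->j into length-2l paths i->j."""
    V2 = np.empty_like(V)
    for i in range(len(V)):
        for j in range(len(V)):
            combined = sum(
                sigmoid(V[i, k] * (W @ V[k, j]))  # Eq. (13), element-wise product
                for k in range(len(V)) if k not in (i, j)
            )
            V2[i, j] = alpha * V[i, j] + (1 - alpha) * combined  # Eq. (14)
    return V2

V2 = walk_aggregate(V, W, alpha)  # doubling path length from l to 2l
```

Applying this step repeatedly doubles the reachable path length each time, so a few iterations suffice to connect distant entity pairs in the document graph.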

After obtaining the path representation of any entity pair of interest, we adopt the softmax function as the classifier. As in KEoG, both a cross-entropy loss and a soft F-measure loss are used as parts of the total loss function.