# Improving accuracy of expected frequency of uncertain roles based on efficient ensembling – Applied Network Science

#### BySoshi Naito and Takayasu Fushimi

Aug 12, 2022

This section describes four ensemble methods that sample possible graphs from an uncertain graph and output clustering results. As shown in Fig. 3, four ensemble methods use S possible graphs, ({G_1, ldots , G_S}), sampled from the given uncertain graph ({{mathcal {G}}}). The graph-ensemble method ensembles sampled graphs, generates a weighted graph ({{bar{G}}}), and counts motif-roles from the weighted graph. The vector-ensemble method ensembles role vectors ({{textbf {R}}_1, ldots , {textbf {R}}_S}) obtained from each sampled graph and generates an averaged role matrix (vectors) ({bar{{textbf {R}}}}). The similarity-ensemble method ensembles similarity matrices ({{textbf {C}}_1, ldots , {textbf {C}}_S}) calculated from each role matrix (vectors) and generates an averaged similarity matrix ({bar{{textbf {C}}}}). The cluster-ensemble method ensembles affiliation matrices ({{textbf {H}}_1, ldots , {textbf {H}}_S}) obtained from each similarity matrix and generates an ensembled affiliation matrix ({bar{{textbf {H}}}}), which is a clustering result. When sampling many graphs, i.e., (S simeq 2^L), the ensembled results, ({bar{G}}), ({bar{{textbf {R}}}}), ({bar{{textbf {C}}}}) and ({bar{{textbf {H}}}}) become close to the true results ({{mathcal {G}}}), ({{mathcal {R}}}), ({mathcal C}) and ({{mathcal {H}}}). The details of these existing methods are described in the following subsections.

### Graph-ensemble method

First, we explain the graph ensemble method proposed in our previous study (Naito and Fushimi 2021). The procedure for outputting the similarity matrix ({{bar{C}}}) in the graph ensemble method (hereinafter, the GE method) is shown in Algorithm 1. In the GE method, ensembling is performed on a group of sample graphs ({G_1, ldots , G_S},,G_s = (V, E_s),,E_s subseteq E) to generate an ensembled graph ({{bar{G}}}) (see Algorithm 2):

begin{aligned} {{mathcal {G}}} = {mathop {{Phi }}limits _{G in {{mathcal {G}}}}}(G;,mathrm{Pr}[G]) simeq {mathop {{Phi }}limits _{s=1}}^{S}(G_s;,1/S) = {{bar{G}}}. end{aligned}

Here, ({{bar{G}}} = (V, {{bar{E}}}, {{bar{p}}})) is a weighted graph with weights ({{bar{p}}}(e)=sum _{s=1}^{S}{delta (e in E_s)}/S), which means the sample probability of edge e appearing in S sample graphs, and (delta (cond)) is a Boolean function that returns 1 if the condition cond is True and 0 if it is False.

Next, for an ensembled graph ({{bar{G}}}), we search for connected-triples based on the algorithm of Itzhack et al. (2007) (Algorithm 3). In Algorithm 3, (varGamma (u) = {v; (u,v) in {{bar{E}}} wedge (v,u) in {{bar{E}}}}) at Line 6 stands for a set of adjacent nodes of node u, and ({{bar{varGamma }}}(u)) at Line 7 is a set of nodes searched for in the for-loop at Line 6. That is, (varGamma (u) setminus {{bar{varGamma }}}(u)) at Line 8 represents a set of adjacent nodes of node u that are not searched for at Line 6. Then, in the searched for connected-triples (G^{(m)}), the role of each node is identified and counted in consideration of the weight ({{bar{p}}}(e)). By aligning the number of roles for each node and regarding it as a vector, we construct ((N times R)) role vectors (matrix) ({bar{{textbf {R}}}}) (Algorithm 4). In detail, (1) we represent the presence or absence of 6 edges between the 3 nodes u, v, w of a connected-triple (G^{(m)}) by the 6-bit bit string ({textbf {b}}_u) via the (mathrm{motif2bits}) function; (2) we obtain the role number (i leftarrow mathrm{Rcode}({textbf {b}}_u)) from the dictionary (mathrm{Rcode}), which is a correspondence table between the bit string and the role number; (3) we add an occurrence probability (mathrm{Pr}[G^{(m)}]) to the ith element of the role vector of node u, ({{bar{r}}}_{u, i}), where (mathrm{Pr}[G^{(m)}] leftarrow prod _{e in E^{(m)}}p(e) prod _{e in E setminus E^{(m)}}left( 1-p(e)right)) is the occurrence probability of connected-triple (G^{(m)}) calculated based on the presence/absence of 6 edges and their probabilities of existence. For directed 3-node motifs, by bit-shifting the bit string ({textbf {b}}_u) focused on node u, the bit strings ({textbf {b}}_v, {textbf {b}}_w) focused on the other 2 nodes v, w can be obtained. For motifs with more than 3 nodes, this is not a simple bit shift, but the bit string can be obtained in a similar manner.

After constructing the role vectors of each node, ({bar{{textbf {H}}}}) is output by classifying each node into clusters based on the matrix ({bar{{textbf {C}}}}), whose elements are the similarity between the role vectors. In this method, the ensembled graph ({{bar{G}}}) is obtained by ensembling S graphs with L edges. Let p be the average edge existence probability, where the expected number of edges in each sample graph is pL; accordingly, an ensembled graph can be obtained with O(SpL). For one ensembled graph, the connected three nodes are searched for according to the algorithm of Itzhack et al., and the number of roles of all N nodes is counted. Therefore, as with the computational complexity of the 3-node motif count, when the average degree is ({{bar{d}}}), the ensemble role vectors ({bar{{textbf {R}}}}) is obtained with a computational complexity of (O(N {bar{d}}^2))).

### Vector-ensemble method

The role-vector-ensemble method (hereinafter, VE method) generates an ensembled role vector ({{bar{{textbf {R}}}}}) by averaging the role vector ({{textbf {R}}_1, ldots , {textbf {R}}_S}) obtained from the sample graph (G_s):

begin{aligned} {{mathcal {R}}} = {mathop {{Phi }}limits _{G in {{mathcal {G}}}}}({textbf {R}}_{G};,mathrm{Pr}[G]) simeq frac{1}{S}sum _{s=1}^S {textbf {R}}_s = {{bar{{textbf {R}}}}}. end{aligned}

Then, the cosine similarity ({{bar{{textbf {C}}}}}) is calculated from the obtained ensembled role vector ({{bar{{textbf {R}}}}}). Each node is divided into 1 of K clusters based on the similarity matrix, and ({{bar{{textbf {H}}}}}) is output. When constructing the role vector ({textbf {R}}_s) from each sample graph (G_s), the LINC algorithm (Ma et al. 2019), which is the state-of-the-art technique, is used. The LINC algorithm focuses on the difference (D_{s,s’} = (E_{s} setminus E_{s’}) cup (E_{s’} setminus E_{s})) between the edge sets (E_s) and (E_{s’}) in the two sample graphs (G_s) and (G_{s’}), and only the number of appearances of the roles related to edge (e in D_{s,s’}) whose existence/absence state has changed is updated. Let p be the average edge appearance probability. The expected value of the number of edges that change state is (2L(p-p^2)); hence, it is effective when the uncertainty is small, such as when p = 0.1 or p = 0.9. In this way, S role vectors (matrices) ({{textbf {R}}_1, ldots , {textbf {R}}_S}), each of which is an ((N times R)) matrix, are efficiently calculated and averaged to obtain an ensembled role vector ({{bar{{textbf {R}}}}}). Therefore, if ({{bar{m}}}) is the average number of motif instances including each edge, the ensembled role vector ({{bar{{textbf {R}}}}}) is obtained with a computational complexity (O(S(L(p-p^2){{bar{m}}}+NR))) by the VE method.

### Similarity-ensemble method

In the similarity-ensemble method (hereinafter, SE method), the average of similarity matrices ({{textbf {C}}_1, ldots , {textbf {C}}_S}) calculated from role vectors ({{textbf {R}}_1, ldots , {textbf {R}}_S}) is calculated, and the ensembled similarity matrix ({{bar{{textbf {C}}}}}) is generated:

begin{aligned} {{mathcal {C}}} = {mathop {{Phi }}limits _{G in {{mathcal {G}}}}}({textbf {C}}_{G};,mathrm{Pr}[G]) simeq frac{1}{S}sum _{s=1}^S {textbf {C}}_s = {{bar{{textbf {C}}}}}. end{aligned}

Then, based on the ensembled similarity matrix ({{bar{{textbf {C}}}}}), all of the nodes divided into clusters and ({{bar{{textbf {H}}}}}) are outputted. In the SE method, the number of roles in sample graphs is counted and updated based on the LINC algorithm, as in the VE method. In this way, S similarity matrices ({{textbf {C}}_1, ldots , {textbf {C}}_S}), each of which is an ((N times N)) matrix, are calculated and then averaged. Therefore, the dominant computational complexity of the SE method to obtain the ensembled similarity matrix ({{bar{{textbf {C}}}}}) is (O(SN^2)).Footnote 1

### Cluster-ensemble method

The cluster-ensemble method ensembles the clustering results ({{textbf {H}}_1, ldots , {textbf {H}}_S}) and produces the membership matrix ({bar{{textbf {H}}}}):

begin{aligned} {{mathcal {H}}} = {mathop {{Phi }}limits _{G in {{mathcal {G}}}}}({textbf {H}}_{G};,mathrm{Pr}[G]) simeq {mathop {{Phi }}limits _{s=1}}^{S}({textbf {H}}_{s};,1/S) = {{bar{{textbf {H}}}}}. end{aligned}

Unlike ensembling for supervised classification results, ensembling unsupervised clustering results is a challenging task because the correspondence relationship between obtained clusters is not clear and its degree has to be considered. Therefore, since no ensemble method for clustering results has been established yet, we do not discuss the issue in this article.