# Analysis of ownership network of European companies using gravity models – Applied Network Science

#### ByZsolt Tibor Kosztyán, Ferenc Király and Marcell T. Kurbucz

Aug 29, 2022

The study combines data-driven and model-driven approaches. The employed data-driven methods came from network sciences, such as calculating centralities to identify the key regions in investment and community-based modularity detection to identify communities of regions. On the other hand, the applied gravity model is a frequently used economic model, where the rate of flows, such as migration, trade exchanges—or in this case, the number of investments—has to be modeled.

The employed data-driven approach, in contrast to the traditional model-driven approaches, could not be based on a preliminary research model and the associated research hypotheses. However, clearly defined research purposes and the associated research questions are formulated. In addition, the combination of data-driven and model-driven approaches allows scholars to formulate more specific research questions (RQs) as follows:

1. I.

Methodological research questions:

(text{RQ}_1):

Is it possible to improve link prediction via the proposed GEN null model?

(text{RQ}_2):

Is it possible to improve the derivative network coefficients, such as centralities and modularity values, by the proposed GEN link prediction?

1. II.

Applications of null models:

(text{RQ}_3):

Do administrative (such as country) borders affect investments?

(text{RQ}_4):

How do investments change if distance does not play a role?

(text{RQ}_5):

What kind of EICs can be identified with the GEN null model? Are they stable in time and space?

(text{RQ}_1) and (text{RQ}_2) are derived from (text{A}_1) and (text{A}_2). One of the main goals of this study is to propose a better null model that better predicts the links (i.e., the number of owners) between regions. In a spatiotemporal network, which is the company ownership network, not only the distance but also the economic and technological environment, as well as the financial status of companies, can influence the links between nodes. Therefore, not only the links (see (text{A}_1)(text{RQ}_1)) but also the derived network characteristics, such as centrality and modularity values, can be predicted more accurately (see (text{A}_2)(text{RQ}_2)).

Without underestimating the importance of methodological questions, the interesting questions can be centered around the interpretation of the results (see (text{A}_{3})(text{A}_{4}) and (text{RQ}_3)(text{RQ}_5)). The establishment of a new subsidiary can be considered an investment in which technology and knowledge transfer also take place.

The configuration model of Newman (2010) shows that if links between nodes are concentrated around geographical locations, then the distance between nodes should be considered in null models (Expert et al. 2011). At the same time, distance dependence alone does not explain why administrative boundaries are returned during a module search (see (text{RQ}_3)). In that case, we can rightly assume that other economic, technological, or corporate characteristics also influence the decision of investments. Indeed, while in the European Union, the federalist and sovereignist positions fight with each other in almost all areas of decision-making (Saurugger 2018; Heidbreder 2022), an important question may be whether the administrative borders (here primarily the country borders) play a role.

Nevertheless, using the distance-dependent null model in community detection gives us the opportunity to ask questions about what kind of relationships would develop if distance did not play a role (see (text{RQ}_4)). In addition, by using GEN-based null models, the following question can be answered: how does an EIC change in time and space (see (text{RQ}_5)). Since EICs show regions where the connections between regions are denser than the gravity model predicts if administrative borders are returned as modules, then this indicates that the financial, economic, and technological differences are still decisive in the unified economic area.

Since the gravity model is a classical economic model that follows a model-driven approach, research hypotheses (RHs) can also be stated.

(text{RH}_1):

The links within companies’ ownership networks can be modeled by the distance between regions, the economic and technological properties of the regions, and the financial status of the companies.

(text{RH}_2):

The administrative borders play an important role in the formation of EICs, which are stable in space and time.

The (text{RH}_{1}) determines the groups of indicators involved in the gravity models. The applied GEN-model-based community detection specifies EICs, which mainly reflect the country’s borders. We also assume that these EICs are stable in space and time (see (text{RH}_{2})).

### Data employed

Table 1 shows the list of applied indicators and the data sources.

Note that in the Eurostat database, there is no information about the NUTS 3 GDP data for Iceland (2 regions), Liechtenstein (1 region), Switzerland (25 regions), and the United Kingdom (179 regions); therefore, we used the GDP per capita values for all countries.Footnote 7

To test (text{RH}_{1}), data on companies’ financial status ((m_1-m_{14})), as well as regional economic ((m_{15})) and technological ((m_{16})) indicators and interregional distances ((d_{i,j}))are collected. These variables were treated as independent variables, while the dependent variable was the number of owners of i region companies in the j region ((a_{i,j})).

### Methods employed

#### Network representation of ownership

The parent-daughter relationship between firms was characterized by means of a binary adjacency matrix (mathbf {A}), whose elements are defined as:

begin{aligned} a_{i,j}={left{ begin{array}{ll}1 &{} text {if the }itext {-th company owns the }jtext {-th company}.\ 0 &{} text {otherwise} end{array}right. } end{aligned}

(1)

Because of the difficulties of interpreting aggregations, the rate of ownership is not considered. The adjacency matrix A is further called the company ownership matrix (COM). The database contains the exact geographical location of each company. Moreover, since all our economic and technological indicators were provided at the NUTS 3 level and we also wanted to preserve the anonymity of the companies, we aggregated the data to the NUTS 3 level. However, these data are stored separately to use in link (i.e., the number of ownerships) prediction between NUTS 3 regions. Each settlement was assigned to a NUTS 3 region (county). Companies are assigned to geographic regions by the (mathbf {A}^{text {[mo,NUTS 3]}}) and (mathbf {A}^{text {[da,NUTS 3]}}) incidence matrices, whose elements are defined as:

• (a_{i,j}^{text {[mo,NUTS 3]}}) with element one if the headquarters of the i-th mother company is situated in the j-th NUTS 3 geographic region,

• (a_{i,j}^{text {[da,NUTS 3]}}) with element one if the i-th daughter is situated in the j-th NUTS 3 geographic region,

Therefore, the directed weighted network that defines the number of investment connections between the regions can be defined as:

begin{aligned} mathbf {A}^{text {[NUTS 3]}} = left( mathbf {A}^{text {[da,NUTS 3]}}right) ^T times mathbf {A} times mathbf {A}^{text {[mo,NUTS 3]}},, end{aligned}

(2)

where (mathbf {A}^{text {[NUTS 3]}}) is the (aggregated) company ownership matrix (ACOM). If both the subsidiary (daughter) and the parent company are stated in the same NUTS 3 region, then a self-loop is formated in a NUTS 3 level. (a_{i,j}in mathbf {A}^{[text {NUTS 3}]}) represents the number of owners between NUTS 3 region i and j.

The advantage of this company-to-local transformation is to create the opportunity to analyze connections between regions via yearly cross-sectional analysis.

If we examine several periods, we obtain three-dimensional arrays instead of adjacency matrices, where the third dimension is time (i.e., year). As we address intercounty relations throughout, the NUTS 3 notation is neglected. An adjacent matrix in year t is denoted as (mathbf {A}_t=mathbf {A}^{text {[NUTS 3]}}_t).

#### Applied null models

Null models predict connections between nodes. The most widely applied null model is the random configuration model specified by Newman and Girvan (2004), which calculates the prediction assuming a random graph conditioned to preserve the degree sequence of the original network:

begin{aligned} a_{i,j}sim p^{[NG]}_{i,j}=frac{k^{[out]}_{i} k^{[in]}_{j} }{L}, end{aligned}

(3)

where (k^{[out]}_i=sum _ja_{i,j}), (k^{[in]}_j=sum _ia_{i,j}), and (L=sum _isum _ja_{i,j}). Note that self-loops created during the regional aggregation of the ownership network have to be treated. To this end, Arenas et al. (2008) proposed a multiresolution method called AFG (after the authors, Arenas, Fernandez, and Gomez) by adding r self-loops to each node. This algorithm increases the strength of a node without altering the topological characteristics of the original network, as follows: (mathbf {A}_r) = (mathbf {A}) + r (mathbf {I}), where (mathbf {I}) denotes the identity matrix and r the weight of the self-loops of each node. We used this correction in the case of finding modules; however, this compensation underestimates the self-loops.

The so-called randomized null model presented by Eq. (3) is inaccurate in most real-world networks Liu et al. (2012b). Nevertheless, several community-based detection methods, such as modularity detection, are based on this random configuration model (Newman 2010).

One of the main disadvantages of the randomized null model is that it neglects the distance dependency between nodes (i.e., regions). The following null model can be specified by considering distance dependency and the use of the attractiveness or importance of nodes instead of the sum of incoming or outgoing edges (Barthélemy 2011; Expert et al. 2011):

begin{aligned} a_{i,j}sim p^{[spat]}_{i,j}=gamma left( I^{[out]}_{i}right) ^{alpha } left( I^{[in]}_{j}right) ^{beta }f(d_{i,j}), end{aligned}

(4)

where (I^{[out]}_{i}) ((I^{[in]}_{j})) denotes the importance (or attractiveness) of nodes. (alpha ,beta) are fitting parameters. Since (sum _isum _jp_{i,j}=sum _isum _ja_{i,j}), (gamma =frac{L}{sum _isum _jleft( I^{[out]}_{i}right) ^{alpha } left( I^{[in]}_{j}f(d_{i,j})right) ^{beta }}). The function (f(d_{i,j})) can be directly measured from the data by means of a binning procedure, where the prediction error should be minimized, similar to that used in Expert et al. (2011):

begin{aligned} f(d) = frac{sum _{i,j|d_{i,j}=d}a_{i,j}}{sum _{i,j|d_{i,j}=d}I^{out}_{i} I^{in}_{j}}. end{aligned}

(5)

Note that Eq. (4) is the generalized version of Eq. (3). Additionally, these null models are identical if (alpha =beta =f(d)=1, gamma =1/L). AFG correction can also be used in distance-dependent predictions; however, if (f(d_{ij})ne infty , forall d_{i,j}=0) then all self-loops can be predicted by Eq. (4). It is important to note that Eq. (4) is already a hybrid null model. Since Eq. (3) predicts links solely by network characteristics, such as incoming and outgoing edges, excluding any other influence indicator, which determines the weights of edges between nodes, Eq. (4) already includes the distance dependency in the model. In addition, regression parameters also allow distinguishing the importance of incoming and outgoing edges, which has already different meanings.

Only one step remains to estimate the probability of connections with a gravity model, where (f(d_{i,j})=d^{delta }). Following the notation of gravity models, instead of I, we denote m as the characteristics of nodes (i.e., regions) (Gadár et al. 2018), such as GDP per capita and population. The null model is generalized as follows:

begin{aligned} a_{i,j}sim p^{[grav]}_{i,j}=gamma d_{i,j}^{delta } prod _{v=1}^Nm_{i_{v}}^{alpha _v} m_{j_{v}}^{beta _v}, end{aligned}

(6)

where N is the number of indicators belonging to the nodes. (alpha , beta , gamma , delta) are regression coefficients. Eq. (6) further called this the gravity-based economic null model. If (d_{i,j}ne 0) regression parameters can be estimated via the logarithmic version of GEN (see Eq. (6)):

begin{aligned} log a_{i,j}sim log p^{[grav]}_{i,j}=log gamma + delta log d_{i,j} + sum _{v=1}^N alpha _v log m_{i_{v}} + sum _{v=1}^N beta _v log m_{j_{v}}. end{aligned}

(7)

In this study, (forall m_{i}>0), however, because of the self-loops, (d_{i,i}=0). If there is no exact knowledge about the distances, there are two ways to handle self-loops. One way is to add (1,text {km}) to every distance. In this way, (log (d_{i,i}+1)=0), and Eq. (7) can be solved. Nevertheless, Burger et al. (2009) showed that this correction can distort the estimation; therefore, they suggested solving Eq. (6) by Poisson regression instead of solving Eq. (7) directly. At the same time, the geocoded location of the company exists; therefore, in the aggregation, the average distance is used in NUTS 3 regional self-loops instead of using only one correction value. Note that this average distance can be calculated for every pair of regions; however, this correction had no significant effect and was therefore only used in self-loops.

Since Eq. (7) provides a linear regression model, and all assumptions of regression models, such as normality, homogeneity and independence (i.e., there is no multicollinearity), must be satisfied. To test for multicollinearity, we used the variance inflation factor (VIF).

begin{aligned} VIF_i=frac{1}{1-R_i^2} end{aligned}

(8)

where (VIF_iin [1,infty [) is the variance inflation factor for variable i. (R^2_i) is the coefficient of determination of the regression equation (X_i=alpha _0+alpha _1X_1+dots +alpha _{i-1}X_{i-1}+alpha _{i+1}X_{i+1}+dots +alpha _{n}X_{n}+epsilon).

To reduce the multicollinearity, the greatest VIF should be less than 2.5 ((max _iVIF_i<2.5)) Johnston et al. (2018).

The proposed gravity-based economic null model (GEN, see Eq. (6)) and its logarithmic version (see Eq. (7)) are purely economic models, and there is no network property involved. At the same time, we assume that GEN provides better link predictions than other null models (see (text{A}_1), (text{RQ}_1)). In addition, via better link prediction, an estimated network can also be predicted where the network properties, such as centralities and modularity values, can be calculated. Better link prediction also provides lower prediction error in derived properties (see (text{A}_2), (text{RQ}_2)). At the same time, it must not be forgotten that the GEN is purely an economic model, which thus models not only the formation of edges but also the formation of centralities and modules.

The goodness of fit of null models is determined by how well edges are estimated. Therefore, if there are variable parameters, the absolute differences between the real and predicted edge values must be minimized. Formally:

begin{aligned} min longleftarrow epsilon =||mathbf {A}-mathbf {P}||. end{aligned}

(9)

This section introduces three kinds of null models. Newman and Girvan (2004)’s model considers only network properties during link prediction. While Expert et al. (2011)’s distant dependent null model already considers the spatial dependencies between nodes, it can be considered a hybrid model because spatial and network properties are involved simultaneously. Several other null models can be found in Barthélemy (2011), but to the best of our knowledge, the proposed GEN model is the first, which predicts links based on purely spatial, economic, technological, and corporate financial data but does not employ network-property data.

Note that in the case of GEN, the minimization problem is very similar to the gravity-based economic models. Since in the regression model the square estimation error, while in the case of optimizing null models the absolute difference between the original and the predicted links should be minimized. This similarity provides for the employment of gravity models for link prediction and via link prediction the prediction of the company ownership network.

#### Communities

One of the main applications of null models is to detect communities. Classical modularity optimization-based community detection methods utilize f(C) metrics based on the difference between the internal number of edges and their link prediction (Newman and Girvan 2004; Yang and Leskovec 2015).

begin{aligned} f(C) = text {(fraction of edges within communities)} – text {(expected fraction of such edges)}. end{aligned}

(10)

In the case of the proposed directed network, this difference can be formulated as

begin{aligned} f(C)=frac{1}{L}sum _isum _jleft( a_{i,j}-p_{i,j}right) delta left( C_i,C_jright) , end{aligned}

(11)

where (p_{i,j}) represents the number of estimated ownership relationships from region i to region j and (delta left( C_i,C_jright)) is the Kronecker delta function, which is equal to one if the i-th and j-th regions are assigned.

The modularity of the partition C can be calculated as the sum of the modularities of the (C_c,, ,c=1,ldots ,n_c) communities:

begin{aligned} max longleftarrow M_c=frac{1}{L}sum limits _{(i,j)in C_c}(a_{i,j}-p_{i,j}),. end{aligned}

(12)

The value of the modularity (M_c) of a cluster (C_c) can be positive, negative or zero. Should it be equal to zero, the community has as many links as the null model predicts. When the modularity is positive, the (C_c) subgraph tends to be a community that exhibits a stronger degree of internal cohesion than the model predicts. When specifying modules, Eq. (12) must be maximized. When using randomized null models, the modules specify communities where connections are stronger between members within a community than between members of two distinct communities (Newman 2010). A number of links between nodes are dependent on the distances on a spatial network (Expert et al. 2011); therefore, modules give a set of nodes that are close together in geographical terms; however, if they give larger regional units, such as countries, then other formation forces can also be guessed. Therefore, it is a question to be answered whether modules provide larger regions (see (text{RQ}_3)).

The distance-dependent modules already compensate for the effect of spatial distances between regions. Therefore, the modules can be treated as a module without regional distances. In other words, we can analyze what happens if there are no spatial distances between regions (see (text{RQ}_4)).

In the case of gravity models, modules specify the area of investments (Gadár et al. 2018); we further called economic-investment communities (EICs). EICs specify a set of regions where the strength of investments (modeled by a number of ownerships) are denser than the economic, financial, and technological opportunities, as well as the geographical distances predict. If EICs also give back the administrative boundaries, it indicates that the administrative boundaries are the main formation force in investments (see (text{RH}_{2})), which should be considered at the European Union level.

This study proposes a generalization of gravity null models (GEN). This model also highlights which economic and technological indicators influence the formation of investment areas of regions (see (text{RH}_{1})).

Eq. (12) is typically solved via Louvain’s algorithm (Blondel et al. 2008); however, to increase the stability of the results, the recent Leiden’s algorithm is applied in this study (Traag et al. 2019).

Since a company ownership network (CON) can represent a static network of ownership, if it is important to analyze CON in time, the one way is to specify a multilayer network, where every layer represents a year. Yearly null models deal only with one layer at once; therefore, all predictions can be performed simultaneously. The other way is to use the dynamic network, where edges between nodes are specified within a time frame. However, this model is better in the case of continuous time intervals. While the multilayer network represents a set of yearly static networks, the existing null models can be extended in an easier way to a multilayer network.

Both algorithms can be generalized to multilayer networks, where layers represent a time slice. Thus, the proposed Gravity null models can be used to predict the links in a multilayer network, and modules can specify the yearly EICs.

The whole network formation can be modeled via link prediction; in this way, several network properties, such as centralities, can be estimated. In addition, the formation of these coefficients can be explained, and their changes over time can be predicted.

#### Multilayer network as a discrete model of a spatial-temporal network

A multilayer network is a pair (mathcal {M}=(mathcal {G},mathcal {C})), where (mathcal {G}={G_{alpha }=(V_{alpha },E_{alpha },W_{alpha }),alpha in {1,..,m}}) is a family of (directed or undirected, weighted graphs (called layers of (mathcal {M})), where (V_{alpha }) is the set of vertices (set of nodes), (E_{alpha }) (subseteq V_{alpha }times V_{alpha }) is the set of edges (links, or arcs), and (W_{alpha }:V_{alpha }times V_{alpha }rightarrow mathbb {R}_0^+) is the weight matrix of edges of graph (G_{alpha }) in layer (alpha) and

begin{aligned} mathcal {C}={E_{alpha ,beta }subseteq V_{alpha } times V_{beta }, W_{alpha ,beta }:V_{alpha } times V_{beta }rightarrow mathbb {R}_0^+, alpha , beta in {1,..,m}, alpha ne beta } end{aligned}

(13)

is the set of interconnections between nodes of different layers (G_{alpha }, G_{beta }in mathcal {M}) with (alpha ne beta).

In this study, the set of interconnections is not specified; therefore, it is assumed that (mathcal {C}=emptyset). Note that in the case of spatial and temporal networks, a layer can represent a time slice (i.e., a year), (alpha =t). In addition, the regions are time invariant; therefore, (V_{t}=V, forall t). Only the weights of edges may change over time. Thus, the connections between regions can be estimated separately (see Eq. 14) using a yearly gravity model:

begin{aligned} log a_{i,j,t}sim log p^{[grav]}_{i,j,t}=log gamma _t + delta _t log d_{i,j} + sum _{v=1}^N alpha _{v_{t}} log m_{i_{t,v}} + sum _{v=1}^N beta _{v_{t}} log m_{j_{t,v}}. end{aligned}

(14)

Shifts in regression parameters indicate changes in the role of geographical, economic and technological indicators. The analysis of embeddedness using the multilayer version of centralities indicates shifts in role-player regions.

Finally, analyzing the shifts in modules in time and space indicates the changes in EICs, while calculating modules in a multilayer structure provides time-invariant economic-investment communities.

### Centralities

Centralities are traditionally used as descriptive network properties in network science to identify key nodes (roleplayer) in a network. However, if not only links but also links, the whole network can be predicted, and the centralities can be calculated for the predicted network. In other words, in this way, the centralities are predicted. This prediction offers scholars to analyze which indicators influence a region to become a roleplayer. For this analysis, centralities should be modeled as much as possible (see (text{A}_2)).

Since a directed graph is employed to distinguish the mother-daughter relationships of companies, Only the directed versions and generalized versions of centralities are used, such as in-degree, out-degree, betweenness, in-closeness, out-closeness, authorities, host, and PageRank centralities.

Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has). In the case of a directed network (where ties have direction), we usually define two separate measures of degree centrality, namely, in-degree and out-degree. In-degree is a count of the number of ties directed to the node, and out-degree is the number of ties that the node directs to others. The degree centrality of a vertex v is defined as:

begin{aligned} C_D{v}= & {} text {deg}(v), end{aligned}

(15)

begin{aligned} C_D^+{v}= & {} text {in-deg}(v), end{aligned}

(16)

begin{aligned} C_D^-{v}= & {} text {out-deg}(v). end{aligned}

(17)

In a connected graph, the normalized closeness centrality (or closeness) of a node is the average length of the shortest path between the node and all other nodes in the graph. Thus, the more central a node is, the closer it is to all other nodes.

Closeness is defined by Bavelas (1950) as the reciprocal of farness, that is:

begin{aligned} C_c=frac{1}{sum _wd(v,w)}, end{aligned}

(18)

where d(vw) is the graph distance between vertices v and w. Distances from or to all other nodes are irrelevant in undirected graphs, whereas they can produce totally different results in directed graphs.

Betweenness centrality ((C_B)) quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. Vertices that have a high probability of occurring on a randomly chosen shortest path between two randomly chosen vertices have a high betweenness.

PageRank satisfies the following equation:

begin{aligned} v_i=alpha sum _j a_{ji}frac{v_j}{L(j)}+frac{1-alpha }{N}, end{aligned}

(19)

where

begin{aligned} L(j)=sum _ia_{ji} end{aligned}

(20)

is the number of neighbors of node j. (alpha in [0,1]), where N is the number of nodes.

Hub centrality ((C_H)) and authority centrality ((C_A)) are calculated to obtain the ranking results. The hub value is the centrality of a node in its ability to make a relation with other nodes, while the authority value is the centrality value of a node based on the number of relations to the node.

The Newman and Girvan (2004), Expert et al. (2011) and proposed GEN models predict links and networks; thus, the centralities can be calculated for both the original and predicted networks. The absolute error of the centralities can be calculated as follows:

begin{aligned} epsilon _C=frac{1}{N}sum _v|C(v)-widehat{C}(v)|, end{aligned}

(21)

where C(v) is the centrality measure for vertex v, N is the number of nodes, C is the original, and (widehat{C}) is the predicted centrality measure.

Do not forget that in the case of low (epsilon _C), the GEN-based prediction, which uses purely economic, corporate financial, and technological indicators in the prediction, models centralities indirectly. This model shows which kind of mixture of spatial-economic-financial-technological indicators can increase the role of a region.