# News sharing on Twitter reveals emergent fragmentation of media agenda and persistent polarization – EPJ Data Science

#### By Tomas Cicchini, Sofia Morena del Pozo, Enzo Tagliazucchi and Pablo Balenzuela

Aug 19, 2022

Twitter users sharing links to media content were selected in order to analyze the emergence of structures from their interactions, following an approach similar to that of [3].

### Data

Twitter data was acquired using the official API [33], together with custom developed Python codes. The acquisition process consisted of the following steps:

User selection: live download. Twitter activity was downloaded using the Stream Listener tool of the Twitter API. Politically active users were filtered by keywords (discarding retweets) associated with politicians, electoral alliances and political parties (see the keyword list in Additional file 1) during the 2019 Argentinian primary elections (between August 5th and August 12th). A control group was also selected using keywords associated with the names of the main media outlets in Argentina (also discarding retweets) during the periods from August 29th to September 30th 2019 and from June 4th to July 4th 2020. We collected a similar number of users for both datasets: 38K for 2019 and 35K for 2020.
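The keyword-based filtering of the live stream can be sketched as follows; the keyword set, the tweet dictionary format and the helper name are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of the keyword-based user selection; the keyword set,
# tweet format and function name are illustrative assumptions.
POLITICAL_KEYWORDS = {"macri", "kirchner", "fernandez", "pichetto"}

def is_politically_active(tweet: dict) -> bool:
    """Keep original tweets (no retweets) mentioning a political keyword."""
    if tweet.get("retweeted_status") is not None:  # discard retweets
        return False
    text = tweet.get("text", "").lower()
    return any(kw in text for kw in POLITICAL_KEYWORDS)

stream = [
    {"text": "Gran acto de Macri hoy", "retweeted_status": None},
    {"text": "RT: noticia electoral", "retweeted_status": {"id": 1}},
    {"text": "Hoy llueve en Buenos Aires", "retweeted_status": None},
]
selected = [t for t in stream if is_politically_active(t)]
```

The same filter, with outlet names as keywords, would yield the control group.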

Twitter activity: tweet download. The full Twitter activity corresponding to both datasets was downloaded during two different periods: from August 29th to September 30th 2019 and from June 4th to July 4th 2020. We collected 1,368,914 tweets from politically active users in 2019 and 987,271 tweets in 2020. The control dataset comprised 511,308 tweets in 2019 and 576,137 tweets in 2020.

Embedded URLs in tweets: From the previously downloaded tweets, we kept only those containing embedded URLs.

URL filter: Outlet domains were extracted from the URLs and matched against an Argentinian media database provided by ABYZ News Links [34]. We kept only tweets linking to the twenty major Argentinian media outlets, obtaining 80,811 and 66,688 tweets for the datasets of politically active users, and 31,811 and 41,593 tweets for the control datasets, in 2019 and 2020 respectively.
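A minimal sketch of the domain-extraction step, with a small assumed outlet list standing in for the twenty-outlet database derived from ABYZ News Links:

```python
from urllib.parse import urlparse

# Illustrative URL filter; MAJOR_OUTLETS is an assumed stand-in for the
# twenty-outlet database derived from ABYZ News Links [34].
MAJOR_OUTLETS = {"clarin.com", "lanacion.com.ar", "pagina12.com.ar"}

def outlet_domain(url: str) -> str:
    """Extract the outlet domain from a URL, dropping a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

urls = [
    "https://www.clarin.com/politica/nota.html",
    "https://example.org/otra-nota",
]
kept = [u for u in urls if outlet_domain(u) in MAJOR_OUTLETS]
```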

The bipartite network data is shown in Table 1.

### Methods

#### Bipartite networks and their projections

The complex patterns of news shared by multiple users can be mapped onto bipartite networks following the procedure sketched in [3].

Bipartite networks have two different classes of nodes; in this case, the networks can be projected onto news and user layers. Connections in the news projection indicate co-consumption across users, while in the user projection users are connected by news shared by both of them.

Projections of bipartite networks can be implemented in several ways, for instance following the method developed in [35]. Since projections introduce many noisy edges between nodes of the same layer, that method proposes a null model to determine which edges play a significant role in the monopartite network structure. Based on this idea, we follow a similar approach here, combining a simple projection method [36] with the significance filter introduced in [37]. As in [3], a hyperbolic projection was used to mitigate the influence of highly connected nodes in both layers (see Additional file 1, Sect. 2). The significance filter retains those links whose weights are significant compared to those expected from a null stochastic model that preserves node degree and total strength. The same approach was followed in [38].

Given that projections do not necessarily produce fully connected networks, we kept the largest connected components for further analysis.
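As an illustration of the projection step, the following sketch implements a Newman-style hyperbolic projection onto the user layer, in which a news item shared by d users contributes 1/(d−1) to the edge weight of each pair of those users; the toy data and the omission of the significance filter of [37] are simplifications:

```python
from collections import defaultdict
from itertools import combinations

# Sketch of a Newman-style hyperbolic projection onto the user layer: a news
# item shared by d users contributes 1/(d-1) to each pair of those users,
# damping the influence of widely shared items. Toy data; the significance
# filter of [37] is not reproduced here.
news_readers = {                # news item -> users who shared it
    "n1": {"u1", "u2", "u3"},
    "n2": {"u1", "u2"},
}

weights = defaultdict(float)    # (user, user) -> projected edge weight
for readers in news_readers.values():
    d = len(readers)
    if d < 2:
        continue
    for u, v in combinations(sorted(readers), 2):
        weights[(u, v)] += 1.0 / (d - 1)
```

The news-layer projection is obtained symmetrically, by iterating over users and pairing the news items each user shared.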

#### Network metrics

We mainly focused on the analysis of collective structures and the role of nodes within these structures.

To detect communities in both projections of the bipartite networks we used a Python implementation of the Louvain algorithm [39, 40] based on the optimization of the modularity Q. Due to the stochastic nature of the Louvain algorithm, the obtained community partitions may differ from each other in a comparatively small number of nodes. To obtain a well-defined membership metric of the nodes, we constructed consensus networks [41] which allow the robust assignment of nodes to communities.
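The consensus step can be illustrated as follows: given several partitions (toy inputs below, standing in for repeated Louvain runs), the consensus matrix entry for a node pair is the fraction of runs placing the pair in the same community. The full procedure of [41] additionally thresholds and re-clusters this matrix until the partitions converge:

```python
from itertools import combinations

# Illustration of the consensus-matrix construction of [41]: the entry for a
# node pair is the fraction of runs placing the pair in the same community.
# The partitions below are toy inputs standing in for repeated Louvain runs.
partitions = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 1, "d": 0},
    {"a": 0, "b": 0, "c": 1, "d": 1},
]

nodes = sorted(partitions[0])
consensus = {}
for i, j in combinations(nodes, 2):
    agree = sum(p[i] == p[j] for p in partitions)
    consensus[(i, j)] = agree / len(partitions)
```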

In spite of known limitations of modularity-based community detection for weighted networks [42], the obtained results were robust when compared with other algorithms such as Infomap and Label Propagation, as shown in Additional file 1, Sect. 4, which reports the normalized mutual information between partitions obtained with different methods.

We also analyzed the role of users in the network by means of the participation coefficient and the within-module degree, as defined in [43]. Given a partition in communities, the within-module degree measures how connected a node is within its own community, while the participation coefficient determines how well connected a node is to the other communities of the network.
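A minimal sketch of the participation coefficient of [43], $P_i = 1 - \sum_s (k_{is}/k_i)^2$, on a toy graph; the within-module degree can be computed analogously as a z-score of the intra-community degree:

```python
from collections import Counter

# Sketch of the participation coefficient of [43] on a toy undirected graph:
# P_i = 1 - sum_s (k_is / k_i)^2, with k_is the links of node i to community s
# and k_i its total degree. Graph and partition below are illustrative.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
community = {"a": 0, "b": 0, "c": 1, "d": 1}

def participation(node: str) -> float:
    neighbors = [v for u, v in edges if u == node] + [u for u, v in edges if v == node]
    k = len(neighbors)
    per_community = Counter(community[v] for v in neighbors)
    return 1.0 - sum((k_s / k) ** 2 for k_s in per_community.values())
```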

#### Node metrics

To analyze the properties of emergent structures on both projections of the bipartite networks, we focused on metrics related to semantic content and media outlet membership (for the news projection) and to the user profile of media consumption (for the user projection).

### News projection

When nodes are identified with media content, two features are relevant for our analysis: the media outlet in which the article was published and its semantic content.

### Media outlet

We first classify each news article according to the media outlet where it has been published.

### Semantic content analysis

The following steps were applied to the text of each news article:

• tokenization: each element of the corpus was separated into individual terms, and non-alphanumeric characters and punctuation were removed. All terms were converted to lowercase.

• stopwords filtering: using the Spanish stopwords database provided by nltk [44], the most common (and thus the least informative) words were filtered out.

• term basis construction: a term basis was generated from the set of used terms.

• frequency description: each news article was described by a term frequency vector with entries corresponding to the basis computed in the previous step, i.e., the i-th entry corresponded to the number of times the i-th basis term appeared in the document.

• tf-idf description: to mitigate bias due to excessive contribution of frequently used words and increase the contribution of unusual (but informative) words, the term frequency – inverse document frequency (tf-idf) statistic was computed [45], resulting in the following value for the i-th element of the vector news representation:

$$v_{i} = f_{i} \cdot \log \biggl(\frac{N}{N_{i}} \biggr),$$

(1)

where $f_{i}$ is the frequency of the i-th basis term in the document, $N$ is the number of documents in the corpus and $N_{i}$ is the number of documents in which the i-th term appears.

After these processing steps, the news corpus was described as a matrix $M \in \mathbb{R}^{n \times m}$, with n the number of documents in the corpus and m the number of basis terms.
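Equation (1) can be computed directly on a toy corpus (assumed already tokenized and stopword-filtered); the variable names are illustrative:

```python
import math
from collections import Counter

# Direct computation of Eq. (1), v_i = f_i * log(N / N_i), on a toy corpus
# (assumed already tokenized and stopword-filtered).
docs = [["elecciones", "macri"], ["elecciones", "kirchner"], ["futbol"]]
N = len(docs)

df = Counter()                  # N_i: number of documents containing term i
for doc in docs:
    df.update(set(doc))
vocab = sorted(df)              # term basis

def tfidf(doc):
    f = Counter(doc)            # f_i: term frequency within the document
    return [f[t] * math.log(N / df[t]) for t in vocab]

M = [tfidf(doc) for doc in docs]  # n x m document-term matrix
```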

### Unsupervised topic detection

Starting from this mathematical representation of the corpus, it is possible to detect the main topics (i.e. groups of articles with roughly the same semantic content) by performing Non-negative Matrix Factorization (NMF) [46, 47] on the document-term matrix $M$. NMF factorizes $M$ as the product of two matrices:

$$M \approx N \cdot W, \quad \text{with } N \in \mathbb{R}^{n \times t} \text{ and } W \in \mathbb{R}^{t \times m},$$

(2)

where t is the chosen number of topics, and N and W are the resulting document-topic and topic-term matrices. Both matrices contain only non-negative entries, allowing their straightforward interpretation.

Following the procedure sketched in [48], we define the media agenda as the fraction of articles belonging to each topic.
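Given a document-topic matrix from the NMF step, the agenda computation reduces to assigning each article to its dominant topic and counting fractions; the matrix below is a toy stand-in for actual NMF output:

```python
# Given the document-topic matrix produced by NMF (toy stand-in below), the
# media agenda of [48] is the fraction of articles whose dominant topic is k.
doc_topic = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.6],
    [0.0, 0.5],
]

t = len(doc_topic[0])
assignments = [max(range(t), key=lambda k: row[k]) for row in doc_topic]
agenda = [assignments.count(k) / len(doc_topic) for k in range(t)]
```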

### Sentiment analysis

Sentiment analysis was performed on the sentences of each news article that mentioned candidates of the main political coalitions. For this, we used the algorithm developed in [49], which assigns a label (+1, 0, −1) depending on whether the sentiment of the sentence is positive, neutral or negative, respectively. Using this information, each news article can be characterized by the proportion of negative, neutral and positive sentences towards each of the candidates. We then applied a metric known as the Sentiment Bias statistic (SB, defined in [50]), which is computed as follows.

• In each text, we detected sentences mentioning the names of the candidates of the two main coalitions contesting the 2019 Argentinian national elections (“Cristina”, “Kirchner”, “Alberto” or “Fernandez” from the center-left, and “Mauricio”, “Macri” or “Pichetto” from the center-right).

• We applied sentiment analysis to these sentences and counted the number of positive, negative, and neutral mentions for each of the candidates.

• For each news article, $\#KF_{+}$ ($\#KF_{-}$) stands for the fraction of positive (negative) mentions of Cristina Kirchner or Alberto Fernandez, and $\#MP_{+}$ ($\#MP_{-}$) for the fraction of positive (negative) mentions of Mauricio Macri or Miguel Angel Pichetto; the SB is then defined according to equation (3).

$$SB = (\#KF_{+} - \#KF_{-}) - (\#MP_{+} - \#MP_{-}).$$

(3)

The statistic SB measures the bias towards one of the coalitions. If $SB>0$, the article is biased towards the candidates of the center-left coalition (Cristina Kirchner or Alberto Fernandez) relative to those of the center-right coalition (Mauricio Macri or Miguel Angel Pichetto); if $SB<0$, the bias favors the center-right coalition.
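The computation of Eq. (3) can be sketched as follows, with toy sentence-level labels standing in for the output of the classifier of [49]:

```python
# Sketch of Eq. (3); the sentence-level labels below (+1/0/-1, tagged with the
# coalition mentioned) are toy stand-ins for the classifier output of [49].
mentions = [
    ("KF", 1), ("KF", -1), ("KF", 1),   # Kirchner/Fernandez sentences
    ("MP", -1), ("MP", 0),              # Macri/Pichetto sentences
]

def sentiment_bias(mentions) -> int:
    kf_pos = sum(1 for c, s in mentions if c == "KF" and s == 1)
    kf_neg = sum(1 for c, s in mentions if c == "KF" and s == -1)
    mp_pos = sum(1 for c, s in mentions if c == "MP" and s == 1)
    mp_neg = sum(1 for c, s in mentions if c == "MP" and s == -1)
    return (kf_pos - kf_neg) - (mp_pos - mp_neg)
```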

### User projection

In this projection, users are linked by the news they jointly consume. We quantified the diversity in the media outlets consumed by each user in terms of the following vector:

$$m^{i} = \bigl(m^{i}_{1}, \ldots, m^{i}_{j}, \ldots, m^{i}_{M}\bigr),$$

(4)

where the $m^{i}_{j}$ component indicates the number of news articles from the j-th media outlet shared by the i-th user. Given the heterogeneity in the distribution of news outlets in the corpus, we introduced a corrected version of this metric:

$$\bigl(m^{i}_{c}\bigr)_{j} = m^{i}_{j} \cdot \log \biggl(\frac{N}{N_{j}} \biggr),$$

(5)

where $N$ is the total number of users and $N_{j}$ the number of users sharing news belonging to the j-th media outlet. The factor $\log (\frac{N}{N_{j}} )$ corrects the potential bias caused by outlets whose content is shared by many users.

Given that users can share news from multiple outlets or, conversely, from only one of them, we estimated how diverse each user's behavior was in terms of the shared media outlets using the maximum value of the consumed media vector. This measures the Lack of Diversity (LD) of the user behavior:

$$\mathrm{LD}_{i} = \max_{j \in M} \bigl\{ \bigl(m^{i}_{c}\bigr)_{j} \bigr\},$$

(6)

where M is the total number of media outlets in our data set. After normalization, this lack of diversity lies between 0 and 1.
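Equations (5) and (6) can be sketched on toy data; normalizing LD by the sum of the corrected vector's components is an assumption, since the text does not specify the normalization:

```python
import math

# Sketch of Eqs. (5)-(6) on toy data. N_j is the number of users who shared
# at least one article from outlet j; normalizing LD by the sum of the
# corrected components is an assumption.
shares = {                      # user -> {outlet: number of articles shared}
    "u1": {"clarin": 4, "nacion": 2},
    "u2": {"clarin": 1},
    "u3": {"nacion": 3, "pagina": 1},
}
N = len(shares)
outlets = {o for m in shares.values() for o in m}
N_j = {o: sum(1 for m in shares.values() if o in m) for o in outlets}

def lack_of_diversity(user: str) -> float:
    m_c = {o: cnt * math.log(N / N_j[o]) for o, cnt in shares[user].items()}
    total = sum(m_c.values())
    if total == 0.0:
        return 0.0
    return max(m_c.values()) / total  # normalized to [0, 1] (assumed)
```

A user who shares a single outlet (like "u2") gets LD = 1, while users spreading their shares across outlets get smaller values.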