# Using mobile money data and call detail records to explore the risks of urban migration in Tanzania – EPJ Data Science

May 8, 2022

### The data

This work is underpinned by two key datasets:

1. Pseudonymized transactional data shared by a leading Tanzanian mobile network operator, comprising (i) mobile phone call detail records; and (ii) mobile financial services (mobile money) data, which can be linked to the call records. Using the call records, migrants to Dar es Salaam were identified. From both the call and mobile money data, associated features were engineered and used to (1) measure the differences in mobile money and call activity between those moving to a poorer versus richer subward, and (2) predict the likelihood that a given individual would migrate to an area of deprivation in Dar es Salaam.

2. An extensive street survey administered by the authors to provide ground-truth measurements of deprivation levels across subwards in Dar es Salaam. This data was used, in combination with the call records, to label whether a migrant moved to a poorer or more affluent part of the city, which is the dependent variable in the prediction model.

#### Call records and mobile money transactions

The call data consisted of a log entry for every call made or received in 2014. This data allowed us to track the movement patterns of individuals over time. Call detail records represent the majority of mobile phone activity in Tanzania. Voice calls make up 50% of revenue from mobile devices in Tanzania, compared to just 10% each for data and SMS (the remaining 30% is from mobile financial services) [60]. The call data was pseudonymized before being received, so that individuals were only linkable by a unique identifier. Using these identifiers, the call data could be linked to mobile financial services data from the same commercial provider. The mobile money data consisted of a log entry for every time a customer of the service sent or received money, or checked their balance. The data used in this study covered a total of 800,157,047 call events and 48,435,309 transactions from 27,625 mobile phone subscribers in the Dar es Salaam region over the year 2014 (see Footnote 1). To provide better contextual understanding of the data and findings, the project engaged with local experts on mobile money and migration in Tanzania, and in Dar es Salaam more specifically.

#### Street survey

Dar es Salaam is divided into 452 administrative areas referred to as subwards, which are the lowest formal level of administrative division in the city. The ‘street survey’ data collected in Dar es Salaam consists of these subwards ranked by affluence. Rankings were assigned from 75,078 comparative judgements made by 224 local participants, whom we refer to henceforth as judges.

To collect the data, a participatory approach was used to quantify the knowledge and opinions of local residents on the ground in Dar es Salaam. To carry out the judgements, a web interface was designed so that judges could be shown images of pairs of subwards and asked to compare their affluence. At the start of the survey, judges were asked to identify areas of the city they were familiar with. Then, during the judging process, judges had the option to indicate either (i) which of the two subwards they felt was more affluent, (ii) that the subwards were roughly equal in affluence, or (iii) that they were unfamiliar with at least one of the two subwards (see Footnote 2). Pairs of subwards for each judge were chosen uniformly at random from the list of all possible pairs of subwards with which the judge was familiar. For further information on the methods used for obtaining the ranks from comparative judgements see [61].
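The pair selection step described above can be sketched as follows. This is a minimal illustration only, not the survey software: the function name `sample_pair`, the subward names, and the outcome labels in `OUTCOMES` are hypothetical, and the ranking method itself (see [61]) is not reproduced here.

```python
import itertools
import random

# Hypothetical labels for the responses a judge could give for a pair (a, b).
OUTCOMES = ("first_more_affluent", "second_more_affluent", "roughly_equal", "unfamiliar")


def sample_pair(familiar_subwards, rng):
    """Draw one pair uniformly at random from all unordered pairs of familiar subwards."""
    pairs = list(itertools.combinations(sorted(familiar_subwards), 2))
    if not pairs:
        raise ValueError("a judge needs at least two familiar subwards")
    return rng.choice(pairs)


# Hypothetical subward names that a judge marked as familiar at the start of the survey.
familiar = {"Subward A", "Subward B", "Subward C", "Subward D"}
rng = random.Random(0)
pair = sample_pair(familiar, rng)
```

Enumerating all pairs is adequate at this scale: even a judge familiar with all 452 subwards has only about 100,000 candidate pairs.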

Judges were recruited through word of mouth by students at local universities, NGOs, and via a local taxi driver association. The rationale was to find judges that were citizens of Dar es Salaam with a wide working knowledge of the city’s different subwards. Data was collected in situ over two weeks in August 2018 via 17 data collection sessions each lasting two hours. At the start of each session, judges received a 15 minute training session in English and Swahili, and accompanying written instructions were also provided. Ethical approval for the study and its data collection process was obtained from the Nottingham University Business School ethical review committee, application reference No. 201819072.

### Identifying migration to Dar es Salaam

Before we could utilize the large call and mobile money datasets for our analysis, the data required some cleaning and labelling. The end goal was to label anonymous individuals in the data who we could be fairly certain, given their geo-located and timestamped call data, had migrated to Dar es Salaam in the time frame we were interested in. To make the labelling process more efficient, we first cleaned the data to remove individuals we were certain we were not interested in including in our sample (due to poor quality data, or their data not fitting our definition of migration) using some filtering rules. These rules were carefully constructed after interrogating the data, and were designed to prioritize data quality over data quantity. For example, if an individual had too few mobile interactions either before or after migrating, we did not want to include them in our final sample, as the features engineered (including pinpointing the subward they migrated to) would be inaccurate and produce unreliable indicators of the individual’s actual behaviour or circumstances.

Specifically, we were interested in identifying anonymous individuals, with good quality data, who had moved permanently or semi-permanently to Dar es Salaam in the middle third of the year, from anywhere outside of the Dar region (but still within Tanzania). To eliminate individuals who obviously did not fit this definition, we first mined the call detail records using the following rules:

1. To ensure enough data coverage and temporal stability, anonymized individuals needed to have made or received >10 calls in both the first 6 months of the year and the last 6 months of the year.

2. In the first 6 months of the year <30% of the individual’s calls should have been made/received within Dar es Salaam, and in the last 6 months of the year >70% of calls had to have been made/received within Dar es Salaam. This was to eliminate individuals who might have moved at the start or the end of the year, and thus have inadequate data for a fair comparison of before and after moving.

3. To estimate the potential move date, the date of the individual’s first stay of more than 7 consecutive days in Dar es Salaam was taken, using the cell tower location attached to the call records. This prevented us from capturing people who commuted to Dar es Salaam for work, or who were only visiting for a short stay.

4. After the move date was estimated, the individual had to have made >75% of their calls (from their move date to the end of the year) in Dar es Salaam. This was to remove people from the sample who were only visiting Dar es Salaam on a short-term/temporary basis.

5. Finally, to ensure the potential migrants had sufficient mobile money data for us to analyse and engineer features from, individuals had to have at least 10 mobile money transaction logs (triggered by either receiving or sending money, or checking their balance).
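As a rough illustration, the five rules above could be applied to one individual’s records as in the sketch below. It is written under stated assumptions rather than taken from the study’s code: each call is reduced to a hypothetical `(date, in_dar)` pair, a "stay" is interpreted as an uninterrupted run of Dar es Salaam calls spanning more than 7 days, and all function names are illustrative.

```python
from datetime import date, timedelta

MID_YEAR = date(2014, 7, 1)  # boundary between the first and last 6 months of 2014


def share_in_dar(calls):
    """Fraction of calls made or received within Dar es Salaam."""
    return sum(1 for _, in_dar in calls if in_dar) / len(calls)


def estimate_move_date(calls, min_stay_days=7):
    """Rule 3: start of the first run of Dar calls spanning more than min_stay_days.

    Assumption: any call located outside Dar es Salaam ends the current stay.
    """
    run_start = None
    for day, in_dar in sorted(calls, key=lambda c: c[0]):
        if not in_dar:
            run_start = None
            continue
        if run_start is None:
            run_start = day
        if (day - run_start).days > min_stay_days:
            return run_start
    return None


def candidate_move_date(calls, n_money_logs):
    """Return the estimated move date if all five rules pass, otherwise None."""
    first = [c for c in calls if c[0] < MID_YEAR]
    last = [c for c in calls if c[0] >= MID_YEAR]
    if len(first) <= 10 or len(last) <= 10:                     # rule 1
        return None
    if share_in_dar(first) >= 0.30 or share_in_dar(last) <= 0.70:  # rule 2
        return None
    move = estimate_move_date(calls)                            # rule 3
    if move is None:
        return None
    after = [c for c in calls if c[0] >= move]
    if share_in_dar(after) <= 0.75:                             # rule 4
        return None
    if n_money_logs < 10:                                       # rule 5
        return None
    return move


# Synthetic individual: one call every 5 days, located in Dar from day 155 (5 June) onward.
start = date(2014, 1, 1)
calls = [(start + timedelta(days=d), d >= 155) for d in range(0, 365, 5)]
move = candidate_move_date(calls, n_money_logs=12)
```

In this synthetic case `move` is 5 June 2014; the same individual with fewer than 10 mobile money logs would be rejected by rule 5.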

Using these filters, a sample of 1214 potential migrants to Dar es Salaam was extracted from the 27,625 individuals, along with estimated move dates. Using 3D plots to visually interrogate each individual’s movement patterns, human labellers then judged whether each person’s data fitted our definition of migration, and whether the estimated move date was valid. What constituted a valid case of migration and a valid move date was discussed and agreed prior to labelling. Examples of the two types of 3D plots used in this stage (which were rotatable for the labeller in the provided interface) can be viewed in Fig. 1. Whether the individual had used a cell tower in the subward corresponding to the University of Dar es Salaam’s campus was also visualized, to help interpret mobility patterns which may be linked to university students. All graphs in Fig. 1 except graph A show individuals who visited Dar es Salaam prior to migration, a phenomenon found to be common in the data for people who did not live too far away. Graph B shows an exemplar migrant affiliated with the University of Dar es Salaam. Graph D illustrates how both a broad work and a home location might be identified in the data.

Note that, in addition to the historical nature of the data, to ensure differential privacy was strictly observed we restricted location resolution in Dar es Salaam to one of the 452 subwards (with subwards having an average of approximately 15,000 inhabitants each). Nonetheless, broad movement patterns could still be labelled, with Graph E showing an example of someone who visited their previous home region for an extended period after moving to Dar es Salaam. Graph F suggests the behaviour of a person who was commuting regularly to Dar es Salaam before moving there permanently (but the overlap was deemed too large for the person to be included in our sample). If a single move date could not be confidently determined from the visualisations then the individual was excluded from the sample. In total, 848 of the 1214 instances were labelled with a move date thought to correctly depict when someone had migrated to a Dar es Salaam subward on a permanent/semi-permanent basis (see Footnote 3). This subsample of 848 urban migrants was used for the remainder of the analysis. While this sample is relatively small (due to the limitations imposed by the data coverage and our working definition of migration), this work provides a first look at a well-defined subgroup of urban migrants to Dar es Salaam, a subgroup that is expected to be much larger in practice. The data challenges are considered in more detail in the Discussion.

### Engineering the dependent variable

The classification of whether an individual migrated to a deprived or a more affluent area within the city was engineered using the street survey of Dar es Salaam in combination with the call records. We first used the call data to estimate where in Dar es Salaam a person’s new ‘home subward’ was, and then linked this to the affluence rankings of the subwards, as derived from our surveyed comparative judgement ground truths (see [61] and Footnote 4).

The area someone moved to in Dar es Salaam was estimated using the call records and cell tower location data. First, each individual’s ‘home tower’ was identified: the cell tower with which a person made the most calls at night time (specified as between the hours of 8 pm and 8 am; see Footnote 5). Then, the destination subward was simply defined as the subward within which the home tower lay. 602 towers provide network coverage across the 452 subwards in Dar es Salaam. Figure 2(c) illustrates the coverage of cell towers across the subwards. To derive a final binary outcome variable, we then calculated whether someone moved to a poorer area (the 50% most deprived subwards; see Footnote 6) or a more affluent area (the 50% least deprived subwards). Migrants whose destination subward was in the district of Temeke (the southernmost region of Dar es Salaam) were removed due to reduced network coverage (see Fig. 2(c)), historically different network governance, and the less urban nature of the district. This reduced our modelling sample to 630 individuals, of which 230 (36.5%) moved to more deprived areas and 400 (63.5%) moved to more affluent areas. The spatial distribution of the dependent variable can be viewed in Fig. 2(b), along with a heatmap of how the migrants were distributed across the city (Fig. 2(d)).
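A minimal sketch of the home tower and outcome labelling logic described above is given below. The tower identifiers, the tower-to-subward lookup, and the rank values are hypothetical, and the convention that rank 1 denotes the most affluent subward is an assumption made for illustration.

```python
from collections import Counter
from datetime import datetime

N_SUBWARDS = 452


def is_night(ts):
    """True for timestamps between 8 pm and 8 am."""
    return ts.hour >= 20 or ts.hour < 8


def home_tower(calls):
    """Tower used for the most night-time calls; calls are (timestamp, tower_id) pairs."""
    counts = Counter(tower for ts, tower in calls if is_night(ts))
    return counts.most_common(1)[0][0] if counts else None


def moved_to_poorer_area(subward, affluence_rank, n_subwards=N_SUBWARDS):
    """True if the subward falls in the 50% most deprived (assumes rank 1 = most affluent)."""
    return affluence_rank[subward] > n_subwards / 2


# Hypothetical post-move calls for one migrant.
calls = [
    (datetime(2014, 9, 1, 22, 0), "T1"),
    (datetime(2014, 9, 2, 23, 0), "T1"),
    (datetime(2014, 9, 2, 13, 0), "T2"),
    (datetime(2014, 9, 3, 14, 0), "T2"),
    (datetime(2014, 9, 3, 15, 0), "T2"),
    (datetime(2014, 9, 4, 6, 30), "T3"),
]
tower_to_subward = {"T1": "Subward A", "T2": "Subward B", "T3": "Subward C"}  # hypothetical
affluence_rank = {"Subward A": 300, "Subward B": 40, "Subward C": 120}       # hypothetical

tower = home_tower(calls)
destination = tower_to_subward[tower]
label = moved_to_poorer_area(destination, affluence_rank)
```

Here the daytime tower `T2` has the most calls overall, but `T1` is chosen because only night-time calls are counted.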

### Engineering the independent variables

As potential indicators of vulnerable migration, a total of 110 candidate features (\(K\)) were engineered by aggregating cell phone data, mobile money data, and open source data sources [29]. A full list of the candidate features, which analysis they were used for, and their descriptions can be found in Additional file 1. Different versions of features were engineered by: (1) using only the data before the person moved (\(K = 93\)); and (2) using only data after they had moved (\(K = 17\)). These two sets of features were used in separate analyses: data from after moving to statistically analyse the social and economic differences between those who moved to a poorer versus richer subward; and data from before moving to predict whether an individual would migrate to a poorer or more affluent area of Dar es Salaam. As part of the modelling analysis, feature selection methods (outlined in more detail below) were applied to reduce the number of candidate features.

Features from the cell phone data were engineered to reflect social connectedness, as well as existing ties and connectivity with Dar es Salaam (‘pull factors’). Examples of these features include: the entropy of the numbers called; average calling distance; the percentage of calls made to Dar es Salaam (before moving); the affluence of the area most commonly called in Dar es Salaam (before moving); and whether the individual had visited Dar es Salaam prior to moving there. Entropy features were calculated using Shannon entropy \(H(X)\) with the natural logarithm:

$$H(X) = -\sum_{i=1}^{n} p(x_{i}) \ln p(x_{i}).$$
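For example, the entropy of the numbers called could be computed as in the sketch below, where \(p(x_{i})\) is taken to be the share of an individual’s calls that went to contact \(x_{i}\). The function name and the contact identifiers are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (natural logarithm) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical pseudonymized contacts called by two individuals.
spread_calls = ["a", "b", "a", "b"]    # calls split evenly across two contacts
single_contact = ["a", "a", "a", "a"]  # all calls to one contact

h_spread = shannon_entropy(spread_calls)
h_single = shannon_entropy(single_contact)
```

An individual whose calls are split evenly over two contacts has entropy \(\ln 2\), while one who only ever calls a single contact has entropy zero.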

Features from the mobile money data were engineered as potential proxies for an individual’s financial situation. Examples of these features include: whether the person had a mobile money account before moving to Dar es Salaam; their mean mobile money account balance; the average amount of money paid into the account per day; the average amount of money spent per day; the amount paid out in bills; and the amount sent/received in person-to-person transfers.

Features about the region a migrating individual originated from were extracted from open source data [29]. These features served as proxies for the level of deprivation an individual was migrating from, representing the strength of migratory ‘push factors’. These regional variables covered a wide range of domains including: human development indices, poverty, education, gender inequality, female representation in parliament, health, and population demographics [29]. Once aggregate features for each anonymized migrant had been constructed and attached to the deprivation level of the subward they migrated to, all other call and mobile money data were expunged from the study.

### Modelling

In order to reduce the effects of excessive multicollinearity and the curse of dimensionality, both expected to be present due to the types of features constructed, two pre-processing steps were undertaken on the 93 features engineered for modelling (on data before individuals migrated). First, input features which were highly correlated with one another (\(\text{Pearson } r > 0.85\)) were eliminated or averaged. Subsequently, features which had a Pearson correlation with the dependent variable of less than a fixed cut-off were removed. We note that such an approach risks removing features that, while having no direct relationship with the dependent variable, do in fact have a relationship with it when considered in combination with other features. Acknowledging this, we varied the cut-off for the correlation between the input features and the dependent variable from 0.05 to 0.07 (effectively as an additional meta-parameter for all models in the machine learning pipeline described below). As varying this parameter led to no discernible decrease in predictive performance (demonstrating the limited utility of the features dropped with the higher threshold), in the remainder of this paper, for clarity, we consider only models where the cut-off was set to 0.07. After applying this pre-filter, the number of candidate features for modelling was reduced from 93 to 15, all engineered on data prior to an individual’s migration to Dar es Salaam. A list of these features can be found in Additional file 1.
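The two pre-filtering steps can be sketched as below. This is a simplified illustration under assumptions: correlations are compared in absolute value, a feature that is highly correlated with an already-kept feature is dropped rather than averaged, and the feature names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def prefilter(features, target, pair_cutoff=0.85, target_cutoff=0.07):
    """Step 1: drop features highly correlated with one already kept.
    Step 2: drop features only weakly correlated with the dependent variable."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= pair_cutoff for k in kept):
            kept.append(name)
    return [k for k in kept if abs(pearson(features[k], target)) >= target_cutoff]


# Made-up data: f_b duplicates f_a up to scale, f_c is uncorrelated with the target.
target = [0, 0, 1, 1, 0, 1, 1, 0]
features = {
    "f_a": [1, 2, 5, 6, 2, 7, 6, 1],
    "f_b": [2, 4, 10, 12, 4, 14, 12, 2],
    "f_c": [1, 2, 1, 2, 2, 1, 2, 1],
}
selected = prefilter(features, target)
```

Only `f_a` survives: `f_b` is removed in the first step and `f_c` in the second.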

A machine learning pipeline was then built to predict whether someone ends up in a poorer or more affluent subward. The pipeline consisted of imputation of missing data in the independent variables (see Table 1 for missing data information) using multivariate imputation [66], data scaling, recursive feature elimination (see Footnote 7), and the training of a classification model. Three classes of classification algorithm were evaluated: logistic regression, decision trees, and random forests, all chosen for their interpretable variable importance outputs. Hyper-parameters used within the pipeline were selected using 10-fold cross-validation on an 80% subsample of the data (\(N = 504\)), with the cross-validation procedure splitting this sample repeatedly into training and validation sets. The models were then re-fit on the full 80% sample using the selected hyper-parameters. The remaining 20% of the data (\(N = 126\)) was used as an unseen test set to evaluate the generalised predictive performance.
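A pipeline of this shape could be assembled with scikit-learn roughly as follows. This is a sketch under assumptions, not the authors’ implementation: the data are synthetic, only the logistic regression variant is shown, the hyper-parameter grid is invented, and the choice of `IterativeImputer` for the multivariate imputation step and of a stratified split are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 630 individuals, 15 pre-filtered features (not the study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(630, 15))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=630) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # remove about 5% of values to exercise imputation

# Hold out 126 individuals (20%) as the unseen test set; 504 remain for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=126, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("impute", IterativeImputer(random_state=0)),       # multivariate imputation
    ("scale", StandardScaler()),                        # data scaling
    ("select", RFE(LogisticRegression(max_iter=1000))), # recursive feature elimination
    ("clf", LogisticRegression(max_iter=1000)),         # classification model
])

# Invented grid, for illustration only.
param_grid = {
    "select__n_features_to_select": [5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

# 10-fold cross-validation on the 80% sample; the best setting is then
# re-fit on the full 80% sample (GridSearchCV's default refit behaviour).
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
```

Placing imputation, scaling and feature elimination inside the `Pipeline` means they are re-fit within each cross-validation fold, so no information from a validation fold leaks into the pre-processing.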