# Leveraging change point detection to discover natural experiments in data – EPJ Data Science

#### By Yuzi He, Keith A. Burghardt and Kristina Lerman

Sep 3, 2022

We demonstrate the accuracy and robustness of our method on data from a variety of domains. We first apply it to synthetic data to evaluate the method’s performance and its robustness with respect to noise, then apply it to real-world data to discover changes corresponding to external events. Finally, we illustrate how leveraging regression discontinuities around the newly-discovered changes enables us to estimate the effects of events and policies.

### Discovering changes in synthetic data

#### Synthetic “chessboard pattern”

In this experiment, we generate two-dimensional numeric data in a chessboard pattern, with two features \(x_{1}\) and \(x_{2}\), each in the range \([0, 1]\), as shown in Fig. 1. At time \(t_{0}\), data points spread uniformly at random within the blue squares of an \(n_{c}\times n_{c}\) chessboard move to the orange squares of the chessboard. Mathematically, for an \(n_{c}\times n_{c}\) chessboard, the generated data satisfy the following condition:


\[
\bigl( \lfloor n_{c} x_{1} \rfloor + \lfloor n_{c} x_{2} \rfloor \bigr) \bmod 2 = \mathbb{1}(t > t_{0}),
\tag{9}
\]

where \(\lfloor \cdot \rfloor\) denotes the floor function and \(\mathbb{1}(\cdot)\) the indicator function.

For the first part of this experiment, we set \(t_{0} = 0.5\) and the data size \(N = 8\)K. We use different arrangements of the chessboard, \(n_{c} \in \{2, 4, 6, 8, 10\}\). For higher \(n_{c}\), the data is grouped into smaller squares with fewer data points per square. For the second part of this experiment, we fix \(n_{c} = 6\) (a six-by-six chessboard) and vary \(t_{0}\) between 0.2 and 0.8.
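The generation process above can be sketched as follows. This is an illustrative reconstruction, not the authors’ code: we rejection-sample points until the parity of their chessboard square matches the regime implied by their timestamp, per Eq. (9).

```python
import numpy as np

rng = np.random.default_rng(0)

def chessboard_data(n=8000, n_c=6, t0=0.5):
    """Sample (t, x1, x2) so that points lie on one set of chessboard
    squares before t0 and on the complementary set after t0."""
    t = rng.uniform(0, 1, n)
    x1 = rng.uniform(0, 1, n)
    x2 = rng.uniform(0, 1, n)
    target = (t > t0).astype(int)                     # required square parity
    parity = (np.floor(n_c * x1) + np.floor(n_c * x2)) % 2
    bad = parity != target
    # Rejection sampling: redraw positions until parity matches the regime.
    while bad.any():
        x1[bad] = rng.uniform(0, 1, bad.sum())
        x2[bad] = rng.uniform(0, 1, bad.sum())
        parity = (np.floor(n_c * x1) + np.floor(n_c * x2)) % 2
        bad = parity != target
    return t, x1, x2
```

Each redraw succeeds with probability about one half, so the loop terminates quickly even for large samples.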

We run our method and the comparison algorithms six times each on random data splits. For the optimal segmentation methods, we randomly sample 70% of the data in each trial. Due to computational limitations, we sample only 18.8% of the data (around 1.5K points) for Bayesian change point detection.

The results are shown in Table 1. In the tables and figures, μ and σ denote the estimated mean and standard error of the parameters, respectively. For our method, α represents the fraction of data changed. For small \(n_{c}\), the optimal segmentation methods perform as well as ours, but for \(n_{c} \ge 6\) our method outperforms the competing methods. Of the two classifiers used by our method, random forest performs better.
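The core classifier-based idea can be sketched as follows. This is our simplified reading, not the full MtChD algorithm (which also estimates α): for each candidate change time, label points as before/after and measure how far a classifier’s cross-validated accuracy rises above the majority-class baseline; the candidate maximizing this accuracy deviation is the estimated change point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def accuracy_deviation(X, t, t_cand, clf=None):
    """Cross-validated accuracy of predicting before/after t_cand from
    features X, minus the majority-class (chance) baseline."""
    clf = clf or RandomForestClassifier(n_estimators=100, random_state=0)
    y = (t > t_cand).astype(int)
    baseline = max(y.mean(), 1 - y.mean())   # chance level for an unbalanced split
    acc = cross_val_score(clf, X, y, cv=3).mean()
    return acc - baseline

def estimate_change_point(X, t, candidates):
    """Pick the candidate change time with the largest accuracy deviation."""
    devs = [accuracy_deviation(X, t, c) for c in candidates]
    return candidates[int(np.argmax(devs))]
```

A candidate far from the true change point mislabels part of the data, capping the achievable accuracy, which is why the deviation peaks at the true change.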

#### Synthetic images

Our method can also identify changes in diverse high-dimensional data, such as text [18] and images. To illustrate this, we generate a series of synthetic 64 by 64 pixel gray-scale images that qualitatively change at \(t_{0}=0.5\) from solid to hollow circles (Fig. 2). These images could represent, for example, organisms that were initially alive and then died; the task would then be to determine the moment an organism died, a question central to survival analysis [37]. The gray-scale value of the solid and hollow circles is \(\gamma = 0.8\) and that of the background is \(\gamma = 0.2\). To create more realistic data, we position the circles randomly within the image and inject different levels of Gaussian noise to model poor-quality data. After adding noise, pixel gray-scale values are truncated to the range \([0.0, 1.0]\). We also assign each image a random time t uniformly distributed between 0 and 1. For each noise level, we generate a dataset of 4,000 images.
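A minimal sketch of this image generator, assuming a circle radius and ring thickness not specified in the text (both chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def circle_image(hollow, noise_sd, size=64, radius=10, thickness=2):
    """64x64 gray-scale image: circle at gamma=0.8 on a gamma=0.2 background,
    solid before the change and hollow after, at a random position."""
    cx, cy = rng.integers(radius + thickness, size - radius - thickness, 2)
    yy, xx = np.mgrid[:size, :size]
    r = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
    mask = (np.abs(r - radius) < thickness) if hollow else (r < radius)
    img = np.full((size, size), 0.2)
    img[mask] = 0.8
    img += rng.normal(0, noise_sd, img.shape)    # Gaussian noise injection
    return np.clip(img, 0.0, 1.0)                # truncate to [0, 1]

def image_dataset(n=4000, t0=0.5, noise_sd=0.1):
    """Images with random uniform timestamps; hollow circles appear after t0."""
    t = rng.uniform(0, 1, n)
    imgs = np.stack([circle_image(ti > t0, noise_sd) for ti in t])
    return t, imgs
```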

We check the robustness of the estimated change point against noise. Table 2 shows the inferred change point and the estimated value of α as a function of noise for the synthetic image data. Owing to the spatial correlation of image data and the superior predictive power of the CNN classifier, the inferred change point is close (often not statistically significantly different) to the true change point, and α is close to 1.0, even for very noisy images. Alternative methods were infeasible because of the high dimensionality and large size of the data.

### Discovering changes in real-world data

We now demonstrate the ability of MtChD to identify changes in real-world data.

#### Covid-19 air quality

We first apply our method to air pollution data to see whether pollution dropped around the onset of the COVID-19 pandemic. We collected daily air quality data from January 1 to May 26, 2020 for major U.S. cities from AQICN (aqicn.org). This data includes daily concentrations of nitrogen dioxide, carbon monoxide, and fine particulates smaller than 2.5 microns across (PM2.5), totalling 4.3K observations for 37 cities across the U.S. after missing data are removed. We also include the population within 50 km of each city as a feature, because people within this area may have contributed to the concentration of pollutants. We use our model to determine when the change started and compare the results to a gold standard: the dates on which states issued stay-at-home orders. These orders limited business and commercial activity, which likely led to the dramatic decline in pollution, and therefore serve as ground-truth external events for RDDs. The earliest such order was announced in California on March 19, 2020 and the latest in South Carolina on April 7.

We compare our method to state-of-the-art algorithms in Table 3. Our method is the only one that inferred a reasonable change point for this data, March 21, 2020 ± 3 days, roughly in the middle of the state stay-at-home orders. We show the accuracy deviation for MtChD in Fig. 3. A random forest classifier gives better accuracy than the MLP, and the mathematical model fits the accuracy deviation well. Although our method can work with any classifier, performance on a given dataset can be improved by choosing a classifier that best fits the data. Empirical ways to choose a classifier are (a) selecting the one that gives the largest accuracy deviation, or (b) selecting the one that gives the highest α.

#### Khan Academy

As a second example, we apply our method to the learning platform Khan Academy (khanacademy.org), which offers courses on a variety of subjects in which students watch videos and test their knowledge by answering questions. The Khan Academy platform underwent substantial changes to its user interface around April 1, 2013 (or \(1.3648\times 10^{9}\) in Unix epoch time) [38], which affected user performance. This change acts as the ground-truth event we want to detect. After discovering this event, we can regress scores before and after it and determine, via an RDD, whether this policy significantly changed student performance.

Data was collected by Khan Academy over the period from June 2012 to February 2014 and contains 16K questions answered by 13K students, totalling 681K data points. Despite the large number of students, the data is very sparse: the vast majority of students were active for less than 20 minutes and never returned to the site. The performance data record whether a student solved a problem correctly on the first attempt and without a hint. When a student failed, they could attempt the problem again, and the number of attempts made on a problem is recorded. Additional features include the time since the previous problem, the number of problems in a student session, and the number of sessions. The segmentation methods implemented in ruptures are not memory efficient, so we sample only 0.5% of the data (about 3.5K entries) uniformly at random. For Bayesian change point detection, we sample around 1.6K data points uniformly at random.
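The optimal segmentation baselines minimize a within-segment cost. As a simplified stand-in for the segmentation routines in ruptures (not the library’s actual implementation), a single change point under an l2 cost can be found by brute force:

```python
import numpy as np

def l2_change_point(signal):
    """Return the split index k minimizing the total within-segment squared
    deviation from the segment mean -- the l2 cost with one change point."""
    n = len(signal)
    best_k, best_cost = None, np.inf
    for k in range(2, n - 1):            # keep both segments non-trivial
        left, right = signal[:k], signal[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```

This brute-force scan is O(n²) in time, which hints at why such methods require subsampling on large datasets like this one.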

Both our method and the optimal segmentation algorithms identify the change from user performance data (Table 3), although the optimal segmentation algorithms have larger error. Bayesian change point detection does not give a reasonable change point for this data. The accuracy deviation curve is shown in Fig. 4. The random forest and MLP classifiers perform comparably when used to estimate the change point.

### Measuring effects of changes via regression discontinuity design

We demonstrate how we can use regression discontinuity design to measure the effects of changes on the population. Automatically discovered changes can therefore help uncover potential natural experiments in data.

#### Persistence and performance in learning on Khan Academy

Our analysis uncovered an abrupt change around April 2013 in the Khan Academy data (Sect. 4.2.2). The change affected user performance only in a fraction of all sessions, quantified by the parameter α in Table 3. This change was likely due to a major redesign of the platform’s user interface [38], although we do not know exactly what changed. We found no indication that the population was any different before and after the change. Therefore, the April 2013 change can be used for an RDD, with some users “assigned” quasi-randomly to visit the platform before the interface change and others after. This creates an effective control condition (before the change) and treatment condition (after the change). The external event allows us to control for some of the confounders when investigating correlates of performance on learning platforms. Specifically, comparing the treated group to the control group helps identify the link between persistence (working longer on problems first answered incorrectly) and performance (answering a problem correctly on the first attempt).

Figure 5(a) shows average performance over time, measured as the fraction of problems a user solved correctly on the first attempt. Performance decreases gradually for all users over the two-year period (blue line), despite seasonal variation. However, for users working on problems that take more than 100 seconds to answer, i.e., hard problems, performance increases after the change (orange line). To estimate the effect of the change, we binned the data and fit the outcomes before and after the change as functions of time using two kernel models (see Appendix A.2 for details). The effect is strongest in users who solve hard problems correctly on their first attempt (Fig. 5(b)). At the same time, users became more persistent, i.e., more likely to continue working on a problem they did not solve correctly on the first attempt (Fig. 6(a)). The effect is larger for users working on hard problems (Fig. 6(b)). Thus, the change had two effects: it made users working on hard (to them) problems more persistent, and this improved their performance on other hard problems, i.e., made them more likely to correctly solve these problems on the first try. The improvement in performance for these users was large (10%), which corresponds to a full letter grade in a class setting. Psychological studies have identified traits, such as conscientiousness or grit, that allow some people to practice a skill so as to achieve mastery [39]. Our study supports the link between persistence and improved performance.
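The RDD effect estimate can be sketched with two local linear fits, a simplification of the kernel models in Appendix A.2 (the bandwidth here is illustrative, not the paper’s): fit the outcome against time separately on each side of the change and take the difference of the two fits at the change point.

```python
import numpy as np

def rdd_effect(t, y, t0, bandwidth=0.1):
    """Difference of two linear fits evaluated at t0, each fit on points
    within `bandwidth` of the change -- a simple local-linear RDD estimate."""
    pre = (t >= t0 - bandwidth) & (t < t0)
    post = (t >= t0) & (t <= t0 + bandwidth)
    coef_pre = np.polyfit(t[pre], y[pre], 1)     # slope, intercept before
    coef_post = np.polyfit(t[post], y[post], 1)  # slope, intercept after
    return np.polyval(coef_post, t0) - np.polyval(coef_pre, t0)
```

Restricting both fits to a narrow window around the change is what lets the discontinuity, rather than long-run trends, drive the estimate.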

#### Covid-19 lockdowns reduced air pollution

We detect a change on Mar. 21, 2020 in the COVID-19 air quality data (Sect. 4.2.1). The change is consistent with the dates of the COVID-19 lockdown orders in the US, under which people had to stay at home to reduce the spread of the disease. We calculate the change in nitrogen dioxide levels before and after Mar. 21, 2020, as shown in Fig. 7. In both Manhattan and San Francisco, nitrogen dioxide levels drop significantly (by around 5 ppb) after the lockdown. The reduction in air pollution is due to reduced traffic after the lockdown. Our findings on the date and effect of the change are confirmed by Venter et al. [40].