We demonstrate the accuracy and robustness of our method on data from a variety of domains. We first apply it to synthetic data to evaluate the method's performance and robustness with respect to noise, then apply it to real-world data to discover changes corresponding to external events. Finally, we illustrate how leveraging regression discontinuities around the newly discovered changes enables us to estimate the effects of events and policies.

Discovering changes in synthetic data

Synthetic “chessboard pattern”

In this experiment, we generate two-dimensional numeric data in a chessboard pattern, with two features \(x_{1}\) and \(x_{2}\), each in the range \([0, 1]\), as shown in Fig. 1. At time \(t_{0}\), data points spread uniformly at random within the blue squares of an \(n_{c}\times n_{c}\) chessboard move to the orange squares of the chessboard. Mathematically, for an \(n_{c}\times n_{c}\) chessboard, the generated data satisfy the following condition:


\[
(\lfloor n_{c} x_{1} \rfloor + \lfloor n_{c} x_{2} \rfloor) \bmod 2 = \mathbb{1}(t > t_{0}),
\tag{9}
\]

where \(\mathbb{1}(\cdot)\) is the indicator function.

For the first part of this experiment, we set \(t_{0} = 0.5\) and the data size \(N = 8\)K. We use different arrangements of the chessboard, \(n_{c} \in \{2, 4, 6, 8, 10\}\). For higher \(n_{c}\), the data are grouped into smaller chessboard squares with fewer data points per square. For the second part of this experiment, we fix \(n_{c} = 6\) (a six-by-six chessboard) and vary \(t_{0}\) between 0.2 and 0.8.
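To make this setup concrete, the following minimal sketch generates data satisfying Eq. (9) for even \(n_{c}\); the seed and the way squares are selected are illustrative assumptions, not necessarily the exact procedure used in our experiments.

```python
import numpy as np

def chessboard_data(n_c=6, t0=0.5, n=8000, seed=0):
    """Sample points obeying Eq. (9): (floor(n_c*x1) + floor(n_c*x2)) mod 2 = 1(t > t0)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, n)
    parity = (t > t0).astype(int)             # 0 = "blue" squares, 1 = "orange" squares
    i = rng.integers(0, n_c, n)               # row index of the chessboard square
    # Force the column parity so that (i + j) mod 2 equals the desired parity
    # (this indexing assumes an even n_c, as in all our arrangements).
    j = (parity - i) % 2 + 2 * rng.integers(0, n_c // 2, n)
    x1 = (i + rng.uniform(0.0, 1.0, n)) / n_c  # uniform position inside the square
    x2 = (j + rng.uniform(0.0, 1.0, n)) / n_c
    return np.column_stack([x1, x2]), t
```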

Figure 1

Illustrations of synthetic data (a), where observations have two features \(x_{1}\) and \(x_{2}\). In (b) and (c), blue dots represent data points which satisfy \(t \leq t_{0}\) and orange dots are for \(t > t_{0}\). (b) and (c) are for \(n_{c}=2\) and \(n_{c}=6\), respectively. For fixed data size N, as \(n_{c}\) increases, the number of data points in each square decreases

We repeat our method and the comparison algorithms 6 times on random data splits. For the optimal segmentation methods, we randomly sample 70% of the data in each trial. Due to computational limitations, we sample only 18.8% of the data (around 1.5K points) for Bayesian change point detection.

The results are shown in Table 1. In the tables and figures, μ and σ are the estimated mean and standard error of the parameters, respectively. For our method, α represents the fraction of data changed. We see that for small \(n_{c}\), optimal segmentation methods perform as well as ours, but for \(n_{c} \ge 6\), our method outperforms the competing methods. Of the two classifiers used by our method, the random forest performs better.

Table 1 A comprehensive comparison of the performance of the proposed method against two types of state-of-the-art methods, optimal segmentation and Bayesian change point detection, on synthetic data. MtChD(RF) is our method with a random forest classifier; MtChD(MLP) is our method with an MLP classifier. DP + Normal (GLR eq.) is the DP segmentation method used with a normal loss function, which is equivalent to a GLR test that assumes a multivariate normal distribution. Six combinations of optimal segmentation methods are listed: DP is the dynamic programming segmentation algorithm, BinSeg is binary segmentation, Window is window-based change point detection, and BottomUp is bottom-up segmentation. The cost functions used are RBF (RBF kernel), L1 (\(L_{1}\) loss function), and L2 (\(L_{2}\) loss function). The last four rows are for Bayesian change point detection with a uniform or Geo (geometric) prior. Gaussian stands for the Gaussian likelihood function, IFM is the individual feature model [30], and FullCov is the full covariance model [30]. \(\mu(t_{0})\) and \(\sigma(t_{0})\) are the mean value and standard deviation of the inferred change point, and \(\mu(\alpha)\) and \(\sigma(\alpha)\) are the mean value and standard deviation of the inferred α. Bold values indicate change points that are closest to the correct value
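For reference, the sketch below shows how one of the Table 1 baselines, binary segmentation with an RBF cost, can be run with the ruptures library; the repeated-trial protocol and 70% subsampling described above are omitted for brevity.

```python
import numpy as np
import ruptures as rpt  # optimal segmentation baselines (BinSeg, DP, Window, BottomUp)

def binseg_change_point(X, t):
    """Estimate a single change point with binary segmentation and an RBF cost."""
    order = np.argsort(t)                     # ruptures expects a time-ordered signal
    algo = rpt.Binseg(model="rbf").fit(X[order])
    bkp = algo.predict(n_bkps=1)[0]           # index of the first post-change sample
    return t[order][bkp]                      # map the index back to the time axis
```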

Synthetic images

Our method can also identify changes in diverse high-dimensional data, such as text [18] and images. To illustrate this, we generate a series of synthetic 64 by 64 pixel gray scale images that qualitatively change at \(t_{0}=0.5\) from solid to hollow circles (Fig. 2). These images can represent, for example, organisms that were originally alive and then died; our task would then be to determine the moment an organism died, a finding that is very useful in the field of survival analysis [37]. The gray scale of the solid and hollow circles is \(\gamma = 0.8\) and the gray scale of the background is \(\gamma = 0.2\). To create more realistic data, we position the circles randomly within the image and inject different levels of Gaussian noise to model poor quality data. After adding noise, pixel gray scale values are truncated to the range \([0.0, 1.0]\). We also assign each image a random time t uniformly distributed between 0 and 1. For each noise level, we generate a dataset of 4,000 images.
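A minimal sketch of this image generator is given below; the circle radius, ring width, and placement rule are illustrative assumptions, since only the gray scale levels, image size, and noise model are fixed above.

```python
import numpy as np

def circle_image(t, t0=0.5, size=64, r=16, width=4, noise=0.4, rng=None):
    """Solid circle for t <= t0, hollow circle for t > t0, with Gaussian noise."""
    rng = rng or np.random.default_rng()
    cx, cy = rng.integers(r + width, size - r - width, 2)  # random circle center
    yy, xx = np.mgrid[:size, :size]
    d = np.hypot(xx - cx, yy - cy)                         # distance to the center
    img = np.full((size, size), 0.2)                       # background gray scale
    mask = (d <= r) if t <= t0 else (np.abs(d - r) <= width / 2)
    img[mask] = 0.8                                        # circle gray scale
    img += rng.normal(0.0, noise, img.shape)               # inject Gaussian noise
    return np.clip(img, 0.0, 1.0)                          # truncate to [0.0, 1.0]
```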

Figure 2

Example time series of synthetic images that change at \(t_{0}=0.5\), when a solid circle changes to a hollow circle. From top to bottom, each row shows images with an increasing noise level \(\sigma = 0.2, 0.4, 0.6, 0.8\), and 1.0

We check the robustness of the estimated change point against noise. Table 2 shows the inferred change point and the estimated value of α as a function of noise for the synthetic image data. Due to the spatial correlation of image data and the superior predictive power of the CNN classifier, the inferred change point is close to the true change point (often not statistically significantly different) and α is close to 1.0, even for very noisy image frames. Alternative methods were infeasible because of the high dimensionality and large size of the data.
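As an illustration of the kind of classifier our method can plug in for such data, a small before/after CNN for 64 by 64 gray scale images might look as follows; this architecture is an assumption for exposition, not necessarily the one used for Table 2.

```python
import torch.nn as nn

# Binary classifier: does an image come from before or after the candidate change?
cnn = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64x64 -> 32x32
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 2),  # two classes: before vs. after the candidate t0
)
```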

Table 2 Change points inferred for noisy synthetic images. The true value of the change point is \(t_{0} = 0.50\), where solid circles change into hollow circles under different levels of noise

Discovering changes in real-world data

We now demonstrate the ability of MtChD to identify changes in real-world data.

COVID-19 air quality

We first apply our method to air pollution data to see if pollution dropped around the time the COVID-19 pandemic occurred. We collected daily air quality data from January 1 to May 26, 2020 for major U.S. cities from AQICN (aqicn.org). This data includes daily concentrations of nitrogen dioxide, carbon monoxide, and fine particulates less than 2.5 microns across (PM2.5), totalling 4.3K observations for 37 cities across the U.S. once missing data are removed. We also include the population within 50 km of each city as a feature, because people within this area may have contributed to the concentration of pollutants. We can use our model to determine when the change started and compare these results to the gold standard: the dates stay-at-home orders were issued by states. These orders limited business and commercial activity, which likely led to the dramatic decline in pollution, and therefore act as the ground truth external events for RDDs. The earliest such order was announced in California on March 19, 2020 and the latest in South Carolina on April 7.

We compare our method to state-of-the-art algorithms in Table 3. Our method is the only one that inferred a reasonable change point for the data, March 21, 2020 ± 3 days, roughly in the middle of all state stay-at-home orders. We show the accuracy deviation for MtChD in Fig. 3. A random forest classifier gives better accuracy than the MLP, and the mathematical model fits the accuracy deviation well. Although our method can work with any classifier, the performance on a given dataset can be improved by choosing a classifier that best fits the data. Some empirical ways to determine which classifier to use are (a) choosing the classifier that gives the largest accuracy deviation or (b) choosing the classifier that gives the highest α.
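The accuracy deviation curve itself can be computed with a scan over candidate change times, as in the hedged sketch below: for each candidate, a classifier is trained to predict whether an observation falls before or after it, and its test accuracy is compared with the majority-class baseline. This is a simplified stand-in for our implementation; the model fitted to the resulting points (the solid lines in Fig. 3) is the accuracy deviation model described earlier in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def accuracy_deviation_curve(X, t, taus, seed=0):
    """Accuracy above the majority-class baseline for each candidate change time."""
    devs = []
    for tau in taus:
        y = (t > tau).astype(int)             # label: before (0) or after (1) tau
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
        base = max(y_te.mean(), 1 - y_te.mean())  # majority-class accuracy
        devs.append(acc - base)
    return np.array(devs)                     # peaks near the true change point
```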

Figure 3

Accuracy deviation curve for COVID-19 Air data. (a) Using a random forest classifier; (b) using an MLP classifier. The scatter points are accuracy deviations measured on the testing set and the solid lines are fitted using the proposed accuracy deviation model

Table 3 A comprehensive comparison of our method with previous methods on real-world datasets, COVID-19 Air and Khan Academy. We use the same abbreviations as in Table 1. For COVID-19, \(t_{0}\) is measured as the number of days since 01/01/2020. For Khan Academy, \(t_{0}\) is measured as a Unix timestamp, namely, the number of seconds since midnight 01/01/1970. Correct values are roughly 80 days for the COVID-19 air data and \(1.365 \times 10^{9}\) seconds for the Khan Academy data. Bold values indicate change points that are closest to the correct value

Khan Academy

As a second example, we apply our method to the learning platform Khan Academy (khanacademy.org), which offers courses on a variety of subjects in which students watch videos and test their knowledge by answering questions. The Khan Academy platform underwent substantial changes to its user interface around April 1, 2013 (or \(1.3648\times 10^{9}\) in Unix epoch time) [38], which affected user performance. This change acts as a ground truth event we want to detect. After discovering this event, we can fit regressions of scores before and after the event and determine, via an RDD, whether this policy significantly changed student performance scores.

Data was collected by Khan Academy over the period from June 2012 to February 2014 and contains 16K questions answered by 13K students, totalling 681K data points. Despite the large number of students, the data is very sparse: the vast majority of students were typically active for less than 20 minutes and never returned to the site. The performance data records whether the student solved the problem correctly on their first attempt and without a hint. When the user failed, they were able to attempt the problem again, and the number of attempts made on a problem is recorded. Additional features recorded include the time since the previous problem, the number of problems in a student session, and the number of sessions. Segmentation methods implemented in ruptures are not memory efficient; therefore, we sample only 0.5% of the data (about 3.5K entries) uniformly at random. For Bayesian change point detection, we sampled around 1.6K data points uniformly at random.

Both our method and the optimal segmentation algorithms can identify the change from the user performance data (Table 3), although the optimal segmentation algorithms have larger errors. Bayesian change point detection does not give a reasonable change point for this data. The accuracy deviation curve is shown in Fig. 4. The random forest and MLP classifiers have comparable performance when used to estimate change points.

Figure 4

Accuracy deviation curve for Khan Academy data. (a) Using a random forest classifier; (b) using an MLP classifier

Measuring effects of changes via regression discontinuity design

We demonstrate how we can use regression discontinuity design to measure the effects of changes on the population. Automatically discovered changes can therefore help uncover potential natural experiments in data.

Persistence and performance in learning on Khan Academy

Our analysis uncovered an abrupt change around April 2013 in the Khan Academy data (Sect. 4.2.2). The change affected user performance in only a fraction of all sessions, quantified by the parameter α in Table 3. This change was likely due to a major redesign of the platform's user interface [38], although we do not know exactly what changed. We found no indication that the population was any different before and after the change. Therefore, the April 2013 change could be used for an RDD, with some users "assigned" quasi-randomly to visit the platform before the interface change and some after. This created an effective control condition (before the change) and treatment condition (after the change). The external event allows us to control for some of the confounders when investigating correlates of performance in learning platforms. Specifically, comparing the treated group to the control group helps identify the link between persistence (working longer on problems first answered incorrectly) and performance (answering the problem correctly on the first attempt).

Figure 5(a) shows average performance over time, measured as the fraction of problems the user solved correctly on their first attempt. Performance decreases gradually for all users over the two-year period (blue line), aside from seasonal variation. However, for users working on problems that take more than 100 seconds to answer, i.e., hard problems, performance increases after the change (orange line). To estimate the effect of the change, we binned the data and fit the outcomes before and after the change as functions of time using two kernel models (see Appendix A.2 for details). The effect is strongest for users who solve hard problems correctly on their first attempt (Fig. 5(b)). At the same time, users became more persistent, i.e., more likely to continue working on a problem they did not solve correctly on the first attempt (Fig. 6(a)). The effect is bigger for users working on hard problems (Fig. 6(b)). Thus, the change had two effects: it made users working on (to them) hard problems more persistent, and this improved their performance on other hard problems, i.e., made them more likely to correctly solve those problems on the first try. The improvement in performance for these users was large, 10%, which corresponds to a full letter grade in a class setting. Psychological studies have identified traits, such as conscientiousness or grit, that allow some people to practice a skill so as to achieve mastery [39]. Our study supports the link between persistence and improved performance.
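As a minimal illustration of the RDD estimate (assuming a simple local-linear specification in place of the kernel models of Appendix A.2), the effect can be read off as the jump between two one-sided fits at the detected change point:

```python
import numpy as np

def rdd_effect(t, y, t0):
    """Discontinuity in the outcome y at the detected change point t0."""
    left, right = t <= t0, t > t0
    b_left = np.polyfit(t[left], y[left], 1)     # linear fit before the change
    b_right = np.polyfit(t[right], y[right], 1)  # linear fit after the change
    return np.polyval(b_right, t0) - np.polyval(b_left, t0)
```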

Figure 5

Performance in Khan Academy. (a) Performance vs. time in the Khan Academy data for problems binned by duration and number of attempts. (b) Change in performance for the binned data

Figure 6

Persistence in Khan Academy. (a) Persistence rate vs. time in the Khan Academy data for long (≥ 100 sec) and short (< 100 sec) duration questions. (b) Change in persistence rate for long and short duration problems

COVID-19 lockdowns reduced air pollution

We detect a change on Mar. 21, 2020 in the COVID-19 Air Quality data (Sect. 4.2.1). The change is consistent with the dates of the COVID-19 lockdown orders in the US, under which people had to stay at home to reduce the spread of the disease. We calculated the change in nitrogen dioxide levels before and after Mar. 21, 2020, as shown in Fig. 7. For both Manhattan and San Francisco, nitrogen dioxide levels drop significantly (by around 5 ppb) after the lockdown. The reduction in air pollution is likely due to reduced traffic after the lockdown. Our findings on the date and effect of the change are confirmed by Venter et al. [40].

Figure 7

Average nitrogen dioxide levels and their change before and after Mar. 21, 2020. (a) Manhattan; (b) San Francisco

