# How unique is weekly smart meter data? – Energy Informatics

#### By Dejan Radovanovic, Andreas Unterweger, Günther Eibl, Dominik Engel and Johannes Reichl

Sep 7, 2022

The methodology described in the “Methodology” section is applied to two different scenarios: subsets of 25 and 100 households representing a small and a medium-sized residential area, respectively. Before discussing the results, we describe the parameter selection for our proposed method for the two scenarios.

### Parameter selection

The selection of parameters is described and demonstrated for the small residential area consisting of 25 households. The parameters for the medium-sized residential area consisting of 100 households are derived analogously.

Only two parameters need to be selected: p, the number of dimensions used for the fingerprint (step 4 in Fig. 1), and k, the number of neighbors considered during matching (steps 5 and 6). The value of p is varied between 5 and 50, and the value of k between 1 and 15. Figure 2 illustrates the median accuracy (Y axis) over all weeks of all households with respect to the different values of p and k.
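
The matching step that k controls can be sketched as a nearest-neighbor vote over weekly fingerprints. The following is a minimal sketch, not the authors' implementation: it assumes each week has already been reduced to a p-dimensional fingerprint vector, and it assumes Euclidean distance with a leave-one-out majority vote. The function name `per_household_accuracy` and the synthetic data are hypothetical.

```python
import numpy as np


def per_household_accuracy(fingerprints, household_ids, k=1):
    """Leave-one-out k-NN matching accuracy per household (illustrative sketch).

    fingerprints:  (n_weeks, p) array, one p-dimensional fingerprint per week.
    household_ids: (n_weeks,) array with the household label of each week.
    """
    X = np.asarray(fingerprints, dtype=float)
    y = np.asarray(household_ids)
    # Pairwise squared Euclidean distances between all weekly fingerprints
    # (assumption: the source does not state the distance measure here).
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # A week must not be matched with itself.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest other weeks for every week.
    nearest = np.argsort(dist, axis=1)[:, :k]
    predicted = np.empty_like(y)
    for i, idx in enumerate(nearest):
        labels, counts = np.unique(y[idx], return_counts=True)
        # Majority vote; ties go to the smallest label.
        predicted[i] = labels[np.argmax(counts)]
    # Fraction of correctly matched weeks, separately for each household.
    return {h: float(np.mean(predicted[y == h] == h)) for h in np.unique(y).tolist()}


# Synthetic demo data: two clearly separated households, 10 weeks each, p = 5.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.1, (10, 5)), rng.normal(5.0, 0.1, (10, 5))])
y_demo = np.array([0] * 10 + [1] * 10)
acc = per_household_accuracy(X_demo, y_demo, k=1)
median_acc = float(np.median(list(acc.values())))
```

Taking the median over the returned dictionary gives a single summary value per (p, k) pair, which is the kind of quantity a sweep over p and k would compare.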

As can be seen, the accuracy depends much more strongly on p (X axis) than on k, whose values are marked at each value of p by pluses for the training set and crosses for the test set. For the sake of visibility, only \(k=1,5,9,13\) are depicted as single points. The fact that all four points are very close to each other for all values of p shows that the dependency on k is weak. Thus, \(k=1\) neighbor is chosen for the sake of simplicity.

However, the effect of p is significant: Fig. 2 shows that \(p=5\) dimensions are too few, since the accuracies of both the training and the test set are low. This indicates that not enough of the available information is used to distinguish different households. For \(10 \le p \le 20\), the accuracies of both the training set (dash-dotted, light grey line) and the test set (solid, dark grey line) increase. For larger values of p, the training accuracy stays high, but the test accuracy drops compared to the training set. This indicates that the features learned during training are mostly specific to the households of the training set and do not generalize to the households of the test set.
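
The reasoning in this paragraph amounts to comparing training and test accuracy for each candidate p. A minimal sketch of that comparison follows. It assumes the median accuracies per p have already been computed, the function name `summarize_sweep` is hypothetical, and picking the p with the highest test accuracy is just one simple selection rule, not necessarily the one used by the authors. The numbers are made up for illustration and are not read from Fig. 2.

```python
def summarize_sweep(train_acc, test_acc):
    """Compare training and test accuracy for each candidate p.

    train_acc, test_acc: dicts mapping p -> median accuracy in [0, 1].
    Returns the p with the highest test accuracy and the train/test gap per p.
    """
    candidates = sorted(test_acc)
    # A large positive gap suggests features that fit the training households
    # but do not generalize to the test households.
    gap = {p: train_acc[p] - test_acc[p] for p in candidates}
    # max() returns the first maximum it encounters, so ties are resolved
    # in favor of the smallest p.
    best_p = max(candidates, key=lambda p: test_acc[p])
    return best_p, gap


# Made-up numbers for illustration only, not values from the paper.
train = {5: 0.30, 10: 0.50, 20: 0.70, 40: 0.75}
test = {5: 0.25, 10: 0.45, 20: 0.60, 40: 0.40}
best_p, gap = summarize_sweep(train, test)
```

In this toy sweep, the gap is small for low p (both accuracies are low because too little information is used) and widest for the largest p, where the training accuracy stays high while the test accuracy falls.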

Thus, a value of \(p=25\) is used for the small scenario with 25 households. Analogously, a value of \(p=20\) is used for the medium-sized scenario with 100 households, as can be seen from Fig. 3. The value of \(k=1\) is used for both scenarios.

### Matching performance

In this section, the matching performance achieved with the parameters selected in the previous section is assessed. Figure 4 depicts the per-household matching accuracy for the training set (light grey) and the test set (dark grey) for both scenarios (25 and 100 households per set, respectively). The black dots illustrate the individual per-household matching accuracy.

The overall accuracy within the test set (dark grey) is surprisingly high, considering the simplicity of the approach and the difficulty of the corresponding classification problem with 25 and 100 classes, respectively. For reference, guessing the correct household (class) randomly is expected to yield an accuracy of \(\textit{acc}_{h,rand}=1/n_{Test}\), i.e., 4% and 1% for a 25-household and a 100-household set, respectively. This reference (guessing) accuracy is depicted as thick dashed lines in Fig. 4.

Compared to random guessing, the median accuracy of the proposed methodology is roughly 16 and 35 times higher for the small and medium-sized residential areas, respectively. Note that the difficulty of the problem increases with the number of households, which explains why the absolute accuracy is lower for the medium-sized residential area than for the small one. Yet, relative to guessing, the proposed methodology performs significantly better in the medium-sized case.
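
The comparison with random guessing is simple arithmetic and can be checked in a few lines. This is a sketch with hypothetical function names. The accuracies of 0.64 and 0.35 used below are merely what the rounded factors of 16 and 35 would imply for 25 and 100 households; they are not figures reported in the text.

```python
def random_guess_accuracy(n_households: int) -> float:
    """Expected accuracy when picking one of n_households uniformly at random."""
    if n_households < 1:
        raise ValueError("n_households must be at least 1")
    return 1.0 / n_households


def improvement_factor(accuracy: float, n_households: int) -> float:
    """How many times better an accuracy is than uniform random guessing."""
    return accuracy / random_guess_accuracy(n_households)


baseline_small = random_guess_accuracy(25)    # 1/25
baseline_medium = random_guess_accuracy(100)  # 1/100

# Implied (approximate) accuracies, derived from the rounded factors only.
factor_small = improvement_factor(0.64, 25)
factor_medium = improvement_factor(0.35, 100)
```

Because the baseline shrinks as the number of households grows, a lower absolute accuracy in the larger set can still correspond to a larger factor over guessing.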

The black dots in Fig. 4 depict the matching accuracies of the individual households. For some households, the accuracy is nearly 100%, which implies that the corresponding household can be identified based on an arbitrary single week of a year. This is surprising, as one would expect the seasonal differences to have a significant impact on the consumption patterns throughout the weeks of a year.

The subsequent privacy implication is that some households exist which can be identified very easily from a single, arbitrary week’s worth of energy consumption with an approach that uses off-the-shelf algorithms. A number of other households cannot be detected well, i.e., they have a low matching accuracy. However, the matching accuracy for these households is still much better than guessing.

### Extreme households

The question arises as to what makes the identification of a household easier or harder, i.e., why the matching accuracy is relatively high or relatively low, respectively. As a first attempt to answer this question, a preliminary descriptive analysis is provided.

Based on the matching results, the most extreme households, i.e., those with the highest and the lowest matching accuracy, are visualized. The consumption data of a whole year of a household is illustrated as a heatmap. The X axis denotes the days of the year from left to right, the Y axis denotes the time of day from top to bottom in intervals of 15 min. The color of each 15-min interval depicts the associated energy consumption in kWh. Dark (purple) represents 0 kWh and bright (yellow) represents 1.4 kWh.
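
The heatmap layout described above can be sketched as a reshape of the yearly series of readings. This is a minimal sketch under stated assumptions: the readings are a flat, gap-free sequence with exactly 96 intervals of 15 min per day (so days shortened or lengthened by daylight saving time would need separate handling), and the function name `to_heatmap_matrix` is hypothetical.

```python
import numpy as np


def to_heatmap_matrix(readings_kwh, intervals_per_day=96):
    """Arrange a flat series of 15-min readings as (time of day) x (day).

    Returns an array of shape (intervals_per_day, n_days): each column is one
    day (left to right), each row one 15-min slot (top to bottom).
    """
    readings = np.asarray(readings_kwh, dtype=float)
    if readings.size % intervals_per_day != 0:
        raise ValueError("number of readings is not a whole number of days")
    # The row-major reshape yields one row per day; transposing puts days on
    # the X axis and the time of day on the Y axis.
    return readings.reshape(-1, intervals_per_day).T


# Placeholder data: the running index of each interval over 365 days.
demo = to_heatmap_matrix(np.arange(96 * 365))
```

The resulting matrix could then be rendered, for example, with `matplotlib.pyplot.imshow(demo, aspect="auto", cmap="viridis", vmin=0, vmax=1.4)`; the `viridis` colormap is an assumption that happens to match the dark purple to bright yellow scale described above.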

Figure 5 shows the energy consumption for the household with the highest matching accuracy within the test set of the small residential area. One can see that the consumption is quite regular, i.e., the consumption barely changes between weeks of the periods from April to November, and December to March, respectively. Note that the apparent 1-h time shifts in March and October are most likely due to daylight saving time.

The regular rectangular areas might be from a pool pump as proposed in Burkhart et al. (2018). While this pattern is not the same throughout the whole year, it is comparatively regular over periods of multiple weeks. This suffices as the proposed methodology only needs to find one of the few similar weeks. The identifiability seems to be related to periodic behavior due to the dominance of Fourier and Wavelet features but requires further investigation in future work.

Figure 6 shows the household with the lowest matching accuracy within the test set of the small residential area. While its consumption is comparatively regular over the year, it does not show any remarkable features which appear over multiple consecutive weeks. Thus, with the proposed methodology, any given week of this household shares more similarities with weeks from other households than it does with weeks from the same household.

The extreme households of the medium-sized residential area exhibit similar characteristics to those of the small residential area described above. For the sake of completeness, the corresponding heatmaps are visualized in Figs. 7 and 8.

Note that this analysis is a first attempt at an explanation. Future analyses might offer further insight into the relevant household-specific characteristics which impact matching accuracy.