# Machine Learning–Driven Deduction Prediction Methodology for Power Grid Infrastructure Investment and Planning

Yujie Wu, et al.

May 8, 2022

## 1 Introduction

Continued reform of transmission and distribution pricing has raised the accuracy required of power grid enterprises' annual investment plans (Bin et al., 2017; He et al., 2018; Lv and Yang, 2020). At the same time, the wide adoption of information tools has created new challenges for improving investment management efficiency (Jiang et al., 2019; Sha et al., 2021). However, a literature search revealed few studies on investment management methods for infrastructure projects that meet the requirements of high-quality power grid development. Within power grid enterprises, annual investment plans are still prepared from manual experience alone, which makes it difficult to account comprehensively for timing characteristics and annual investment patterns. Consequently, a methodology is urgently needed that can deduce the investment schedule of power grid projects and uncover its inherent patterns and characteristics. This article offers some views on a deduction prediction methodology for the investment schedule of power grid projects that takes project properties into account.

## 2 Project Duration Prediction Based on Neural Network

The construction period of a power grid infrastructure project gives planners a reference for the deduction prediction of power grid infrastructure investment and planning. The project duration must be no shorter than a reasonable duration. Thus, for projects whose duration is shorter than the reasonable construction period, the minimum of the reasonable construction period for that voltage level is taken as the project construction period. The reasonable construction period is 10–19 months for 110 (66) kV projects, 13–22 months for 330 (220) kV projects, 15–24 months for 500 kV projects, and 16–25 months for 750 kV projects.
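The lower-bound rule above can be sketched in a few lines of Python. The dictionary keys and the function name `effective_duration` are illustrative choices, not identifiers from the article; only the minimum values come from the ranges quoted in the text.

```python
# Lower bounds (in months) of the reasonable construction period quoted in the text.
# The key strings are an assumed naming scheme for the voltage levels.
MIN_REASONABLE_MONTHS = {
    "110(66)kV": 10,
    "330(220)kV": 13,
    "500kV": 15,
    "750kV": 16,
}


def effective_duration(voltage_level: str, planned_months: float) -> float:
    """Return the planned duration, raised to the minimum reasonable period if it is shorter."""
    return max(planned_months, MIN_REASONABLE_MONTHS[voltage_level])
```

For example, a 500 kV project planned at 12 months would be treated as a 15-month project, while a 110 (66) kV project planned at 14 months keeps its own duration.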

A back-propagation (BP) neural network usually consists of three or more layers: an input layer, one or more hidden layers, and an output layer. Predicting project duration with a BP neural network involves three steps: first, the network structure is designed; second, the network is trained on historical data; and third, the trained network is used for prediction. Predicting the construction period of a power grid project therefore requires historical construction period data. The input layer variables are mainly the factors that significantly affect the construction period. In this research, the voltage level, construction scale, construction attributes, and regional factors of the infrastructure project are taken as the network inputs. The construction scale includes the substation capacity and the line length; the construction attributes consist of the project attributes and the engineering attributes; and the regional factors comprise the area where the project is located and whether it lies in the urban core area. The data samples must be normalized before the prediction model is established, so that the normalized data lie between 0 and 1.
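The following is a minimal sketch of the two ingredients described above: min-max normalization to [0, 1] and a three-layer network trained by back-propagation. It assumes one sigmoid hidden layer, a linear output, squared-error loss, and plain stochastic gradient descent; the article does not state these choices, and the class name `TinyBPNetwork` and all hyperparameters are hypothetical.

```python
import math
import random


def min_max_normalize(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in column]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyBPNetwork:
    """Input layer -> one sigmoid hidden layer -> one linear output (assumed design)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.1):
        """One back-propagation update; returns the loss measured before the update."""
        hidden, output = self.forward(x)
        err = output - target  # gradient of 0.5 * (output - target)**2 w.r.t. output
        for j, h in enumerate(hidden):
            # Hidden-layer gradient uses the output weight before it is updated.
            grad_hidden = err * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * err * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_hidden * xi
            self.b1[j] -= lr * grad_hidden
        self.b2 -= lr * err
        return 0.5 * err * err
```

In use, each input column (for example line length or substation capacity) would be passed through `min_max_normalize`, categorical properties such as voltage level would need a numeric encoding first, and `train_step` would be called repeatedly over the historical samples before `predict` is applied to new projects.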

## 3 Analysis Measure of Power Grid Investment Schedule Curves Similarity Based on DTW Distance

The volume of historical investment schedule curves for power grid projects is already very large and keeps growing over time (Ma et al., 2018). The key to this research is to fully extract the characteristic factors of these massive historical curves, develop a deduction prediction methodology suited to the investment schedule of power grid infrastructure projects, and then adjust the investment project library to make up for shortcomings in power grid development. Several studies have documented that most time-sequence analysis methods, including clustering algorithms, rely on distance measures. When two investment schedule time sequences are compared, the critical issue is how to handle distortion, which is characteristic of time sequences. Unlike ordinary static data, time-sequence data are linked through time: in investment schedule data, later values are affected by earlier ones. It is therefore essential to retain the time characteristics of the investment schedule during data analysis and mining.

The traditional method of extracting typical (quintessential) investment schedule curves of power grid projects is to manually select a relatively central and non-distorted curve among all investment schedule curves. However, this method does not generalize and is error-prone. In this article, a typical investment schedule model of power grid projects is formulated by applying data mining theory and techniques and by considering factors such as voltage level, construction scale, and construction attributes.

A time-sequence distance or similarity measure is indispensable in deduction prediction for power grid infrastructure investment and planning. It is one of the standards for measuring the similarity between different investment schedule curves and plays a critical role in time-sequence data mining. Similarity here refers to shapes that time sequences have in common, usually a shared trend or a similar pattern subsequence that may occur at different time points. Euclidean distance (ED) measures distance strictly between the values at the same moment. Dynamic Time Warping (DTW), in contrast, uses dynamic programming to adjust which moments of the two sequences are matched to each other, and finds the optimal warping path along which the distance between the sequences is smallest (Li et al., 2019; Choi et al., 2020; Cai et al., 2021). The algorithm therefore measures the overall shape similarity between time sequences. DTW better captures the general dynamic characteristics of the curves and applies when two curves have good overall similarity but are not completely aligned on the time axis. It thus compensates for the weakness of ED, which attends only to the numerical values at corresponding moments when describing the similarity of investment schedule curves. DTW can measure the distance between time sequences of unequal length and is robust to distortions along the time axis. The smaller the DTW distance, the more similar the investment schedule curves. In this article, abnormal investment schedule curves are eliminated with the DTW algorithm.
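As an illustration of the dynamic-programming idea, here is a minimal DTW sketch using absolute difference as the local cost. The article does not say how abnormal curves are identified from DTW distances, so the filtering rule in `drop_abnormal_curves` (mean DTW distance to the other curves above a threshold) is purely an assumption, as are both function names.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = smallest accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] matched again
                                    cost[i][j - 1],      # b[j-1] matched again
                                    cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


def drop_abnormal_curves(curves, threshold):
    """Keep curves whose mean DTW distance to the other curves is at most `threshold`.

    This outlier rule is an assumption made for illustration only.
    """
    kept = []
    for i, curve in enumerate(curves):
        others = [dtw_distance(curve, o) for j, o in enumerate(curves) if j != i]
        if not others or sum(others) / len(others) <= threshold:
            kept.append(curve)
    return kept
```

Two curves with the same shape but a one-step time offset, such as `[0, 0, 1, 2, 3]` and `[0, 1, 2, 3, 3]`, have a DTW distance of zero under this sketch, although their point-by-point differences are not zero.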

## 4 Investment Schedule Deduction Based on K-Shape and Neural Network

### 4.1 Insufficiency of K-Means Analysis

The classical K-Means clustering algorithm measures distance with the Euclidean distance. For investment schedule curves, this means summing the squared differences between two schedules at each moment. Because it only compares values in the corresponding dimension, this measure cannot capture high-dimensional features such as changes over time.
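A short sketch makes the limitation concrete. The function name `squared_euclidean` and the sample curves are illustrative only.

```python
def squared_euclidean(a, b):
    """Sum of squared point-by-point differences between two equal-length curves."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs curves of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two made-up curves with the same ramp shape, offset by one time step.
early = [0, 1, 2, 3, 3]
late = [0, 0, 1, 2, 3]
print(squared_euclidean(early, late))  # 3
```

The two curves differ only in timing, yet the measure reports a non-zero distance and cannot be applied at all when the curves have different lengths.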

Because the ED between investment schedule curves struggles to reflect their high-dimensional characteristics, a kernel method can be used for optimization (Tang et al., 2019): the curves are mapped nonlinearly into a high-dimensional feature space and clustered there, which increases the probability of linear separability. The clustering algorithm can then compute distances from the high-dimensional characteristics of the curves, and clustering effectiveness improves. However, the process is exceedingly cumbersome if the low-dimensional data are mapped explicitly into the high-dimensional space through a mapping function and the calculation is carried out there. Significant limitations of this approach are the choice of kernel function, data sparsity, and computational complexity.
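Kernel methods normally avoid the explicit mapping by evaluating distances in the feature space through the kernel function alone, using the identity ||φ(a) − φ(b)||² = k(a, a) − 2k(a, b) + k(b, b). The sketch below assumes a Gaussian (RBF) kernel with a hypothetical width parameter `gamma`; the article does not specify which kernel is used.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def kernel_squared_distance(a, b, kernel=rbf_kernel):
    """Squared distance between phi(a) and phi(b) without computing phi explicitly."""
    return kernel(a, a) - 2.0 * kernel(a, b) + kernel(b, b)
```

The result still depends on the chosen kernel and its parameters, which is one of the limitations noted above.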

### 4.2 Investment Schedule Curves Clustering

Several tasks in the time-sequence analysis of power grid investment schedules require an average sequence that summarizes a set of time sequences, and this average is the basis of the deduction prediction methodology for power grid infrastructure investment. The simplest way to extract an average sequence is to take the arithmetic mean of the corresponding coordinates of all sequences, which is the approach adopted by the K-Means algorithm. The K-shape algorithm differs from K-Means in both how it computes cluster centers and how it measures distance, and it offers high precision and efficiency (Paparrizos and Gravano, 2017). The proposed machine-learning-based deduction prediction framework for power grid infrastructure investment and planning is presented in Figure 1.

FIGURE 1. Proposed deduction prediction framework of power grid infrastructure investment and planning.

The cluster centroids are calculated with cross-correlation statistics. Cross-correlation is a statistical measure of the similarity between two investment schedule curves. Assume two series $\vec{x}=(x_1,\ldots,x_n)$ and $\vec{y}=(y_1,\ldots,y_m)$ of different lengths with $n<m$; zeros are appended to the end of $\vec{x}$ so that it has the same length $m$ as $\vec{y}$. With $\vec{y}$ kept fixed, $\vec{x}$ is slid over it, and at each shift the inner product of the overlapping points is computed. The cross-correlation coefficient is defined as follows:

$CC_{\omega}(\vec{x},\vec{y})=R_{\omega-m}(\vec{x},\vec{y}),\quad \omega\in\{1,2,\ldots,2m-1\},\qquad(1)$

where $R_{\omega-m}(\vec{x},\vec{y})$ is calculated as follows:

$R_{k}(\vec{x},\vec{y})=\begin{cases}\sum_{l=1}^{m-k}x_{l+k}\cdot y_{l}, & k\ge 0\\ R_{-k}(\vec{y},\vec{x}), & k<0.\end{cases}\qquad(2)$

Here, $R$ measures the similarity of $\vec{x}$ and $\vec{y}$ at each shift; the greater the value of $R$, the more similar the two sequences are at that shift.
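The cross-correlation sequence can be sketched directly from the definition of the shifted inner product used in the K-shape algorithm (Paparrizos and Gravano, 2017). The function names `pad_to`, `r_k`, and `cross_correlation` are illustrative, and list indices are zero-based, whereas the text counts from 1.

```python
def pad_to(x, length):
    """Append zeros so that x has the requested length."""
    return list(x) + [0.0] * (length - len(x))


def r_k(x, y, k):
    """Shifted inner product R_k(x, y) for equal-length sequences.

    For k >= 0 it sums x[l + k] * y[l]; a negative k swaps the roles of x and y.
    """
    if k < 0:
        return r_k(y, x, -k)
    m = len(x)
    return sum(x[l + k] * y[l] for l in range(m - k))


def cross_correlation(x, y):
    """Return [CC_1, ..., CC_(2m-1)], where CC_w = R_(w-m)(x, y)."""
    m = max(len(x), len(y))
    x, y = pad_to(x, m), pad_to(y, m)
    return [r_k(x, y, w - m) for w in range(1, 2 * m)]
```

For two identical curves the largest value sits in the middle of the returned list (zero shift); for a curve and a shifted copy of it, the position of the maximum indicates the shift that best aligns them.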

#### 4.2.1 Normalized Cross-Correlation

Normalized cross-correlation (NCC) describes the correlation between two samples, and its values lie in the range [-1, 1]. The smaller the NCC value, the less similar the two samples are; the larger the NCC value, the more similar they are. The coefficient normalization is defined as follows:

$NCC(\vec{x},\vec{y})=\frac{CC_{\omega}(\vec{x},\vec{y})}{\sqrt{R_{0}(\vec{x},\vec{x})\cdot R_{0}(\vec{y},\vec{y})}}.\qquad(3)$

#### 4.2.2 Shape-Based Distance

The smaller the shape-based distance (SBD), the higher the similarity between the sequences, and vice versa. SBD takes values between 0 and 2, and 0 means that the time sequences are entirely similar. The shape-based distance is defined as follows:

$SBD(\vec{x},\vec{y})=1-\max_{\omega}\left(\frac{CC_{\omega}(\vec{x},\vec{y})}{\sqrt{R_{0}(\vec{x},\vec{x})\cdot R_{0}(\vec{y},\vec{y})}}\right).\qquad(4)$
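A compact, self-contained sketch of the coefficient-normalized cross-correlation and the shape-based distance follows. The function names are illustrative, and returning an all-zero NCC sequence for an all-zero curve is a convention assumed here to avoid division by zero; the article does not cover that case.

```python
import math


def cross_correlation(x, y):
    """Cross-correlation at every shift k = 1-m, ..., m-1, zero-padding the shorter curve."""
    m = max(len(x), len(y))
    x = list(x) + [0.0] * (m - len(x))
    y = list(y) + [0.0] * (m - len(y))

    def r(a, b, k):
        return sum(a[l + k] * b[l] for l in range(m - k))

    return [r(x, y, k) if k >= 0 else r(y, x, -k) for k in range(1 - m, m)]


def ncc(x, y):
    """Cross-correlation divided by sqrt(R_0(x, x) * R_0(y, y))."""
    denom = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if denom == 0.0:
        # Assumed convention for an all-zero curve.
        return [0.0] * (2 * max(len(x), len(y)) - 1)
    return [c / denom for c in cross_correlation(x, y)]


def sbd(x, y):
    """Shape-based distance: 1 minus the largest NCC value over all shifts."""
    return 1.0 - max(ncc(x, y))
```

Under this sketch a curve has zero distance to any positively scaled or time-shifted copy of itself, which is the shape invariance that the point-by-point Euclidean distance lacks.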

#### 4.2.3 Time Sequences Centroid Calculation Based on SBD Distance Metric

The centroid of a set of investment schedule curves is itself a time sequence. The centroid calculation is treated as an optimization problem whose goal is to find the sequence with the minimum sum of squared distances to all other time sequences in the cluster. Since cross-correlation captures similarity rather than dissimilarity, the centroid is expressed instead as the sequence that maximizes the sum of squared similarities to all other time sequences.
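Written out with the normalized cross-correlation of Section 4.2.1, the objective can be stated as follows. This follows the formulation of the K-shape algorithm (Paparrizos and Gravano, 2017); the symbols $P_k$ for the set of curves in cluster $k$ and $\vec{\mu}_k$ for its centroid are notation introduced here, not taken from the article.

$\vec{\mu}_k^{*}=\arg\max_{\vec{\mu}_k}\sum_{\vec{x}_i\in P_k}\left(\frac{\max_{\omega}CC_{\omega}(\vec{x}_i,\vec{\mu}_k)}{\sqrt{R_{0}(\vec{x}_i,\vec{x}_i)\cdot R_{0}(\vec{\mu}_k,\vec{\mu}_k)}}\right)^{2}$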

The distance metric and centroid calculation of K-shape make it significantly better than K-Means. Initially, the input investment schedule sequences are randomly assigned to clusters, and the centroid of each cluster is computed. Clustering then proceeds iteratively, with two steps per iteration. In the first step, each investment schedule sequence is compared with all the computed centroids and assigned to the cluster with the nearest centroid. In the second step, the cluster centroids are updated. These two steps are repeated until the algorithm converges or the maximum number of iterations is reached. Through this iterative process, K-shape minimizes the sum of squared distances and produces uniform and well-separated clusters.
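The loop described above can be sketched as follows for equal-length curves. To keep the example short and dependency-free, the centroid update is simplified: each member is shifted to best match the previous centroid and the aligned members are averaged. This aligned mean is a stand-in chosen for illustration and is not the shape-extraction step of the original K-shape algorithm. All function and parameter names, including the optional `initial_labels`, are hypothetical.

```python
import math
import random


def _cc(x, y):
    """Cross-correlation of two equal-length curves at shifts k = 1-m, ..., m-1."""
    m = len(x)

    def r(a, b, k):
        return sum(a[l + k] * b[l] for l in range(m - k))

    return [r(x, y, k) if k >= 0 else r(y, x, -k) for k in range(1 - m, m)]


def sbd_and_shift(x, y):
    """Return (SBD, k), where shifting x by k gives the best match with y."""
    denom = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if denom == 0.0:
        return 1.0, 0  # assumed convention when one curve is all zeros
    cc = _cc(x, y)
    best = max(range(len(cc)), key=cc.__getitem__)
    return 1.0 - cc[best] / denom, best - (len(x) - 1)


def shift(x, k):
    """Aligned copy with aligned[l] = x[l + k], zero-filled outside the curve."""
    m = len(x)
    return [x[l + k] if 0 <= l + k < m else 0.0 for l in range(m)]


def k_shape_sketch(curves, n_clusters, max_iter=20, seed=0, initial_labels=None):
    rng = random.Random(seed)
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        labels = [rng.randrange(n_clusters) for _ in curves]  # random initial assignment
    centroids = [[0.0] * len(curves[0]) for _ in range(n_clusters)]
    for _ in range(max_iter):
        # Centroid update (simplified): aligned mean of each cluster's members.
        for c in range(n_clusters):
            members = [x for x, lab in zip(curves, labels) if lab == c]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            aligned = [shift(x, sbd_and_shift(x, centroids[c])[1]) for x in members]
            centroids[c] = [sum(col) / len(aligned) for col in zip(*aligned)]
        # Assignment: move every curve to the centroid with the smallest SBD.
        new_labels = [
            min(range(n_clusters), key=lambda c: sbd_and_shift(x, centroids[c])[0])
            for x in curves
        ]
        if new_labels == labels:
            break  # converged: no curve changed cluster
        labels = new_labels
    return labels, centroids
```

The loop stops as soon as an assignment pass leaves every label unchanged, or after `max_iter` passes otherwise.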

### 4.3 Investment Effectiveness Deduction

According to the estimated commissioning date recorded in the project library, combined with the schedule rules, the construction milestone plans are scheduled. At the same time, based on the deduction-predicted outcomes for power supply capacity, capacity-load ratio, the “N-1” criterion, new energy consumption (Ming et al., 2020; Husin et al., 2021; Zhang et al., 2022), and other project investment benefits (Spyrou et al., 2017; Chen et al., 2020), the grid investment project database is adjusted interactively until the expected goals are met, which yields the project investment plans. A BP neural network is again used for prediction, with project properties as the input and investment benefits as the output.

## 5 Discussion and Conclusion

Deducing the investment schedule of power grid projects is a complex task with extensive data, a heavy workload, and high technical requirements. Primary project information such as voltage level, construction scale, construction attributes, and location is fed to the neural network to predict the project duration. The typical (quintessential) investment schedule curves are then obtained by clustering the historical investment schedule curves with the K-shape algorithm. Finally, the investment schedule curve of each project selected for the next year is obtained by mapping the project onto the typical investment schedule curve of its category through the predicted construction period.
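The final mapping step might look like the sketch below. It assumes that the typical curve of a category is stored as cumulative investment shares (from 0 to 1) sampled evenly over normalized project time, and that it is stretched to the predicted duration by linear interpolation. The article does not describe the mapping in this detail, so the representation, the function name, and the parameters are all assumptions.

```python
def map_curve_to_duration(typical_cumulative, duration_months, total_investment):
    """Stretch a normalized cumulative curve to `duration_months` monthly values.

    typical_cumulative: cumulative investment shares sampled evenly over
        normalized project time (assumed representation).
    Returns the cumulative investment at the end of each month.
    """
    n = len(typical_cumulative)
    if n < 2 or duration_months < 1:
        raise ValueError("need at least two curve points and one month")
    schedule = []
    for month in range(1, duration_months + 1):
        pos = month / duration_months * (n - 1)  # position on the curve's index axis
        left = min(int(pos), n - 2)
        frac = pos - left
        share = typical_cumulative[left] * (1 - frac) + typical_cumulative[left + 1] * frac
        schedule.append(share * total_investment)
    return schedule
```

For instance, a three-point typical curve `[0.0, 0.5, 1.0]` mapped onto a four-month project with a total investment of 100 gives cumulative values of 25, 50, 75, and 100.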

Power grid projects are characterized by large investments, long construction periods, and large project reserves. A scientific and practical deduction prediction methodology for power grid infrastructure investment and planning can help the company arrange the power grid project library and the investment allocation plan reasonably and improve the efficiency of its construction investment. It can also play a crucial role in ensuring the steady development of the power grid and in advancing the company's investment planning and construction process.

## Author Contributions

Writing the original draft and editing: YW. Conceptualization: XL. Formal analysis: LZ and CL. Visualization and contribution to the discussion of the topic: WZ and TZ.

## Funding

This work was supported by the State Grid Science and Technology Project (No.5100-202123009A).

## Conflict of Interest

XL and CL were employed by the State Grid Hunan Electric Power Company Limited Economic & Technical Research Institute.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.