Improving the accuracy of short-term forecasting algorithms with Standardized Load Profiles and Support Vector Regression: a case study of Vietnam

Short-term load forecasting (STLF) plays an important role in building business strategies and in ensuring the reliable and safe operation of any electrical system. Many methods are used for short-term forecasts, including regression models, time series, neural networks, expert systems, fuzzy logic, machine learning and statistical algorithms. The practical requirement, however, is to minimize forecast errors in order to prevent power shortages or waste in the electricity market and to limit risk. This paper proposes a short-term load forecasting method that constructs a Standardized Load Profile (SLP) from past electrical load data and combines it with the Support Vector Regression (SVR) machine learning algorithm to improve the accuracy of short-term forecasts.


Introduction
Load forecasting is a topic in electrical systems that has been studied for a long time. There are two main approaches in this area: traditional statistical modeling of the relationship between the load and the factors affecting it (such as time series and regression analysis), and artificial intelligence and machine learning methods. Statistical methods assume that load data follow a pattern and try to forecast future load values using different time series analysis techniques. Intelligent systems are derived from mathematical expressions of human behavior and experience. Since the early 1990s in particular, neural networks have been among the most commonly used techniques in electrical load forecasting, because they assume that a nonlinear function relates historical values and some external variables to the future values of the load [1]. The approximation ability of neural networks has made their applications popular.
In recent years, an intelligent computing method based on Support Vector Machines has been widely used in load forecasting. In 2001, Bo-Juen Chen, Ming-Wei Chang, and Chih-Jen Lin [2] used the Support Vector Regression technique to solve an electrical load prediction problem (forecasting the maximum daily load for the next 31 days). This was a competition organized by EUNITE (European Network on Intelligent Technologies for Smart Adaptive Systems). The information provided included demand data for the past two years, daily temperatures for the past four years and local holiday events. The data were divided into two parts: one used for training (about 80-90%) and the rest used for testing the algorithm (about 10-20%). The set of training inputs included data from the previous day, the previous hour, the previous week, and the average of the previous week. Their approach won the competition.
Since then, a number of studies have explored different techniques for optimizing SVR for load forecasting [3] - [10]. The main reason for using SVM in load forecasting is that it can readily model the load curve and the relationship between the load and the factors that drive changes in demand (such as temperature, economic conditions and demographics).
However, some problems arise when the above algorithms are applied in practice:
- Climate conditions always play an important role in load forecasting because load demand depends on them. When forecasting the load for the period after the test set, it is very difficult to forecast the weather and climate values used as inputs to the algorithm, and these values are often not available.
- Electrical load samples contain hidden patterns that tend to resemble the previous load pattern. However, this leads to a wrong forecast for the following days if the day type differs from that of the previous day or if an event affects the load. Therefore, using a dataset made up of the previous day, the previous hour, the previous week and the average of the previous week carries many risks if the load patterns are not identical.
- If the forecast horizon is longer than the window of past data (more than 7 days, because the algorithm uses the previous week's values), inputs needed to run the algorithm will be missing.
- In addition, for Asian countries that use the lunar calendar (such as Vietnam), one of the most difficult and unpredictable issues is the Lunar New Year (usually in late January or early February) and other lunar-calendar holidays (such as the Hung Kings' Anniversary). Because the solar and lunar calendars do not align, the load patterns on the same solar dates differ from year to year, so the forecasts for these periods often have large errors.
For this reason, the paper proposes building a Standardized Load Profile (SLP) from the historical load dataset and using it as a training dataset. This input dataset is combined with the Support Vector Regression (SVR) algorithm to improve the accuracy of short-term forecasts, to solve the problem of the deviation between the solar and lunar calendars, and to overcome the limited window of input data.
The SLP will be built for all 365 days, that is, 8,760 hourly cycles, in a year. It is an important dataset throughout the training, testing and forecasting process. The SLP standardizes load patterns by hour, by day, by season, and by special day type (including lunar dates). It therefore helps to resolve the difficulties mentioned above and improves the quality of electrical load forecasting.

Methodology
Observing the February load profiles of Ho Chi Minh City over the years (Figure 1), we see large fluctuations in the shape of the load curve from year to year. This makes using historical data to forecast this period extremely complicated. In practice, the algorithms used for forecasting in Vietnam have to go through an intermediate step that converts these months into regular months (without holidays or the Lunar New Year). After the calculation, either the forecast result is converted back or it is accepted with a large error. Commercial software from foreign vendors all has this problem.

Standardized Load Profiles (SLP)
Observing the load profiles of the days of a week and of some special holidays of the year in the Ho Chi Minh City area (Figure 2), we see that the weekdays from Tuesday to Friday differ little from each other and share the same load curve. The load profile on Monday differs from that of the normal days between 0:00 and 9:00 because of demand carried over from Sunday.
The load profile on Saturday changes only slightly compared with normal days: the load demand mainly decreases in the evening as the weekend starts. The load profile on Sunday is completely different from that of normal days (the demand for electricity is low). The load curves of the New Year and the Lunar New Year are different again: they are almost flat and the load demand is quite low because these are holidays. The load demand is lowest during the Lunar New Year, because this is the longest holiday of the year (it may last from 6 to 9 days).
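The day types described above can be sketched as a small lookup. This is only an illustration: the type names, the `day_type` function and the two-entry `HOLIDAYS` table are assumptions made for this example, and a real holiday table would list every day of each holiday period, with lunar dates already converted to solar dates.

```python
from datetime import date

# Hypothetical holiday table, for illustration only. A real table would cover
# every day of each holiday period, with lunar dates converted to solar dates.
HOLIDAYS = {
    date(2018, 1, 1): "new_year",
    date(2018, 2, 16): "lunar_new_year",  # first day of the Lunar New Year 2018
}


def day_type(d: date) -> str:
    """Map a calendar date to one of the day types discussed in the text."""
    if d in HOLIDAYS:
        return HOLIDAYS[d]
    weekday = d.weekday()  # Monday = 0 ... Sunday = 6
    if weekday == 0:
        return "monday"
    if weekday <= 4:
        return "weekday"  # Tuesday to Friday share one load shape
    if weekday == 5:
        return "saturday"
    return "sunday"


for d in (date(2018, 2, 16), date(2018, 2, 19), date(2018, 2, 25)):
    print(d.isoformat(), day_type(d))
```

Checking holidays before the weekday matters here: a holiday that falls on a Friday should use the flat holiday profile, not the regular weekday one.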
Standardized Load Profiles (SLP) are built by dividing the load recorded in each 60-minute cycle by the corresponding maximum load. We need to build SLPs for all 365 days of the year. Some typical SLPs are shown in the accompanying figure. Based on the SLP of each cycle of the past data set, we can build the SLP data set for future forecast periods. This should be accurate for each cycle, each type of day (holidays, weekdays, working days, etc.), each week and each month. The SLP is therefore a distinctive feature and an important input to the training of the machine learning algorithms (SVR, NN). It is also used to rebuild load curves, from which we can estimate data that were lost or not recorded during measurement.
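A minimal sketch of this normalization follows. It assumes that the maximum is the daily peak and that a load curve is rebuilt by scaling the profile with a known or forecast peak; the source does not spell out either detail. The function names and the sample loads are hypothetical.

```python
def standardized_profile(hourly_load):
    """Return the SLP of one day: each 60-minute value divided by the daily peak."""
    if len(hourly_load) != 24:
        raise ValueError("expected 24 hourly values")
    peak = max(hourly_load)
    if peak <= 0:
        raise ValueError("peak load must be positive")
    return [value / peak for value in hourly_load]


def rebuild_load_curve(profile, peak):
    """Scale an SLP back to absolute load using a known or forecast peak."""
    return [share * peak for share in profile]


# Hypothetical hourly loads (MW) for one working day, for illustration only.
day_load = [1800, 1700, 1650, 1600, 1620, 1750, 2000, 2400,
            2900, 3200, 3350, 3300, 3100, 3250, 3400, 3380,
            3300, 3150, 3200, 3250, 3100, 2800, 2400, 2000]

slp = standardized_profile(day_load)
rebuilt = rebuild_load_curve(slp, max(day_load))
print([round(s, 3) for s in slp])
```

Because every profile peaks at 1, days with different absolute demand but the same shape map to the same SLP, which is what allows profiles to be grouped by day type.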

Support vector regression (SVR)
A key feature of SVR is that it provides a sparse solution. That is, building the regression function does not require all the data points in the training set. The points that contribute to the construction of the regression function are called support vectors. The prediction for a new data point depends only on the support vectors [5] - [6].
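The regression function has the form

$$y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}} \phi(\mathbf{x}) + b$$

Thus, the goal of SVR training is to find $\mathbf{w}$ and $b$ [7] - [10] for the training set $\{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_N, t_N)\} \subset \mathbb{R}^n \times \mathbb{R}$. In a simple regression problem, $\mathbf{w}$ and $b$ are found by minimizing the regularized error function

$$\frac{1}{2} \sum_{n=1}^{N} \{y(\mathbf{x}_n) - t_n\}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

To get a sparse solution, the quadratic error term is replaced with the ε-insensitive error function. Its characteristic is that if the absolute value of the difference between the predicted value $y(\mathbf{x})$ and the target value $t$ is less than $\varepsilon$ (with $\varepsilon > 0$), the error is considered to be zero:

$$E_\varepsilon(y(\mathbf{x}) - t) = \begin{cases} 0, & |y(\mathbf{x}) - t| < \varepsilon \\ |y(\mathbf{x}) - t| - \varepsilon, & \text{otherwise} \end{cases}$$

We must now minimize the regularized error function

$$C \sum_{n=1}^{N} E_\varepsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2} \|\mathbf{w}\|^2$$

To allow some points to lie outside the ε-tube (Figure 4), slack variables are added. Each data point $\mathbf{x}_n$ needs two slack variables $\xi_n \geq 0$ and $\hat{\xi}_n \geq 0$, where $\xi_n > 0$ corresponds to a point with $t_n > y(\mathbf{x}_n) + \varepsilon$ (outside and above the tube) and $\hat{\xi}_n > 0$ corresponds to a point with $t_n < y(\mathbf{x}_n) - \varepsilon$ (outside and below the tube). In the standard formulation, the error function then becomes

$$C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|\mathbf{w}\|^2$$

subject to $t_n \leq y(\mathbf{x}_n) + \varepsilon + \xi_n$ and $t_n \geq y(\mathbf{x}_n) - \varepsilon - \hat{\xi}_n$. Introducing Lagrange multipliers $a_n$ and $\hat{a}_n$ gives the dual function

$$\tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (a_n - \hat{a}_n)(a_m - \hat{a}_m) k(\mathbf{x}_n, \mathbf{x}_m) - \varepsilon \sum_{n=1}^{N} (a_n + \hat{a}_n) + \sum_{n=1}^{N} (a_n - \hat{a}_n) t_n$$

where $k$ is the kernel function:

$$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{\mathsf{T}} \phi(\mathbf{x}')$$

The dual function is maximized subject to the constraints

$$0 \leq a_n \leq C, \qquad 0 \leq \hat{a}_n \leq C, \qquad \sum_{n=1}^{N} (a_n - \hat{a}_n) = 0$$

and the resulting prediction is $y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b$.

Thus, an SVR that uses the ε-insensitive error function and the Gaussian kernel function $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$ has three parameters: the regularization coefficient $C$, the parameter $\gamma$ of the Gaussian kernel function, and the width of the tube $\varepsilon$ [7]. All three parameters affect the prediction accuracy of the model and need to be selected carefully.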
- If C is too large, priority is given to the training error, which leads to a complex model that overfits easily. If C is too small, priority is given to keeping the model simple, which leads to an overly simple model and reduces the prediction accuracy.
- The role of ε is similar. If ε is too large, there are fewer support vectors, which makes the model too simple. If ε is too small, there are many support vectors, which leads to a complex model that is more likely to overfit.
- The γ parameter reflects the correlation between the support vectors and also affects the prediction accuracy of the model.
Processed historical data (power consumption, capacity and temperature recorded in 24 cycles of 60 minutes each), together with the Standardized Load Profiles (SLP), are fed into modules that build regression functions with the SVR (Support Vector Regression) and NN (Neural Network) algorithms (Figure 5).
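The ε-insensitive error function and the Gaussian kernel described above can be illustrated with a short sketch. The function names and sample values are hypothetical, and the predictions are held fixed rather than refitted. Under that assumption the example only shows that a wider tube leaves fewer points with non-zero error, which are the points that can end up as support vectors.

```python
import math


def eps_insensitive_error(prediction, target, eps):
    """Zero inside the tube of half-width eps, linear outside it."""
    return max(0.0, abs(prediction - target) - eps)


def gaussian_kernel(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def count_outside_tube(predictions, targets, eps):
    """Number of points with a non-zero epsilon-insensitive error."""
    return sum(1 for y, t in zip(predictions, targets)
               if eps_insensitive_error(y, t, eps) > 0.0)


# Hypothetical normalized predictions and targets, for illustration only.
predictions = [0.50, 0.62, 0.71, 0.80, 0.95]
targets = [0.52, 0.60, 0.80, 0.78, 0.70]

narrow = count_outside_tube(predictions, targets, eps=0.01)
wide = count_outside_tube(predictions, targets, eps=0.10)
print("outside narrow tube:", narrow, "outside wide tube:", wide)

# A larger gamma makes the kernel decay faster with distance.
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=2.0))
```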

Research models
We then use the above data set to check and evaluate the error of the regression functions. The regression function with the smallest error is chosen as the regression function for the next forecast phase.
The SLP data set in 24 cycles for the forecast period (including holidays, etc.) and the forecast temperature in 24 cycles for the corresponding period are the inputs to the selected regression function, which produces forecast results in 24 cycles for a period of 7 to 30 days.

Input data
The article uses EVNHCMC data from January 1, 2015, to November 17, 2018, to run the test models. After preprocessing, the data set is divided into two parts, a training set and a test set, where the test set is the last 30 days of the data set. Alternatively, the data set is divided into phases to test the forecast results over different time periods.
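The hold-out split described above can be sketched as follows, assuming one record per 60-minute cycle in chronological order. The function name `split_last_days` and the placeholder records are hypothetical; only the date range and the 30-day test window come from the text.

```python
from datetime import date


def split_last_days(records, test_days=30, cycles_per_day=24):
    """Hold out the final test_days of hourly records as the test set."""
    n_test = test_days * cycles_per_day
    if len(records) <= n_test:
        raise ValueError("not enough records for the requested test window")
    return records[:-n_test], records[-n_test:]


# Number of days from January 1, 2015, to November 17, 2018, inclusive.
n_days = (date(2018, 11, 17) - date(2015, 1, 1)).days + 1

# Placeholder records: one integer index per hourly cycle.
records = list(range(n_days * 24))

train_set, test_set = split_last_days(records)
print(n_days, len(train_set), len(test_set))
```

Splitting by position rather than at random keeps the test window strictly after the training data, which matches how a forecast is used in practice.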
Input data for training algorithms include: capacity (Pmax/Pmin) in 60-minute cycles; temperature (max / min) in 60-minute cycles; standardized load profiles of 24 hours of day; list of holidays and Lunar New Year in the forecast year.
The mean absolute percentage error (MAPE) is used to evaluate the error of the models.
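For reference, MAPE is the mean of the absolute errors relative to the actual values, expressed in percent. The sketch below computes it and then picks the candidate with the smallest error, in the spirit of the selection step described in the research models. The model names and all load values are hypothetical.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical actual loads and candidate forecasts (MW), for illustration only.
actual = [2000.0, 2500.0, 3000.0, 2800.0]
candidates = {
    "model_a": [2100.0, 2450.0, 3150.0, 2700.0],
    "model_b": [1990.0, 2550.0, 2950.0, 2850.0],
}

errors = {name: mape(actual, forecast) for name, forecast in candidates.items()}
best_model = min(errors, key=errors.get)
print(errors, best_model)
```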
The algorithms are programmed in MATLAB and the results are exported to Excel files for further analysis.

SVR Models
The input parameters of the SVR models must be selected correctly: the regularization coefficient C, the width of the tube ε and the parameter γ of the Gaussian kernel function (Table 1). All models use the same input data set. Some typical SVR model parameters are proposed:

RFR models
Random forest regression (RFR) uses a set of regression trees, each with a different set of rules, to perform nonlinear regression. The algorithm builds a total of 20 trees with a minimum leaf size of 20. The leaf size is constrained in order to control overfitting and achieve high performance [13] - [14]. The algorithm uses the same input data set as the other models.

Neural Network Models
We use feedforward neural network models with the same input variables and training data set as above. A network architecture with one hidden layer of size 10 and a sigmoid activation function is used. In addition, a neural network with a three-hidden-layer architecture is used, in which the first hidden layer has a size of 10, the second hidden layer has a size of 8 and the third hidden layer has a size of 5.
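The two architectures can be sketched as a plain forward pass. This illustrates the layer sizes only, not the training procedure. The number of input features, the random initial weights, the linear output layer and all function names are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(sizes, rng):
    """Create (weights, biases) for each pair of consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(x, layers):
    """Sigmoid on hidden layers, linear output layer (an assumption)."""
    activation = list(x)
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activation = z if is_output else [sigmoid(v) for v in z]
    return activation


rng = random.Random(0)
N_FEATURES = 4  # hypothetical number of input variables

shallow = init_network([N_FEATURES, 10, 1], rng)     # one hidden layer of 10
deep = init_network([N_FEATURES, 10, 8, 5, 1], rng)  # hidden layers 10, 8, 5

x = [0.8, 0.6, 0.9, 0.0]  # hypothetical normalized inputs
print(forward(x, shallow), forward(x, deep))
```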

Results and analysis:
We run the forecasts for February 2018 (the month of the Lunar New Year) to assess the error of the models.

The model whose input is the load of the previous day, previous week and previous month
Processed historical data (power consumption, capacity and temperature recorded in 24 cycles of 60 minutes each), together with the Standardized Load Profiles (SLP), are fed into modules that build regression functions with the SVR, Neural Network and Random Forest algorithms (Figure 6). The regression function with the smallest error is chosen as the regression function for the next forecast phase (Table 2). The model Yts4 is selected as the forecasting model.

Forecast results for February 2018
Considering the model's forecast results for February, we see a big difference between the actual and forecast values (Figure 7). The reason is that the historical data of January 2018 (7, 14 and 30 days before the forecast date) were used as the input for training the model.
The regression function with the smallest error is chosen as the regression function for the next forecast phase (Table 3).
Thus, the experiments show that using the Standardized Load Profile (SLP) as the input dataset for the modules that build the forecasting regression function is effective and gives forecasts with low errors. It resolves the deviation between the solar and lunar dates, especially in the months that contain the Lunar New Year.