When I attended the Business Forecasting class at the University of Southern California, I participated in a team-based Data Science competition. The goal was to predict a year of hourly electricity consumption for a city on behalf of a hypothetical energy supply company. More precisely, the dataset contained a four-year time series (2002-2005) of hourly consumption, together with four measures of the temperature recorded during each hour (average, median, minimum, and maximum).
Besides forecasting hourly consumption for 2006, each team had to provide predictions of the daily peak load and the hour of the day at which it would occur.
First of all, my team and I carried out exploratory data analysis of the dataset. In particular, we wanted to investigate the different layers of seasonality present in the time series. Since the data were hourly, we hypothesised that electricity consumption could vary at several levels:
Understanding these different levels of seasonality in depth proved essential for modeling the time series well. For example, we found that weekly seasonality does not appear to be statistically significant, and that the relationship between electricity demand and temperature flips between summer, when higher temperatures correspond to higher consumption, and winter, when the opposite is true.
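As a rough illustration of this phase, the sketch below computes hourly and monthly load profiles and the load-temperature correlation split by season. The file and column names (hourly_load.csv, load, temp_avg) are placeholders, not the actual competition schema.

```python
import pandas as pd

# Hourly load and temperature data; "timestamp", "load", and "temp_avg" are placeholder names.
df = pd.read_csv("hourly_load.csv", parse_dates=["timestamp"], index_col="timestamp")

# Average load by hour of the day and by month highlights the daily and yearly cycles.
hourly_profile = df["load"].groupby(df.index.hour).mean()
monthly_profile = df["load"].groupby(df.index.month).mean()

# Load-temperature correlation by season: positive in summer (cooling demand),
# negative in winter (heating demand).
summer = df[df.index.month.isin([6, 7, 8])]
winter = df[df.index.month.isin([12, 1, 2])]
print("Summer load-temperature correlation:", summer["load"].corr(summer["temp_avg"]))
print("Winter load-temperature correlation:", winter["load"].corr(winter["temp_avg"]))
```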
After exploring the time series, the second phase of the project was the construction of features for the predictive models. First, my team and I built dummy variables to model the different layers of seasonality, such as the hour of the day, the day, the month, the year, and the quarter. We also constructed a trend variable and lagged values of past energy consumption and past temperature.
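A minimal sketch of this feature-engineering step, assuming a DataFrame df indexed by hourly timestamps with the placeholder columns load and temp_avg, could look like this:

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Calendar dummies, a linear trend, and lagged load/temperature values."""
    out = df.copy()

    # Calendar components for the different layers of seasonality.
    out["hour"] = out.index.hour
    out["dayofweek"] = out.index.dayofweek
    out["month"] = out.index.month
    out["quarter"] = out.index.quarter
    out["year"] = out.index.year
    out = pd.get_dummies(out, columns=["hour", "dayofweek", "month", "quarter"], drop_first=True)

    # Linear trend: hours elapsed since the start of the series.
    out["trend"] = np.arange(len(out))

    # Lagged consumption and temperature: previous hour, previous day, previous week.
    for lag in (1, 24, 168):
        out[f"load_lag_{lag}"] = out["load"].shift(lag)
        out[f"temp_lag_{lag}"] = out["temp_avg"].shift(lag)

    return out.dropna()

features = build_features(df)
```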
We divided the dataset into a training set (the years 2002-2004) and a test set (2005), so that we could better estimate the performance of the models on new data. We then trained and fine-tuned several different models, for example:
After conducting some research, my team and I decided to improve our XGBoost model by including Fourier series components to further support the modeling of daily, weekly, and monthly seasonal effects. We also switched to a 3-fold time-series cross-validation to better estimate the generalization capability of the model. These changes improved the accuracy of the predictions, achieving a MAPE of 4.96%.
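The sketch below illustrates the idea, building on the hypothetical features DataFrame from the previous sketch: Fourier sine/cosine pairs for the daily, weekly, and monthly cycles, a 3-fold TimeSeriesSplit, and an XGBoost regressor. The hyperparameters shown are illustrative, not the exact configuration we used.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

def add_fourier_terms(df, period, order, prefix):
    """Sine/cosine pairs up to the given order for a seasonal cycle of `period` hours."""
    t = np.arange(len(df))
    for k in range(1, order + 1):
        df[f"{prefix}_sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        df[f"{prefix}_cos_{k}"] = np.cos(2 * np.pi * k * t / period)
    return df

features = add_fourier_terms(features, period=24, order=3, prefix="daily")
features = add_fourier_terms(features, period=24 * 7, order=3, prefix="weekly")
features = add_fourier_terms(features, period=24 * 30.44, order=3, prefix="monthly")

X, y = features.drop(columns=["load"]), features["load"]

# 3-fold expanding-window cross-validation that respects the time ordering.
scores = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[val_idx])
    scores.append(mean_absolute_percentage_error(y.iloc[val_idx], preds))

print("Mean cross-validated MAPE:", np.mean(scores))
```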
We used our champion model to generate predictions for the year 2006. For the daily peak load and its timing, we decided to extract those predictions from the hourly forecasts generated by the XGBoost model. The reason behind this choice is that, although we had trained some acceptable ad hoc models, there was a risk of delivering predictions that contradicted the XGBoost forecasts. In our opinion, if the forecasts were to be used by a company, they should be internally consistent while maintaining an acceptable forecast error.
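Deriving the peak predictions from the hourly forecasts is then a simple aggregation. Assuming a hypothetical Series pred_2006 of hourly predictions indexed by timestamp, something along these lines would do:

```python
import pandas as pd

# `pred_2006` would be a pandas Series of hourly forecasts indexed by timestamp.
daily = pred_2006.groupby(pred_2006.index.date)

peak_summary = pd.DataFrame({
    "peak_load": daily.max(),                            # predicted daily peak load
    "peak_hour": daily.apply(lambda s: s.idxmax().hour)  # hour of the day of the peak
})
```

Because both outputs come from the same hourly forecasts, the peak-load and peak-hour predictions are, by construction, consistent with the hourly ones.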