Using Machine Learning to Predict Retail Gasoline Prices

For inflation traders, one of the key risks is energy prices. What makes this more interesting than your typical hedgeable risk is that your actual exposure is to what the BLS says gasoline prices did, not to the observable futures prices. Therefore every good TIPS trader has a model predicting retail gasoline prices.

So that is what we are trying to predict. If the spread between RBOB futures and retail prices were constant, this would be trivial. As you can see, it is not:

It varies from 40c to 120c, and while there is a seasonal pattern to the spread, the noise is quite large.

To tackle this problem, I created a Long Short-Term Memory (LSTM) recurrent neural network.

(I would like to thank Dr. Jason Brownlee and his https://machinelearningmastery.com site. I could not have done this without his tutorials. Much of the code used here was adapted from https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/)

To do this I will use Keras (https://keras.io/) with a TensorFlow backend.

Step 1 — Prepare the machine

Into my Python 3.6 virtual environment, I install: numpy, pandas, scipy, sklearn, tensorflow, and Keras. If you want to visualize the network, you also need to install pydot and graphviz into the venv, as well as the graphviz system package (via apt-get).
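A minimal sketch of that setup (note that sklearn installs via pip under the name scikit-learn; I also add matplotlib here, which the charts later in this post assume):

python3.6 -m venv venv && source venv/bin/activate
pip install numpy pandas scipy scikit-learn tensorflow keras matplotlib
# optional, only needed to visualize the network:
pip install pydot graphviz
sudo apt-get install graphviz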

Step 2 — Prepare the Data

The raw data is a CSV file with three columns: the date, the XB2 (RBOB futures) price, and the average retail gasoline price. That is all the data I am going to use. Since I know the date format, it is easy to create the dataframe:

import pandas as pd
from datetime import datetime

def parse(x):
    return datetime.strptime(x, '%Y-%m-%d')

dataset = pd.read_csv('gasoline.csv', index_col=0, date_parser=parse)

Which creates:

               XB2  Retail
Date                      
2011-12-30  265.74   327.8
2011-12-31  265.74   327.9
2012-01-01  265.74   327.9
2012-01-02  265.74   328.8
2012-01-03  275.34   331.9

Since I want the network to learn (hopefully) any seasonality in the data, I create two more numeric columns, one for the month and one for the day of the month.

# the index is already a DatetimeIndex, so month and day are available directly
datadates = dataset.index
datamonths = pd.Series(datadates.month, index=datadates, name='month')
datadays = pd.Series(datadates.day, index=datadates, name='day')
dataset = datamonths.to_frame().join(datadays).join(dataset)

Which now gives me this four-column dataframe:

             month  day     XB2  Retail
2011-12-30     12    30  265.74   327.8
2011-12-31     12    31  265.74   327.9
2012-01-01      1     1  265.74   327.9
2012-01-02      1     2  265.74   328.8
2012-01-03      1     3  275.34   331.9

I want to take advantage of the LSTM network, so I am going to pass in the prior 12 days of data to predict the current retail gas price. In other words, I am going to have 48 inputs (12 days of 4 columns each) into my model to predict my one output: every variable at t-12 through t-1, predicting Retail at time t.

Using Dr. Jason Brownlee's function series_to_supervised, we create our matrix.
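
For completeness, here is that function as it appears in the linked tutorial, so the snippet below is self-contained:

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a time series as a supervised-learning matrix
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = [], []
    # input sequence (t-n_in, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ..., t+n_out-1)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop the rows made incomplete by the shifting
    if dropnan:
        agg.dropna(inplace=True)
    return agg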

values = dataset.values.astype('float32')
# frame as supervised learning: 12 lags of 4 variables -> 48 inputs plus 4 time-t columns
reframed = series_to_supervised(values, 12, 1)
# drop the time-t columns we don't want to predict (ie month, day, XB2 on day t)
reframed.drop(reframed.columns[[48, 49, 50]], axis=1, inplace=True)

Now we have a matrix that looks like this (just a snippet):

var1(t-12)  var2(t-12)  var3(t-12)  var4(t-12)  var1(t-11)  var2(t-11)  \
12        12.0        30.0  265.739990  327.799988        12.0        31.0   
13        12.0        31.0  265.739990  327.899994         1.0         1.0   
14         1.0         1.0  265.739990  327.899994         1.0         2.0   
15         1.0         2.0  265.739990  328.799988         1.0         3.0   
16         1.0         3.0  275.339996  331.899994         1.0         4.0

The next step is to normalize everything to between 0 and 1. sklearn's MinMaxScaler seems like the right tool for the job:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(reframed)

Now it is time to split our data into train and test sets. We have five years of data, so the model will train on the first two years and test on the last three. We then reshape the data sets into the 3D shape that Keras expects.

# split into train and test sets
values = scaled
n_train_days = 2 * 365
train = values[:n_train_days, :]
test = values[n_train_days:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

If we look at the shapes of our data, we can see that:

print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
(730, 1, 48) (730,) (1363, 1, 48) (1363,)

So we have 730 training samples, each with 48 inputs and one target, and 1363 samples in the testing set. Now we need to create our model. This is based on Dr. Brownlee's Multivariate Time Series Forecasting with LSTMs in Keras model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting the retail price. The input shape will be 1 time step with 48 features.

We will use the Mean Absolute Error (MAE) loss function and the efficient Adam version of stochastic gradient descent.

The model will be fit for 50 training epochs with a batch size of 91.

Finally, we keep track of both the training and test loss during training by setting the validation_data argument in the fit() function. At the end of the run both the training and test loss are plotted.

from keras.models import Sequential
from keras.layers import Dense, LSTM

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=91,
                    validation_data=(test_X, test_y), verbose=2, shuffle=False)

Or, in fancy machine learning speak, here is the network diagram.
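
This is what the pydot and graphviz installs from Step 1 are for; a minimal sketch of generating the diagram:

from keras.utils import plot_model

# render the network architecture to an image (requires pydot and graphviz)
plot_model(model, to_file='model.png', show_shapes=True)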

Running the model, we get the following loss chart:
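
The chart comes straight from the history object that fit() returns (matplotlib assumed installed, as in Step 1):

from matplotlib import pyplot

# plot training vs. validation loss per epoch
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()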


The total RMSE of this model is 5.45, or about five and a half cents. Not too bad for a basic, not highly tuned model. To show the prediction, we basically unscale the results of the model (remember, the model predicts values scaled to [0, 1]):

from math import sqrt
import numpy as np
from sklearn.metrics import mean_squared_error

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast: rebuild the 49-column matrix the scaler was fit on
inv_yhat = np.concatenate((test_X[:, 0:], yhat), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:, -1]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = np.concatenate((test_X[:, 0:], test_y), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:, -1]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))

As we can see, it does a very good job.
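
The comparison plot is a sketch along these lines, reusing the pyplot import from above:

# plot actual vs. predicted retail prices over the test period
pyplot.plot(inv_y, label='actual')
pyplot.plot(inv_yhat, label='predicted')
pyplot.legend()
pyplot.show()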

If we plot them against each other, we see very little bias and some room for improvement (especially in the lower left).
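
A minimal sketch of that scatter, again just matplotlib:

# scatter of actual vs. predicted, with a 45-degree reference line
pyplot.scatter(inv_y, inv_yhat, s=2)
lims = [min(inv_y.min(), inv_yhat.min()), max(inv_y.max(), inv_yhat.max())]
pyplot.plot(lims, lims)
pyplot.xlabel('actual')
pyplot.ylabel('predicted')
pyplot.show()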

As I mentioned, this is a basic model; there are lots of ways to make it better, but it is not bad for a start.
