Using an MLP to Predict CPI
Deep learning is a popular tool in the data scientist's toolkit. While I generally prefer gradient boosted trees, knowing how to use multi-layer perceptron (MLP) models is certainly useful. The following link at Machine Learning Mastery goes into this topic in more detail. Consider this an application of that site's methodology to real-world data.
The first step is getting the CPI data. This should be easy, but it's not: the BLS insists on putting semiannual and annual numbers in the raw time series. FRED is better, but it's difficult to find the right FRED codes for some of the more obscure series. Luckily, Python is here, and its scripting abilities are fantastic.
# first we need to get the data from the BLS website to create our dataframe
# we are going to get core and some level 2 series for another project
import urllib.request

import numpy as np
import pandas as pd

tickers = {'CUSR0000SAF11': 'Food at Home',
           'CUSR0000SEFV': 'Food away from Home',
           'CUSR0000SACE': 'Energy Commodities',
           'CUSR0000SEHF': 'Energy Services',
           'CUSR0000SACL1E': 'Core Goods',
           'CUSR0000SASLE': 'Core Services',
           'CUUR0000SA0L1E': 'Core'}
# monthly periods only; this drops the semiannual and annual points
VALID_PERIODS = ['M{:02d}'.format(m) for m in range(1, 13)]
# get the data from the BLS website
URL = 'https://download.bls.gov/pub/time.series/cu/cu.data.0.Current'
content = urllib.request.urlopen(URL)
# create holder lists
series = []
periods = []
values = []
i = 0
# skip the header line
first_line = True
for line in content:
    if (i % 25000) == 0:
        print('{} rows processed'.format(i))
    i += 1
    if first_line:
        first_line = False
    else:
        # the BLS file is tab-separated
        tokens = line.split(b'\t')
        # the series id is the first field
        series_id = tokens[0].decode('utf-8').strip()
        # now the date stuff
        if series_id in tickers:
            year = tokens[1].decode('utf-8').strip()
            period = tokens[2].decode('utf-8').strip()
            value = float(tokens[3].decode('utf-8').strip())
            # keep only the monthly periods
            if period in VALID_PERIODS:
                month = int(period[1:])
                # create a pandas monthly period
                period = pd.Period('{}-{}'.format(year, month), freq='M')
                series.append(series_id)
                periods.append(period)
                values.append(value)
data = pd.DataFrame({'SERIES': series,
                     'PERIOD': periods,
                     'VALUE': values})
data = data.pivot(index='PERIOD', columns='SERIES', values='VALUE')
# just to make things pretty
data.columns = [tickers[x] for x in data.columns]
When run, it processes the raw CPI data and creates a nice pandas DataFrame with the period as the index and the different indices as columns. There are a little more than 125,000 rows to process, so give it a little time to run through all the data.
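The long-to-wide pivot at the end of that script is the key reshaping step; here is a minimal sketch of what it does, using a toy long-format frame with made-up series names and values rather than the real BLS rows:

```python
import pandas as pd

# toy long-format rows shaped like the ones collected above (values are made up)
long_df = pd.DataFrame({
    'SERIES': ['A', 'A', 'B', 'B'],
    'PERIOD': [pd.Period('2015-01', freq='M'), pd.Period('2015-02', freq='M'),
               pd.Period('2015-01', freq='M'), pd.Period('2015-02', freq='M')],
    'VALUE': [100.0, 100.5, 200.0, 201.0],
})

# one row per period, one column per series
wide = long_df.pivot(index='PERIOD', columns='SERIES', values='VALUE')
print(wide.shape)  # (2, 2)
```

Each unique PERIOD becomes an index entry and each unique SERIES becomes a column, which is exactly the layout the modeling code below expects.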
The next step is to transform the data into the X and y data sets for the model to use. The key here is that we are going to use the prior 36 months of data to predict the next twelve. Machine Learning Mastery had the basic script to do this in a programmatic way.
def split_sequence(sequence, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the sequence
        if out_end_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
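The windowing is easy to see on a toy sequence; this self-contained check repeats the helper so the snippet runs on its own, with illustrative numbers rather than CPI data:

```python
import numpy as np

def split_sequence(sequence, n_steps_in, n_steps_out):
    # same helper as above: slide a window of n_steps_in inputs
    # followed by n_steps_out targets along the sequence
    X, y = list(), list()
    for i in range(len(sequence)):
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        if out_end_ix > len(sequence):
            break
        X.append(sequence[i:end_ix])
        y.append(sequence[end_ix:out_end_ix])
    return np.array(X), np.array(y)

X_demo, y_demo = split_sequence(list(range(1, 11)), n_steps_in=3, n_steps_out=2)
print(X_demo[0], y_demo[0])          # [1 2 3] [4 5]
print(X_demo.shape, y_demo.shape)    # (6, 3) (6, 2)
```

A sequence of length 10 yields 6 samples, since each sample consumes 3 input points plus 2 target points and the window slides one step at a time.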
Now, because I am using real data, it has to be split into train and test sets. I train on data through 2014; the test sequence starts in 2014 so that, once the 36-month input window is consumed, the first out-of-sample forecasts begin in 2017.
raw_seq = data.Core[:'2014'].values
test_seq = data.Core['2014':].values
# choose a number of time steps
n_steps_in, n_steps_out = 36, 12
# split into samples
X, y = split_sequence(raw_seq, n_steps_in, n_steps_out)
test_X, test_y = split_sequence(test_seq, n_steps_in, n_steps_out)
print(test_X[-1], test_y[-1])
[239.413 239.248 238.775 239.248 240.083 241.067 241.802 242.119 242.354
242.436 242.651 243.359 243.985 244.075 243.779 244.528 245.68 246.358
246.992 247.544 247.794 247.744 248.278 248.731 249.218 249.227 249.134
250.083 251.143 251.29 251.642 251.835 252.014 251.936 252.46 252.941] [253.638 253.492 253.558 254.638 255.783 256.61 257.025 257.469 257.697
257.867 258.012 258.429]
So now we have as our input 36-month-long windows of core CPI index values, used to try and predict the next 12. Using Keras as the front end to TensorFlow, it is easy to create our model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(1024, activation='relu', input_dim=n_steps_in))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X, y, epochs=2000, verbose=0)
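Once fitted, the forecasts from model.predict(test_X) can be scored against test_y. A minimal sketch of the scoring step, using mean absolute error on small stand-in arrays (the numbers here are made up for illustration, not model output):

```python
import numpy as np

# stand-in arrays; in the post these would be model.predict(test_X) and test_y
pred = np.array([[253.0, 253.6, 254.1],
                 [254.2, 254.9, 255.5]])
actual = np.array([[253.6, 253.5, 253.6],
                   [254.6, 255.8, 256.6]])

# mean absolute error for each forecast horizon, and overall
mae_per_step = np.abs(pred - actual).mean(axis=0)
overall_mae = np.abs(pred - actual).mean()
print(mae_per_step)   # [0.5 0.5 0.8]
print(overall_mae)    # 0.6
```

Scoring per horizon is useful here because errors typically grow with the forecast distance, and a single aggregate number can hide that.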
So how did the model do? Well, here are the most recent twelve months, predicted and actual. Remember, it made these predictions with the data up to September 2017.
Not too bad. This is a trivially simple model for the task; going forward we will use multiple series as the input, not just prior core values. Still, it looks like MLP models are useful for modeling CPI.
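As a preview of that multivariate setup, one simple approach is to window each series separately and stack the input windows side by side as features, while still predicting core. A sketch with made-up arrays; the window sizes and series here are illustrative, not the final design:

```python
import numpy as np

n_steps_in, n_steps_out = 3, 2  # small for illustration; the post uses 36 and 12

# two made-up monthly series standing in for, say, Core and Core Goods
core = np.arange(10, dtype=float)
goods = np.arange(100, 110, dtype=float)

def windows(seq, n_in, n_out):
    # same sliding-window idea as split_sequence above
    X, y = [], []
    for i in range(len(seq) - n_in - n_out + 1):
        X.append(seq[i:i + n_in])
        y.append(seq[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

X_core, y_core = windows(core, n_steps_in, n_steps_out)
X_goods, _ = windows(goods, n_steps_in, n_steps_out)

# stack the two input windows side by side; the target is still core
X_multi = np.hstack([X_core, X_goods])
print(X_multi.shape, y_core.shape)  # (6, 6) (6, 2)
```

The input dimension of the first Dense layer would then be n_steps_in times the number of series, with everything else unchanged.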