Coding Time Series Forecasting

Time Series Forecasting Introduction

Time series forecasting is a critical but unique type of machine learning. Machine learning engineers are used to working with large sets of data to make predictions, but what makes a time series dataset different is that the data also depends on the time between observations [1]. This extra time dimension adds key information and structure to the data. There are two main applications of working with time series data: trying to understand the dataset, called time series analysis, and using the dataset to make predictions, called time series forecasting [1]. For this week’s machine learning model, we’ll be focusing on the latter. Since time series data is usually modeled as a stochastic process, Y(t), forecasting focuses on estimating Y(t + h) using information available at time t [2]. This involves fitting a model on historical data, then using the trained model to predict what will happen next [1].

To better understand time series forecasting, it’s important to know the components that make up time series data. All time series have a level, which serves as the baseline value if the series were a straight line. The trend of the series is its increasing or decreasing behavior over time, which is often roughly linear [1]. Trend can be thought of as the line of best fit for the dataset, giving insight into whether that line has a positive or negative slope. Sometimes the trend is easy to spot, but other times it may be nonexistent. When the mean and variance do not change over time and the covariance between observations depends only on the lag between them rather than on time itself, the series is said to be stationary [3]. Time series can also have seasonality: periodic fluctuations or cycles in the data. One example is electricity consumption, which is typically high during the day and low during the night [3]. These repetitive highs and lows may not be as straightforward as a linear trend line, but they can provide vital information for making an accurate prediction. Finally, like other types of datasets, time series data typically is not perfect, and noise can be expected [1]. While you will see below that we try to minimize noise through preprocessing, we can still reasonably expect some variation that cannot be fully explained by the model.

Graph displaying data trends, cycles, and noise
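
As a minimal sketch of these components, the snippet below builds a synthetic series from a level, a linear trend, a seasonal cycle, and random noise. All of the values and names here are illustrative, not taken from the project data.

import numpy as np

t = np.arange(730)                                  # two years of daily observations
level = 100.0                                       # baseline value of the series
trend = 0.05 * t                                    # slow, roughly linear increase
seasonality = 10.0 * np.sin(2 * np.pi * t / 365.0)  # yearly cycle of highs and lows
noise = np.random.normal(scale=2.0, size=t.shape)   # variation the model cannot explain

series = level + trend + seasonality + noise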

There are many different types of models for time series forecasting. The naive approach to time series modelling is the moving average, which forecasts the next observation as the mean of past observations. While this approach is simple, it is helpful for identifying trends and can be applied to smooth the time series [3]. A similar approach, exponential smoothing, uses a weighted moving average, where less weight is placed on observations far from the present and greater weight is applied to more recent observations [3]. These basic ideas can be seen in more complicated models, such as AutoRegressive Integrated Moving Average (ARIMA), Seasonal ARIMA (SARIMA), Generalised AutoRegressive Conditional Heteroskedasticity (GARCH), Neural Network AutoRegression (NNETAR), and more [2]. The model used here works with Long Short-Term Memory (LSTM), a Recurrent Neural Network (RNN) layer [4].
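
Both baselines take only a few lines with pandas; the window size and smoothing factor below are arbitrary choices for illustration, not values from this project.

import numpy as np
import pandas as pd

# Toy data: any one-dimensional series works here
prices = pd.Series(100 + np.cumsum(np.random.normal(size=500)))

# Moving average: smooth the series with the mean of a trailing window
moving_avg = prices.rolling(window=30).mean()

# Exponential smoothing: a weighted moving average whose weights
# decay geometrically for older observations
exp_smooth = prices.ewm(alpha=0.1).mean()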

Recurrent Neural Networks (RNNs) are able to feed the output of a previous step into the current step as input, through hidden layers and hidden states that remember information about a sequence [5]. This “memory” can help make predictions because it gives insight into patterns and trends found in the sequence over time. At each step, the previous hidden state is concatenated with the inputs, which allows memory to be passed along. This concatenation is then passed through a tanh layer, which keeps the resulting values between -1 and 1. The output is then passed in as the hidden state to the next recursive call of the hidden layer [6].
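
Here is a minimal numpy sketch of one step of a vanilla RNN; the sizes and randomly initialized weights are illustrative, not from the project code.

import numpy as np

hidden_size, input_size = 8, 4

# Randomly initialized parameters, for illustration only
W = np.random.randn(hidden_size, hidden_size + input_size) * 0.1
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    # Concatenate the previous hidden state with the current input,
    # then squash through tanh to keep values between -1 and 1
    concat = np.concatenate([h_prev, x_t])
    return np.tanh(W @ concat + b)

h = np.zeros(hidden_size)
for x_t in np.random.randn(10, input_size):  # a toy sequence of 10 steps
    h = rnn_step(h, x_t)                     # the output becomes the next hidden state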

Preprocessing Data

To start any good machine learning model, one should first preprocess the data, including cleaning any extraneous values and normalizing the data. The data used for this project is the Coinbase data that can be found here, loaded with Pandas using the read_csv() method. Rows with missing values can then be dropped to eliminate incomplete data. Since the early days of Bitcoin had inconsistencies in data collection, removing rows with missing values also places a heavier emphasis on more recent data. The data is then split into three sections: the first 70% is used for training, the next 20% is used as validation data to check that the model is not overfitting to the training data, and the last 10% is reserved for testing [4]. The data can then be normalized by subtracting the mean and dividing by the standard deviation. These statistics are calculated on the training dataset and then applied to all three splits. The training set is used both because it is the largest representation of the dataset and so that the model never has access to values from the validation and test sets [4]. The training data, validation data, and testing data are then returned from the function, ready to be used in our time series forecasting program.

Code for preprocessing data as described above
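
The repository has the full version; below is a minimal sketch of the same steps, assuming the columns are all numeric after loading (the file path here is a placeholder).

import pandas as pd

def preprocess(csv_path):
    # Load the raw data and drop rows with missing values
    df = pd.read_csv(csv_path).dropna()

    # Split chronologically: first 70% train, next 20% validation, last 10% test
    n = len(df)
    train_df = df[: int(n * 0.7)]
    val_df = df[int(n * 0.7) : int(n * 0.9)]
    test_df = df[int(n * 0.9) :]

    # Normalize every split with statistics from the training set only,
    # so the model never sees validation or test values
    train_mean, train_std = train_df.mean(), train_df.std()
    train_df = (train_df - train_mean) / train_std
    val_df = (val_df - train_mean) / train_std
    test_df = (test_df - train_mean) / train_std

    return train_df, val_df, test_df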

Creating Windows and tf.data.Dataset as Model Inputs

As mentioned in the introduction above, the simplest models for working with time series identify patterns and trends in subsections of the data over time. These consecutive subsections not only show what stays the same from one subset to the next, which can indicate seasonality, but also how values change from subsection to subsection, which can reveal overall trends. One way to think of these consecutive samples is that each subsection of data fits inside a window of data.

Two windows

To start building the model, we will first introduce a WindowGenerator class. This class slices the data to fit within a given window width, including keeping track of any column labels corresponding to each window of data. One method, split_window, converts a list of consecutive observations into a window of inputs and a window of labels. However, what may be even more important are the properties in this class that expose the train, val, and test splits as tf.data.Dataset objects.
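
A sketch of the split_window idea, following the pattern from the TensorFlow tutorial [4]; the window sizes here are illustrative.

input_width, label_width, shift = 24, 1, 1
total_window_size = input_width + shift

def split_window(features):
    # features is a batch of windows with shape (batch, total_window_size, num_features);
    # the first input_width steps become inputs, the last label_width steps become labels
    inputs = features[:, :input_width, :]
    labels = features[:, total_window_size - label_width:, :]
    return inputs, labels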

The benefit of tf.data.Dataset is that it packages (input_window, label_window) pairs and makes it easy to feed the data to our model for time series forecasting. This is achieved through the preprocessing.timeseries_dataset_from_array function [4]. This function converts an array into batches of time series inputs and targets by applying the concept of a sliding window that divides the data into subsections that can be tracked through the model [7].

Section of code that shows conversion into tf.data.Dataset and setting properties for train, val, and test
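
Building on split_window and total_window_size from the sketch above, a make_dataset helper in the tutorial’s style [4] might look like this (the batch size and shuffling are illustrative defaults, not the project’s exact settings):

import numpy as np
import tensorflow as tf

def make_dataset(data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=data,
        targets=None,                       # labels are sliced out by split_window instead
        sequence_length=total_window_size,  # one full window per sample
        sequence_stride=1,                  # slide the window one step at a time
        shuffle=True,
        batch_size=32)
    return ds.map(split_window)             # yields (input_window, label_window) pairs

# The train, val, and test properties each call make_dataset on their split,
# e.g. train_ds = make_dataset(train_df)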

Long Short-Term Memory (LSTM) Model Architecture

This model uses an RNN layer called Long Short-Term Memory (LSTM). As introduced above, an RNN applies a recursive hidden layer that takes in both the inputs and the previous hidden state, which helps to preserve information about sequences. Yet LSTM goes beyond a simple RNN: LSTMs are intentionally designed to handle long-term dependencies. Instead of a single tanh layer, each recursive call in an LSTM actually consists of four interacting layers [6].

The first layer acts as a "forget gate." It applies a sigmoid function to the concatenated previous hidden state and current input, where an output of 1 represents "completely keep" and an output of 0 means "completely forget." Since a sigmoid function outputs results between 0 and 1, this gate acts like a scale of how much of the previous cell state should be retained. In other words, it indicates how much past information will affect this pass of the hidden layer. Next is another sigmoid function that acts as the "input gate layer" by determining which values to update. In combination with the input gate layer is a tanh layer that creates a vector of new candidate values that could be added to the cell state. Together these layers update the old cell state into a new cell state: the old state is multiplied by how much we want to "forget" its values, and the vector of new candidate values, scaled by how much we decided those candidates matter, is added. The last step is the output gate, which applies a sigmoid function to the same concatenated hidden state and input, then multiplies the result by the new cell state run through a tanh function [6].

This approach keeps selected parts of the hidden state through a cell state vector, which helps us "forget" the parts we don't need while keeping vital dependency clues long-term. LSTMs are considered more complex models, but they are beneficial when working with a large amount of data [2].
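
A minimal numpy sketch of a single LSTM step, mirroring the four gates described above; the sizes and randomly initialized weights are illustrative, not from the project code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 8, 4
concat_size = hidden_size + input_size

# Randomly initialized weights for illustration; a trained LSTM learns these
W_f, W_i, W_c, W_o = (np.random.randn(hidden_size, concat_size) * 0.1 for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])  # previous hidden state + current input
    f = sigmoid(W_f @ z + b_f)         # forget gate: 1 = completely keep, 0 = completely forget
    i = sigmoid(W_i @ z + b_i)         # input gate: which values to update
    c_tilde = np.tanh(W_c @ z + b_c)   # candidate values for the cell state
    c = f * c_prev + i * c_tilde       # new cell state: keep some old, add some new
    o = sigmoid(W_o @ z + b_o)         # output gate
    h = o * np.tanh(c)                 # new hidden state
    return h, c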

An LSTM model was chosen for this project for two main reasons. First, the given Bitcoin dataset is large, and LSTM is a good model to choose when making forecasts for a large number of time series [2]. Second, LSTM is well-documented, which made this model beginner-friendly for establishing a firm foundation in time series forecasting [4].
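
In Keras, the forecasting model itself stays short. The sketch below follows the single-step pattern from the TensorFlow tutorial this project is based on [4]; the unit count and epoch count are placeholders rather than the project’s exact settings.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True),  # LSTM layer emits an output per time step
    tf.keras.layers.Dense(units=1),                   # one predicted value per time step
])

model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=[tf.keras.metrics.MeanAbsoluteError()])

# train_ds and val_ds are the tf.data.Dataset splits built by the WindowGenerator
# model.fit(train_ds, epochs=20, validation_data=val_ds)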

Conclusion

Overall, writing a model for time series forecasting has shown me the power these models can have. Time series datasets go well beyond predicting Bitcoin values. They can be used to determine when big purchases will be at their lowest prices, predict radioactive decay of elements at a finer level, forecast the weather, estimate manufacturing outputs, and much more. It is crucial that we create and train these models with good practices, such as preprocessing the data, so that they are trained to more accurately help us plan for the future. I am excited to continue exploring time series forecasting applications in my own future and to see how models may be implemented behind the scenes in helpful predictors for everyday use, such as integrating time series forecasting models into IoT products to optimize energy efficiency and costs. I also think it would be cool to check out other types of RNN models and compare their efficiency to this LSTM model, to gain better intuition for choosing models when working with larger time series datasets.

To check out more of the code discussed here, please visit my GitHub repository.

References

[1] https://machinelearningmastery.com/time-series-forecasting/

[2] https://towardsdatascience.com/an-overview-of-time-series-forecasting-models-a2fa7a358fcb

[3] https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775

[4] https://www.tensorflow.org/tutorials/structured_data/time_series

[5] https://www.geeksforgeeks.org/introduction-to-recurrent-neural-network/

[6] http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[7] https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/timeseries_dataset_from_array

All images are my own or are free stock photos available for use.
