Prediction Accuracy Optimization Techniques
Prediction accuracy is the key goal for many data science and forecasting applications, and optimization techniques can be applied to improve it. No prediction problem is quite the same as another; however, some overriding concepts can be applied with the aim of minimizing the error between predicted and actual results. The subject matter around model fitting for prediction methods is vast, and this is an overview of some useful concepts to consider.
Data classification and clustering
Data classification can help to break down tasks in predictive analytics while also providing valuable information on data subset metrics and their interrelationships. It is simplistic to view the data as one object with a dependent variable to be predicted as a function of multiple independent variables. Approaching predictive analytics this way is worthwhile but ignores the potential value in data subset characteristics that can be revealed by slicing the data through classification.
Classification is supervised learning, since attributes of the data are already known for categorizing data variable subsets. A simple and common example is sales data by date with linked customer, product and geographical variables.
We can easily classify the data into subsets by gender, age and address town, as well as at aggregated geographical levels. Continuous numerical data can also be classified into groups through factorization, such as binning age into age groups.
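As a minimal sketch in R, assuming a hypothetical transactional sales data frame (all names and values here are illustrative), cut() can factorize continuous age into age groups, and subset metrics can then be summarized per classification:

```r
# Hypothetical transactional sales data (all values simulated)
set.seed(42)
sales <- data.frame(
  quantity = rpois(200, lambda = 5),
  gender   = sample(c("M", "F"), 200, replace = TRUE),
  age      = sample(18:75, 200, replace = TRUE),
  town     = sample(c("Northtown", "Southtown", "Easttown"), 200, replace = TRUE)
)

# Factorize continuous age into age-group classes
sales$age_group <- cut(sales$age,
                       breaks = c(17, 30, 45, 60, 76),
                       labels = c("18-30", "31-45", "46-60", "61-75"))

# Summarize a subset metric for each gender/age-group classification
aggregate(quantity ~ gender + age_group, data = sales, FUN = mean)
```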
When we are unable to classify data explicitly, unsupervised learning can be undertaken via data clustering. Clustering of variables into groups can be undertaken through machine learning automation such as neural network analysis. This approach will test interrelationships and define data subsets with common characteristics.
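Where explicit classes are unavailable, a compact unsupervised sketch can define subsets with common characteristics; k-means is used below purely as a simple stand-in for more advanced machine learning clustering, continuing the hypothetical data above:

```r
# Cluster on scaled numeric attributes of the same hypothetical data
features <- scale(sales[, c("age", "quantity")])
set.seed(7)
clusters <- kmeans(features, centers = 3, nstart = 25)

# Attach the learned cluster as a new categorical subset
sales$cluster <- factor(clusters$cluster)
table(sales$cluster)
```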
Predictions can be made on each of the defined classifications and clusters to estimate an outcome for each of the subsets. In the example, we can have separate predictions by age group and gender which can then be aggregated and weighted by expected demographic dynamics in the forecast. The categorical predictions also provide useful analysis for marketing, resource allocation and other management decision making.
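Continuing the sketch, per-subset predictions can then be rolled up into one forecast; the weights below are placeholders standing in for expected demographic dynamics:

```r
# Naive per-subset prediction: mean quantity per gender/age-group subset
subset_pred <- aggregate(quantity ~ gender + age_group, data = sales, FUN = mean)

# Placeholder weights standing in for expected demographic dynamics
set.seed(11)
subset_pred$weight <- runif(nrow(subset_pred))
subset_pred$weight <- subset_pred$weight / sum(subset_pred$weight)

# Roll the categorical predictions up into a single weighted forecast
weighted.mean(subset_pred$quantity, subset_pred$weight)
```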
Establishing new variables
Additional relevant variables, where they can be identified and cleanly sourced, can help improve the overall accuracy of predictions as their explanatory trend dynamics are integrated. Logical cause-and-effect reasoning can be used to seek out data that may conceivably impact any of the existing variables and their categories. In the example, we may want to source external data at demographic category levels, such as average income by age bracket, gender and geographical region.
In addition to establishing potential new variables from external sources, variables can be created from already sourced or internal data. Ratios, indices and aggregates built from existing variables can serve to simplify dynamics and help predictive models find meaningful relationships. Automated clustering will accomplish this to a degree, but often human reasoning can define more logical variable mashups. Following the example, one could create age aggregates for young, middle-aged and old customers for each gender and build ratios of these against sales quantity and disposable income.
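A sketch of variable creation, assuming hypothetical external income data joined to the earlier sales data at the age-group and gender level:

```r
# Hypothetical external income data at the age-group/gender level
demo <- data.frame(
  age_group         = rep(c("18-30", "31-45", "46-60", "61-75"), each = 2),
  gender            = rep(c("M", "F"), times = 4),
  disposable_income = c(21, 25, 34, 31, 38, 36, 27, 26) * 1000
)
sales <- merge(sales, demo, by = c("age_group", "gender"))

# Derived variables: a ratio and a simple index against the overall mean
sales$qty_per_income <- sales$quantity / sales$disposable_income
sales$income_index   <- sales$disposable_income / mean(demo$disposable_income)
head(sales[, c("quantity", "disposable_income", "qty_per_income", "income_index")])
```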
Time lag optimization
One of the major hurdles for any prediction or forecasting model is the lack of forward out-of-sample estimations for the independent variables. One can establish a predictive equation from time-matched data up to the current date but, without any sense of how the explanatory variables will pan out in the future, the analysis is of limited value. To optimize for accuracy, variables need to be separated into those that can be controlled internally or confidently estimated, and leading indicators that have a lag effect on the dependent variable.
Internal variables that can be controlled can also serve for effective scenario analysis and, in the example, if marketing spend is found to have a high correlation with sales growth then we can use it effectively in the predictive model. If external variables have reliable future estimates then they can also be used with a time lag for forecasting. The reliability of third-party estimates can only truly be established from a track record of historical estimation accuracy.
For leading indicators, a common approach is to offset the variables by one time period in order to provide one real observation for the prediction period. This method ensures that the most recent data possible is used for training the prediction model but may not reflect the optimal lag time.
In many cases, the optimal lag effect of leading indicators varies across independent variables and is not a 'one size fits all' scenario.
In the example, it may be that disposable income has a 3 period optimal lag before affecting sales volume while marketing expenditure has a 1 period lag. The optimal lag time for each variable can be identified by independent and combination trend analysis with each variable then offset accordingly to improve prediction accuracy.
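A sketch of per-variable lag optimization on hypothetical monthly series, using cross-correlation (ccf()) to suggest each variable's lead time before offsetting it in the model; the series below are simulated so that marketing leads by 1 period and income by 3:

```r
# Hypothetical monthly series; sales is built so that marketing leads
# by 1 period and disposable income by 3 periods
set.seed(1)
n         <- 60
marketing <- rnorm(n, 100, 10)
income    <- rnorm(n, 500, 25)
lag_k     <- function(x, k) c(rep(NA, k), head(x, -k))  # offset by k periods
sales_vol <- 50 + 0.8 * lag_k(marketing, 1) + 0.3 * lag_k(income, 3) + rnorm(n, 0, 5)

# Cross-correlation per variable: the lag with the largest correlation
# against sales suggests that variable's optimal offset
ok <- !is.na(sales_vol)
ccf(marketing[ok], sales_vol[ok], lag.max = 6, plot = FALSE)
ccf(income[ok],    sales_vol[ok], lag.max = 6, plot = FALSE)

# Offset each variable by its own identified lag before fitting
fit <- lm(sales_vol ~ lag_k(marketing, 1) + lag_k(income, 3))
summary(fit)$r.squared
```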
Feature selection
Feature selection is the process of selecting variables for the prediction model in order to increase accuracy and reduce noise. Simplifying the model with feature selection is important because the removal of noise improves the confidence range for the forecast results and can also decrease processing time due to a lower number of input variables.
For linear models, a good start is to remove multicollinearity using a variance inflation factor (VIF) filter. In feature selection, it is important to analyze groups of variables and not simply each one independently, since it is the combination of variables that will have the most influence on the outcome accuracy. The classification and clustering methods can help here, along with stepwise variable elimination testing.
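A minimal sketch of a VIF filter followed by stepwise elimination, assuming the car package for vif() and synthetic data with deliberate collinearity:

```r
library(car)  # provides vif()

# Synthetic predictors with deliberate multicollinearity
set.seed(3)
d    <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)  # near-duplicate of x1
d$y  <- 2 * d$x1 + d$x3 + rnorm(100)

fit <- lm(y ~ x1 + x2 + x3, data = d)
vif(fit)  # values well above ~5-10 flag collinear variables

# After dropping the collinear variable, test combinations via stepwise AIC
fit2 <- step(lm(y ~ x1 + x3, data = d), direction = "both", trace = 0)
summary(fit2)
```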
For non-linear models, feature selection methods depend on the model itself. In many cases, tuning routines exist to undertake feature selection for specific tree-based or machine-learning prediction algorithms. These routines should be run for each algorithm used in the overall prediction process.
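One such routine is recursive feature elimination in the caret package; a sketch with a random forest and synthetic data:

```r
library(caret)
library(randomForest)

# Recursive feature elimination with a random forest and 5-fold CV
set.seed(5)
X <- data.frame(matrix(rnorm(100 * 6), ncol = 6))  # columns X1..X6
y <- 2 * X$X1 - X$X3 + rnorm(100)

ctrl   <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
result <- rfe(X, y, sizes = c(2, 3, 4), rfeControl = ctrl)
predictors(result)  # the retained variable subset
```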
Multiple model ensembles
Diversification is a key concept in business strategy and the same applies for predictive modeling. Complex data is likely to contain both linear and non-linear relationships and multiple approaches can assist to isolate each in an optimal manner to boost predictive strength.
Popular prediction models range from simple multiple linear regression to advanced machine learning algorithms, including random forests, generalized boosting models and other non-linear models. These are often combined into an ensemble prediction model by weighting each model by the inverse of its fitted error margin. This, however, is limited to the in-sample training data, and a more robust approach is to test the predictive accuracy of each model out of sample. In this approach, the last few observations are left out of the model building and each model is then tested on the out-of-sample walk-forward data. More advanced approaches remove multiple random sections of data from each model to fully test out-of-sample accuracy and use these results to weight each model's predictions.
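A minimal sketch of inverse-error weighting on a walk-forward holdout, using hypothetical data with a linear model alongside a random forest (randomForest package):

```r
library(randomForest)

# Hypothetical data with mixed linear and non-linear structure
set.seed(9)
n  <- 120
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1.5 * df$x1 + sin(df$x2) + rnorm(n, 0, 0.3)

# Hold out the last observations as walk-forward test data
train <- df[1:100, ]
test  <- df[101:n, ]

m_lm <- lm(y ~ x1 + x2, data = train)
m_rf <- randomForest(y ~ x1 + x2, data = train)
p_lm <- predict(m_lm, test)
p_rf <- predict(m_rf, test)

# Weight each model by the inverse of its out-of-sample RMSE
rmse <- function(a, p) sqrt(mean((a - p)^2))
w <- 1 / c(lm = rmse(test$y, p_lm), rf = rmse(test$y, p_rf))
w <- w / sum(w)
ensemble <- w["lm"] * p_lm + w["rf"] * p_rf
rmse(test$y, ensemble)
```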
Multiple models can also be used for subsets of the data as characteristics of one subset may be better suited to a particular model. The classification and clustering work can be used here to identify optimal models for each subset and an R package is designed for this approach.
Cross validation feedback loop
Prediction models need to adapt to changing dynamics and checking actual outcomes against those that were predicted should be a part of the ongoing modeling process. The feedback loop can help identify which parts of the model are working well and which parts are failing to deliver on accuracy so that modifications can be made to optimize the accuracy.
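A simple sketch of the feedback loop: log each period's prediction, compare it against the actual once observed, and watch a rolling error metric for degradation (all figures below are hypothetical):

```r
# Hypothetical log of predictions made each period and actuals observed later
tracker <- data.frame(
  period    = 1:12,
  predicted = c(100, 105, 98, 110, 115, 120, 118, 125, 130, 128, 135, 140),
  actual    = c(102, 101, 99, 115, 112, 119, 125, 124, 133, 126, 150, 155)
)

# Absolute percentage error per period and a trailing 3-period average
tracker$ape <- abs(tracker$actual - tracker$predicted) / tracker$actual
tracker$rolling_mape <- as.numeric(stats::filter(tracker$ape, rep(1/3, 3), sides = 1))

# A rising rolling MAPE flags where the model is failing on accuracy
tracker
```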
Often predictive analytics optimization focuses solely on the modeling and cross validation of actual results; however, the feedback loop should accommodate all aspects of the process. This includes the data collection and variable creation stages, as it may be identified that a particular component is lacking information or that another is particularly strong, warranting the investigation of similar data availability. In the example, it might be that disposable income by region is a particularly strong predictive variable for sales volume, and so the same data might be worth collecting at another demographic level such as gender.
It is also worthwhile to think outside of the current system to what other data could be collected to boost the model accuracy. Perhaps metrics on customer behavior in the store or on the web site can start to be collected for future introduction into the model.
These techniques are by no means exhaustive, as there is a wealth of academic and online literature on improving prediction accuracy. The concepts above are an overview of some of the most important factors to consider when optimizing accuracy in predictive analytics.