The Dark Side of Predictive Analytics

Predictive modeling is great. By predicting future events and behaviors, companies can increase their competitive edge and managers can make better-informed decisions. Yet when too much faith is placed in these models and they are left unchecked, the result can be grave unintended consequences.

How can machines running statistical models go bad?

Meet the Feedback Loop

A feedback loop in modeling is where the results of the model are somehow fed back into the model (sometimes intentionally, other times not). One simple example might be an ad placement model.

Imagine you built a model that determines where on a page to place an ad based on the webpage visitor. When a visitor in group A sees an ad in the left margin and clicks on it, that click is fed back into the model, meaning left-margin placement will carry more weight the next time a group A visitor comes to your page.

This is good and, in this case, intentional. The model is constantly retraining itself using a feedback loop.
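The intentional loop described above can be sketched as a simple epsilon-greedy scheme. This is a minimal illustration, not the author's actual model; the placements, counts, and epsilon value are all invented for the example.

```python
import random

placements = ["left", "right", "top"]
clicks = {p: 1 for p in placements}   # observed clicks per placement (toy priors)
shows = {p: 2 for p in placements}    # times each placement was shown

def choose_placement(epsilon=0.1):
    """Mostly exploit the best click-through rate; occasionally explore."""
    if random.random() < epsilon:
        return random.choice(placements)
    return max(placements, key=lambda p: clicks[p] / shows[p])

def record_outcome(placement, clicked):
    """Feed the result back into the model -- this is the feedback loop."""
    shows[placement] += 1
    if clicked:
        clicks[placement] += 1

# Each click on a left-margin ad raises the weight of "left", so the
# next visitor is even more likely to see a left-margin ad.
for _ in range(50):
    record_outcome("left", clicked=True)

print(choose_placement(epsilon=0.0))  # prints "left"
```

With exploration turned off (`epsilon=0.0`), the loop has fully converged on the placement it reinforced, which is exactly the self-training behavior the paragraph describes.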

When feedback loops go bad...

Gaming the system. 

Build a better mousetrap... the mice get smarter.

Imagine a predictive model developed to determine entrance into a university. Let's say when you initially built the model, you discovered that students who took German in high school seemed to be better students overall. Now as we all know, correlation is not causation. Perhaps this was just a blip in your data set, or maybe it was just the language most commonly offered at the better high schools. The truth is, you don't actually know. 

How can this be a problem?

Competition to get into universities (especially highly sought-after universities) is fierce, to say the least. There are entire industries designed to help students get past the admissions process. These industries use any insider knowledge they can glean and may even try to reverse engineer the admissions algorithm.

The result: a feedback loop.

These advisers will learn that taking German greatly increases a student's chance of admission at this imaginary university. Soon they will be advising prospective students (and their parents) who would otherwise have little chance of acceptance to sign up for German classes. Now you have a group of students who may no longer be the best fit making their way past your model.
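The gaming scenario above can be made concrete with a toy linear admissions score. Everything here is hypothetical: the weights, the features, and the 0.5 admission threshold are invented purely to show how flipping one spurious feature changes the outcome.

```python
# Hypothetical admissions model: a weighted sum of normalized features.
WEIGHTS = {"gpa": 0.6, "test_score": 0.5, "took_german": 0.4}
THRESHOLD = 0.5  # admit if the score clears this invented cutoff

def admit_score(applicant):
    """Toy linear score over the applicant's features."""
    return sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)

weak_applicant = {"gpa": 0.4, "test_score": 0.5, "took_german": 0}
print(admit_score(weak_applicant) >= THRESHOLD)  # prints False: rejected

# An adviser who has reverse engineered the model simply flips the
# spurious feature; nothing about the student's actual fit has changed.
weak_applicant["took_german"] = 1
print(admit_score(weak_applicant) >= THRESHOLD)  # prints True: admitted
```

The point of the sketch is that a correlational feature with real weight becomes a lever: anyone who learns the weight can pull it, and the model cannot tell a genuine signal from a coached one.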

While ad clicks or college admissions are one thing, policing and criminal-sentencing algorithms run the risk of being much more harmful.

Left unchecked, the feedback loop of a predictive criminal activity model in any large city in the United States will almost always teach the computer to emulate the worst of human behavior - racism, sexism, and class discrimination. 

Since minority males from poor neighborhoods disproportionately make up our current prison population, any model that takes race, sex, and economic status into account will inevitably determine that a 19-year-old black male from a poor neighborhood is a criminal. We will then have violated a basic tenet of our justice system: the presumption of innocence until proven guilty.

What to do?

Feedback loops can be tough to anticipate, so one way to guard against them is to retrain your model periodically. I even suggest retooling the model: removing some factors to determine whether a rogue factor (German class, for instance) is carrying too much weight.
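The retooling suggestion above amounts to an ablation check: retrain with each factor removed and see how much held-out performance actually drops. This is a sketch under stated assumptions; `train_and_score` stands in for your own training pipeline, and the toy numbers are invented to mimic a near-useless "took_german" feature.

```python
def ablation_report(features, train_and_score):
    """For each factor, report how much the score drops when it is removed."""
    baseline = train_and_score(features)
    report = {}
    for f in features:
        reduced = [g for g in features if g != f]
        report[f] = baseline - train_and_score(reduced)
    return report

# Hypothetical scoring function: pretend "took_german" adds almost
# nothing once the genuinely predictive factors are present.
def toy_score(features):
    signal = {"gpa": 0.30, "test_score": 0.25, "took_german": 0.01}
    return 0.5 + sum(signal[f] for f in features)

drops = ablation_report(["gpa", "test_score", "took_german"], toy_score)
print(drops)
```

A near-zero drop for a factor suggests the model could live without it; a factor whose removal barely hurts accuracy but heavily shifts who gets admitted is a strong candidate for the rogue feature the article warns about.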

And always keep in mind that these models are just that: models. They are not fortune tellers. Their accuracy should be constantly scrutinized and their methods questioned.


Rick Greenfield

Principal at R&A Associates


There is a lot of good material missed here, namely the predictions before the 3-11 event here in Japan. The most common explanation offered by TEPCO in the aftermath was that it was "sotegai" (beyond imagination), except it was not: there had been many warnings and indicators in the past, including geological strata showing that something very similar had happened about 1,000 years before. Models and simulations are data driven, and data is never and cannot be 100 percent complete, so even the best models will only rise to the mid level of improbable events. As I learned researching after 3-11, one of the great problems, and one that still exists in too many places, is the silo effect: political risk here, natural disaster there, and so on. At one time that may have worked, and certainly the tools to combine many kinds of input were lacking. Still, too little has changed, too slowly.

Yes, the assumptions are crucial. Thank you for sharing. Sounds a little like HAL.

More articles by Benjamin Larson PhD
