Validation - A Short Post

Anurag Halder

Published Feb 1, 2018

I have come across a lot of people in data science who believe that data for modeling should be split into only two parts: training and testing.

When I said three parts, the usual reaction was 'Oh, okay! What's the third one?' That is the objective of this short post: to understand the third split, how it is different, and why it is important.

Let me tell the problem as a story of three friends: James, John and Orion. Do the first two sound familiar? Well, I am probably about as creative as most English parents; those are among the most common English names. But Orion? That's less common. Read on...

For this example we will consider Orion smarter than the rest; we'll see why soon. James and John were given the task of building a model to explain housing prices.

James built a regression model on all the data that was given to him, did all the fancy statistical stuff, and got an amazing R-square of 0.92 (rare in practical examples). He jumped for joy. Happy and satisfied, he submitted his model, basking in pride.

John, at the same time, decided to split his dataset into a training set and a testing set in an 80:20 ratio. He followed James and built a similar model, attaining an R-square of 0.86 on the training data. Satisfied with that, he applied the model to the testing data and got an R-square of 0.65. John kept tweaking the model until he achieved a training R-square of 0.82 and a testing R-square of 0.79. Believing that his model generalized well, he submitted it. John then explained to James the importance of having a testing set: by seeing how the model behaved on unseen data, he could improve it and prevent overfitting.
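John's first step, the 80:20 split and the comparison of training and testing R-square, might look roughly like the sketch below. The data is synthetic and stands in for the housing data, which the story does not describe. The helper names (`r_squared`, `split_80_20`) and the plain least-squares model are assumptions made for illustration, not John's actual method.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def split_80_20(n_rows, seed=0):
    """Shuffle row indices and cut them into 80% training, 20% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(round(0.8 * n_rows))
    return idx[:cut], idx[cut:]

# Synthetic stand-in for the housing data (assumption, not real prices).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=2.0, size=200)

train_idx, test_idx = split_80_20(len(y))

# Ordinary least squares with an intercept column, fitted on training rows only.
A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# Score the same fitted coefficients on both splits.
A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
r2_train = r_squared(y[train_idx], A_train @ coef)
r2_test = r_squared(y[test_idx], A_test @ coef)

print(f"training R-square: {r2_train:.2f}")
print(f"testing  R-square: {r2_test:.2f}")
```

A large gap between the two numbers is the overfitting signal John was reacting to.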

Orion came in and asked James and John how well they thought their models would explain prices in real life. James was less sure after hearing John's idea, while John was confident that his model would reach an R-square of about 0.80. Orion worked on the same problem and said it would not be more than 0.70, and when John's model was deployed it indeed achieved an R-square between 0.68 and 0.71. So how did Orion figure this out, and why was John wrong?

When John was tweaking his model based on feedback from the testing data, he did not realize that he was introducing bias. Every tweak used information from the testing set, so the model had in effect seen the testing data, and the testing R-square was no longer an honest estimate of performance on unseen data. Orion instead split the data into three parts. He trained the model on the first, tweaked it based on the second (the validation set), and determined the model's accuracy by running it on the third without any further tweaking.
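Orion's three-way split can be sketched as follows. The story gives no ratio, model, or tuning knob, so the 60/20/20 split, the ridge regression, the candidate penalty values, and the synthetic data are all assumptions chosen for illustration. The point is only the procedure: fit on the first part, choose among candidates on the second, and score once on the third.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def three_way_split(n_rows, seed=0):
    """Shuffle row indices into 60% train, 20% validation, 20% test (assumed ratio)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_train = int(round(0.6 * n_rows))
    n_val = int(round(0.2 * n_rows))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def predict(X, w, b):
    return X @ w + b

# Synthetic stand-in for the housing data (assumption, not real prices).
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 20))
true_w = np.concatenate([rng.normal(size=5), np.zeros(15)])
y = X @ true_w + rng.normal(scale=2.0, size=300)

train_idx, val_idx, test_idx = three_way_split(len(y))

# Tweak on the validation set only: try each candidate penalty, keep the best.
alphas = [0.01, 1.0, 10.0, 100.0]
val_scores = {}
for alpha in alphas:
    w, b = fit_ridge(X[train_idx], y[train_idx], alpha)
    val_scores[alpha] = r_squared(y[val_idx], predict(X[val_idx], w, b))
best_alpha = max(val_scores, key=val_scores.get)

# Touch the test set exactly once, after all tweaking is finished.
w, b = fit_ridge(X[train_idx], y[train_idx], best_alpha)
r2_test = r_squared(y[test_idx], predict(X[test_idx], w, b))

print(f"chosen alpha: {best_alpha}")
print(f"validation R-square: {val_scores[best_alpha]:.2f}")
print(f"test R-square (report this one): {r2_test:.2f}")
```

Because the test rows play no part in fitting or in choosing `best_alpha`, the last number is the one to quote as the expected real-life performance.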

This post is not about making accuracy across the splits as close as possible; it is about helping readers understand what to expect. James, John and Orion may all be equally capable statisticians, but Orion is the yogi who can see the future of his models because of the smart three-way split.

Badrinath Vankadari (Badri) 8y

WOW! Very nice post, Anurag Halder. Storytelling is central to the human experience. I'm a big fan of your data-driven storytelling style with touches of humor that you include in your stories. I enjoyed reading your post today after a long time. Keep writing such amazing posts...good luck.
