Validation - A Short Post

I have come across a lot of people in the field of data science who believe that data for modeling should be split into only two parts: training and testing.

When I said three parts, the reaction was mostly 'Oh, okay! What's the third one?' That is the objective of this short post: to understand the third split, how it is different, and why it is important.

Let me frame the problem as a story of three friends: James, John and Orion. Do the first two sound familiar? Well, I am probably no more creative than most English parents; those are among the most common English names. But Orion? That's less common. Read on...

For this example, we will consider Orion smarter than the rest; we'll see why soon. James and John were given the task of building a model to explain housing prices.

James built a regression model on all the data that was given to him. He did all the fancy statistical stuff and got an amazing R-square of 0.92 (rare in practical examples), and he leaped for joy. Happy and satisfied, he submitted his model with a glow of pride.

John, at the same time, decided to split his dataset into a training set and a testing set in an 80:20 ratio. He followed James and built a similar model, attaining an R-square of 0.86 on the training data. Satisfied with that, he applied the model to the testing data and got an R-square of 0.65. John kept tweaking the model until he achieved a training R-square of 0.82 and a testing R-square of 0.79. Believing his model generalized well, he submitted it. John then shared with James the importance of having a testing set. He explained how it had let him improve his model and prevent overfitting by showing how the model would behave on unseen data.
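For readers who want to see the 80:20 workflow in code, here is a minimal sketch. Everything in it is an assumption made for illustration: the housing data is synthetic (price as a noisy function of area), the model is a one-feature least-squares line, and the names `fit_line`, `r_squared` and `score` are hypothetical helpers, not anything from the story.

```python
import random

def r_squared(y_true, y_pred):
    """R-square = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic, made-up data: price depends on area plus random noise.
rng = random.Random(0)
areas = [rng.uniform(50, 250) for _ in range(200)]
prices = [30 + 1.5 * a + rng.gauss(0, 40) for a in areas]

# Shuffle the row indices, then cut 80% for training and 20% for testing.
indices = list(range(len(areas)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Fit on the training rows only.
slope, intercept = fit_line([areas[i] for i in train_idx],
                            [prices[i] for i in train_idx])

def score(idx):
    """R-square of the fitted line on the rows listed in idx."""
    preds = [slope * areas[i] + intercept for i in idx]
    return r_squared([prices[i] for i in idx], preds)

train_r2 = score(train_idx)
test_r2 = score(test_idx)
print(f"train R-square: {train_r2:.2f}, test R-square: {test_r2:.2f}")
```

The testing R-square printed here is only an honest estimate as long as nobody goes back and adjusts the model after looking at it.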

Orion came in and asked James and John how well they thought their models would explain prices in real life. James was less sure after hearing John's idea, while John was confident that his would be about 80%. Orion worked through the same problem and said it wouldn't be more than 70%, and when John's model was deployed it indeed achieved an R-square between 68% and 71%. So how did Orion figure this out, and why was John wrong?

When John was tweaking his model based on feedback from the testing data, he didn't realize that he was unknowingly introducing bias: the model had, in effect, already seen the testing data, so the testing R-square was no longer an honest estimate of performance on unseen data. What Orion did was split the data into three parts: he trained the model on the first, tweaked it based on the second (the validation set), and determined the model's accuracy by running it on the third without any further tweaking.
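Orion's three-way split can be sketched roughly as follows. This is an illustration under stated assumptions, not the method used in the story: the data is synthetic, the 60/20/20 ratio is my own choice (the story does not give one), and the "tweaking" is reduced to picking the number of neighbours `k` for a simple nearest-neighbour regressor. The helper names are hypothetical.

```python
import random

def r_squared(y_true, y_pred):
    """R-square = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def knn_predict(train_rows, area, k):
    """Average the prices of the k training rows whose area is closest."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - area))[:k]
    return sum(price for _, price in nearest) / k

def score(train_rows, eval_rows, k):
    """R-square on eval_rows for a k-nearest-neighbour model built on train_rows."""
    preds = [knn_predict(train_rows, area, k) for area, _ in eval_rows]
    return r_squared([price for _, price in eval_rows], preds)

# Synthetic, made-up data: (area, price) pairs with random noise.
rng = random.Random(42)
rows = []
for _ in range(300):
    area = rng.uniform(50, 250)
    rows.append((area, 30 + 1.5 * area + rng.gauss(0, 40)))

# Assumed 60/20/20 split into training, validation and testing sets.
rng.shuffle(rows)
train, validation, test = rows[:180], rows[180:240], rows[240:]

# Tweak using the validation set only: try several k, keep the best one.
candidates = [1, 3, 5, 10, 20]
val_scores = {k: score(train, validation, k) for k in candidates}
best_k = max(val_scores, key=val_scores.get)

# Touch the testing set exactly once, after all tweaking is finished.
test_r2 = score(train, test, best_k)
print(f"best k: {best_k}, validation R-square: {val_scores[best_k]:.2f}, "
      f"test R-square: {test_r2:.2f}")
```

The validation R-square tends to look a little optimistic because `best_k` was chosen to maximize it; the testing R-square has no such advantage, which is why it is the number to quote when someone asks how the model will do in real life.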

This post is not intended to make sure that accuracy is as close as possible across splits; it is meant to help readers understand what to expect. James, John and Orion are all equally smart statisticians, but Orion is the yogi who can see the future of his models because of the smart three-way split.
WOW! Very nice post, Anurag Halder. Storytelling is central to the human experience. I'm a big fan of your data-driven storytelling style with touches of humor that you include in your stories. I enjoyed reading your post today after a long time. Keep writing such amazing posts...good luck.
