Validation - A Short Post
I have come across a lot of people in data science who believe that data for modeling should be split into only two parts: training and testing.
When I say three parts, the usual reaction is 'Oh okay, what's the third one?' That is the objective of this short post: to understand the third split, how it is different and why it is important.
Let me tell the problem as a story of three friends: James, John and Orion. Do the first two sound familiar? Well, I am probably about as creative as most English parents, since those are among the most common English names. But Orion? That's less common, so read on...
For this example we will consider Orion smarter than the rest; we'll see why soon. James and John were given a task: build a model to explain housing prices.
James built a regression model on all the data that was given to him. He did all the fancy statistical stuff and got an amazing R-square of 0.92 (rare in practical examples), and he jumped and leaped for joy. Happy and satisfied, he submitted his model, basking in pride.
John, at the same time, decided to split his dataset into a training set and a testing set in an 80:20 ratio. He followed James and built a similar model, attaining an R-square of 0.86 on the training data. Satisfied with that, he applied the model to the testing data and got an R-square of 0.65. John kept tweaking the model until he achieved a training R-square of 0.82 and a testing R-square of 0.79. Believing his model generalized well, he submitted it. John then explained to James the importance of having a testing set: by seeing how the model behaved on unseen data, he could improve it and prevent overfitting.
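For readers who want to see what James and John were comparing, here is a minimal sketch of R-square in plain Python. It uses the standard definition (one minus the residual sum of squares over the total sum of squares); the function name r_squared and the four house prices are made up for illustration and are not from the story's dataset.

```python
def r_squared(actual, predicted):
    """Return R-square = 1 - SS_res / SS_tot for two equal-length lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of the naive model that always predicts the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-square is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical prices (in thousands) and hypothetical model predictions.
prices = [200, 250, 300, 350]
predictions = [210, 240, 310, 340]
print(round(r_squared(prices, predictions), 3))  # prints 0.968
```

The same function can be run on the training rows and on the testing rows separately, which is how a gap like John's 0.86 versus 0.65 shows up.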
Orion came in and asked James and John how well they thought their models would explain prices in real life. James was less sure after hearing John's idea, while John was confident that his R-square would be about 0.80. Orion worked on the same problem and said it wouldn't be more than 0.70, and when John's model was deployed it indeed achieved an R-square between 0.68 and 0.71. So how did Orion figure this out, and why was John wrong?
When John was tweaking his model based on feedback from the testing data, he didn't realize that he was unknowingly introducing bias: every tweak he kept because it raised the testing score fitted the model a little more to the testing data, as if the model had seen it. The testing R-square was therefore no longer an honest estimate of performance on unseen data. What Orion did was split the data into three parts: he trained the model on the first, tweaked it based on the second (the validation set), and determined the model's accuracy by running it on the third without any further tweaking.
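Orion's three-way split can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the post does not say what proportions Orion used, so the 60/20/20 fractions, the function name three_way_split and the stand-in list of 100 houses are all hypothetical.

```python
import random


def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation and test lists.

    The 60/20/20 default is an assumption for illustration only.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is the test set
    return train, validation, test


# Stand-in for 100 rows of housing data (hypothetical).
houses = list(range(100))
train, validation, test = three_way_split(houses)
print(len(train), len(validation), len(test))
```

The discipline matters more than the code: fit on train, tweak as often as you like against validation, and score on test exactly once at the end, so that the test score has never influenced a modeling decision.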
This post is not about making accuracy across the splits as close as possible; it is about helping readers understand what to expect from a model. James, John and Orion are all equally smart statisticians, but Orion is the yogi who can see the future of his models because of the smart three-way split.
WOW! Very nice post, Anurag Halder. Storytelling is central to the human experience. I'm a big fan of your data-driven storytelling style with touches of humor that you include in your stories. I enjoyed reading your post today after a long time. Keep writing such amazing posts...good luck.