Is a validation set in machine learning always helpful?

I have been studying data science for a couple of years, and one question always stays in my mind: is a validation set always helpful? Or, to put it more constructively, is there a general way to design a helpful validation set?

Let me briefly introduce the background. When we build a machine learning model, the data science "bible" says we should have three data sets: a training set, a validation set, and a test set. We fit the model on the training set, tune the hyperparameters on the validation set, and measure performance on the test set.
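For concreteness, here is a minimal sketch of that three-way workflow using scikit-learn; the synthetic dataset, the ridge model, and the alpha grid are all made up purely for illustration:

```python
# Sketch of the train / validation / test workflow described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune the hyperparameter (here, the ridge penalty) on the validation set.
best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Report performance once, on data never used for fitting or tuning.
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```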

From my point of view, the validation set actually plays part of the role of training, since hyperparameters are also part of the model. Looking back at why we need a validation set: we want to avoid overfitting, right? What does overfitting mean? It means the test performance of a seemingly "good" model is bad. Can we really avoid that by changing one-stage training into two-stage training?

Consider two extreme cases: the training set contains exactly the same data as the validation set, or the validation set is pure noise. The validation step is no help at all in the first case and can even deteriorate the model in the second.
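To make the second extreme concrete, here is a small sketch (the dataset and alpha grid are again made up) where the validation labels are random noise; whichever alpha "wins" on such a set reflects only the noise draw and carries no information about test performance:

```python
# Sketch of the "noise validation set" extreme: the validation labels
# bear no relation to the inputs, so tuning on them is arbitrary and
# can easily pick a worse hyperparameter than an untuned default.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = make_regression(n_samples=500, n_features=20,
                                   noise=10.0, random_state=1)
X_val = rng.normal(size=(200, 20))
y_val = rng.normal(size=200)  # pure noise labels

scores = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    scores[alpha] = mean_squared_error(y_val, model.predict(X_val))

# The "best" alpha here is meaningless with respect to generalization.
print(min(scores, key=scores.get), scores)
```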

Some responses to common counterarguments:

Some say we rely on the assumption that the validation set has the same distribution as the test set. Does this assumption really make sense? If it is acceptable, why don't we also assume the training set comes from the same distribution as the test set?

Some say we can use cross-validation. But can cross-validation really close the gap between training and test performance? No matter how you slice the training data (call it training + validation if you like), we are just changing the objective function of the optimization problem, as in the sketch below.
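For reference, here is a minimal k-fold cross-validation sketch (model and alpha grid are illustrative); note that every fold is still drawn from the same pool of training data, which is exactly the point being questioned:

```python
# Sketch of k-fold cross-validation: each point serves as validation
# data in exactly one fold, but all folds come from one data pool.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=800, n_features=20, noise=10.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # 5-fold CV; scores are negative MSE by scikit-learn convention.
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    print(alpha, -scores.mean())
```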

Some say we have seen good empirical performance. Yes, I agree. But we should still call it a potentially helpful technique, not a "must be done". I think whether the validation step helps really depends on the structure of the data and on how we select the validation set during data collection.

Thus, the meaningful question is: what is the best way to select a validation set, in the sense that we get at least some theoretical guarantee? Randomly holding out 40% of the data is obviously not the best option.
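As one concrete alternative to a purely uniform random split, here is a sketch of a stratified split for classification, which at least matches the label distribution between the two sets (the dataset and class weights are made up; whether this constitutes a real theoretical guarantee is exactly the open question):

```python
# Sketch of a distribution-aware split: stratifying on the label keeps
# class proportions equal in the training and validation sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# stratify=y preserves the 90/10 class balance in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.mean(), y_val.mean())  # similar positive-class rates
```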

Any discussion or sharp criticism is welcome. Let's learn together.

Interestingly, I don't think most machine learning practitioners really separate a validation set from a test set - the validation set and test set are the same, although statisticians would find this quite bad. There's some recent work by Kenji Kawaguchi and Yoshua Bengio that attempts to investigate the question of generalization and proves a generalization bound based on validation/test error: https://arxiv.org/pdf/1710.05468.pdf. It may be of interest to you.
