Cross-Validation in Machine Learning

Dividing the data set into a fixed training set and a fixed test set can be problematic when it results in the test set being small. A small test set produces statistical uncertainty around the estimated average test error, making it difficult to claim that one algorithm works better than another on the task.

When the data set has hundreds of thousands of examples or more, this is not a serious issue. When the data set is small, however, a fixed split cannot yield an accurate estimate of the generalization error, because the mean of a loss L computed on a small test set has too high a variance. Alternative procedures allow the mean test error to be estimated more reliably, at the price of increased computational cost: the training and testing computation is repeated on different randomly chosen splits of the original data set into training and test sets. The most common such procedure is the k-fold cross-validation method.
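The following small simulation (not from the article, assuming a 0-1 loss and a hypothetical true error of 10%) illustrates why a small test set gives a noisy estimate: the spread of the estimated error shrinks roughly as 1/sqrt(n_test).

```python
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.10  # assumed true generalization error (illustrative)

for n_test in (50, 500, 5000):
    # Simulate many test sets of size n_test and look at the spread
    # of the estimated error rate across those test sets.
    estimates = rng.binomial(n_test, true_error, size=10_000) / n_test
    print(f"n_test={n_test:5d}  mean={estimates.mean():.3f}  std={estimates.std():.4f}")
```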

k-fold Procedure

  • The original sample is randomly partitioned into k equal-sized subsamples.
  • A single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
  • The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data.
  • The k results can then be averaged to produce a single estimate.
  • The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
  • 10-fold cross-validation (k = 10) is commonly used; a minimal sketch of the procedure appears below.
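
The sketch below walks through these steps with scikit-learn's KFold splitter; the data set, the logistic-regression model, and k = 10 are illustrative assumptions, not part of the article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # illustrative data set

# Randomly partition the sample into k = 10 subsamples (folds).
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []

for train_idx, test_idx in kf.split(X):
    # Each fold serves as the validation data exactly once;
    # the remaining k - 1 folds are used as training data.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

# Average the k results into a single estimate of the test error.
print(f"Mean 10-fold error: {np.mean(fold_errors):.3f} "
      f"(std {np.std(fold_errors):.3f})")
```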
