Cross-Validation: What is it and why use it?
Regression and classification Machine Learning models aim to predict a value or a class from the variables contained in the data. Each model has its own algorithm for identifying the patterns in the data that allow an accurate prediction to be made.
In addition to being accurate, models must also generalize well: they must be able to handle data they have never seen before and still reach an adequate result. One way of evaluating this generalization capacity is to apply Cross-Validation.
But what is Cross-Validation?
Cross-Validation is a technique used to obtain an estimate of the overall performance of a model. There are several Cross-Validation techniques, but they all basically consist of separating the data into training and testing subsets.
The training subset, as the name implies, is used during the training process to fit the parameters of the model. To measure the generalization capacity of the model after the training stage, the test subset is used.
The performance metrics of the model, such as Accuracy (classification) and Root Mean Squared Error (regression), are calculated using the true labels from the test dataset and the predictions made by the trained model on the test data.
There are many types of Cross-Validation techniques, and in this post I will talk about three of them: Holdout, K-Fold and Leave-One-Out.
Holdout Cross-Validation
Probably the most famous Cross-Validation technique is the Holdout. It consists of splitting the whole dataset into two non-overlapping groups: a training set and a test set. The split can be made by shuffling the data first or by keeping its original order, depending on the project.
It is common to see a 70/30 split in projects and studies, with 70% of the data being used to train the model and the remaining 30% being used to test and evaluate it. However, this ratio is not a rule and may vary depending on the specifics of the project.
In Python, the Holdout Cross-Validation is easily done using the train_test_split function from the scikit-learn library.
Using the Breast Cancer Dataset and a 70/30 split:
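A minimal sketch of what this might look like (the variable names and random seed here are illustrative, not taken from the original code):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset as feature matrix X and label vector y
X, y = load_breast_cancer(return_X_y=True)

# Holdout split: 70% of the samples for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

print(X_train.shape, X_test.shape)  # roughly 70% / 30% of the 569 samples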
K-Fold Cross-Validation
Before training, K-Fold Cross-Validation splits the whole dataset into K subsets (folds) of approximately equal size. Each fold is then used once as the test set, while the remaining K-1 folds form the training set.
In practice, this technique generates K different models with K different results. The final result of the K-Fold Cross-Validation is the average of the individual metrics obtained on each fold.
It is important to note that, since K-Fold divides the original data into smaller subsets, the size of the dataset and the number of folds K must be taken into account. If the dataset is small or K is too large, the resulting folds may become very small.
This may leave only a small amount of data to train the models, resulting in poor performance, since the algorithm cannot learn the patterns in the data from so little information.
Python also offers an easy way to perform the K-Fold split using the KFold class from the scikit-learn library.
Using the same dataset as before, with K = 3 (refer to the GitHub link at the end of this post for the full code):
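A sketch of how the 3-fold loop could look, assuming a Decision Tree Classifier as the model (the same classifier used in the comparison at the end of this post); the exact code lives in the linked repository:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Split the data into K = 3 folds; each fold serves as the test set once
kf = KFold(n_splits=3, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train a fresh model on the K-1 training folds
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate on the held-out fold
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The final K-Fold result is the average of the K individual scores
print(np.mean(scores))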
Fundamentally, Holdout Cross-Validation can be seen as K-Fold with a single split: only one train/test partition is created and evaluated.
Leave-One-Out Cross-Validation
Leave-One-Out Cross-Validation consists of creating multiple training and test sets, where each test set contains only one sample of the original data and the training set consists of all the other samples. This process is repeated for every sample in the original dataset.
This type of validation is usually very time-consuming, because if the data contains n samples, the algorithm has to train (using n-1 samples) and evaluate the model n times.
On the positive side, of all the techniques covered in this post, this is the one in which each model is trained on the largest number of samples, which may result in better models. Also, there is no need to shuffle the data, since all possible train/test combinations will be generated.
Leave-One-Out Cross-Validation is also available in the scikit-learn library, through the LeaveOneOut class.
Using the Breast Cancer Dataset, we have:
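A compact sketch, using scikit-learn's cross_val_score helper for brevity (an assumption on my part; the original code may loop over the splits explicitly):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the n samples becomes the test set exactly once,
# so the model is trained and evaluated n times
loo = LeaveOneOut()
model = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(model, X, y, cv=loo)
print(scores.mean())  # average accuracy over all n runs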
Similar to the Holdout, Leave-One-Out Cross-Validation is also a special case of K-Fold, where K equals the number of samples in the dataset.
Performance comparison
To show the difference in performance for each type of Cross-Validation, the three techniques will be applied with a simple Decision Tree Classifier to predict whether a patient in the Breast Cancer dataset has a benign (class 1) or malignant (class 0) tumor. For this comparison, a Holdout with a 70/30 split, a 3-Fold and the Leave-One-Out will be used, as sketched below.
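A condensed sketch of how such a comparison could be set up (timings and accuracies will differ from the original results, and the random seeds are illustrative):

import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# Holdout with a 70/30 split
start = time.perf_counter()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
print(f"Holdout:       acc={acc:.3f}  time={time.perf_counter() - start:.2f}s")

# 3-Fold
start = time.perf_counter()
acc = cross_val_score(model, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=42)).mean()
print(f"3-Fold:        acc={acc:.3f}  time={time.perf_counter() - start:.2f}s")

# Leave-One-Out
start = time.perf_counter()
acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
print(f"Leave-One-Out: acc={acc:.3f}  time={time.perf_counter() - start:.2f}s")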
The code used can be found on my GitHub page: https://github.com/Imtiaz-Storyteller/Cross_Validation
The results obtained are shown in the table below:
As expected, the run time for the Leave-One-Out was much greater than that of the other two techniques, and although it used more data to train each model, it did not achieve the best performance overall.
The best technique for this specific problem was the Holdout Cross-Validation with 70% of the data used for training and 30% used for testing the model.
Thank you for reading; I hope this was helpful to you.
Any comments and suggestions are more than welcome.
Feel free to reach me on LinkedIn and to check out my GitHub.