Cross-Validation: What is it and why use it?
Regression and classification Machine Learning models aim to predict a value or a class from the variables contained in the data. Each model has its own algorithm for identifying the patterns in the data that allow an accurate prediction to be made.
In addition to being accurate, models must also generalize well: they must be able to handle data they have never seen before and still reach an adequate result. One way of evaluating this generalization capacity is to apply Cross-Validation.
But what is Cross-Validation?
Cross-Validation is a technique used to obtain an estimate of the overall performance of a model. There are several Cross-Validation techniques, but they all basically consist of separating the data into training and testing subsets.
The training subset, as the name implies, is used during the training process to fit the parameters of the model. To measure the generalization capacity of the model after the training stage, the test subset is used.
The performance metrics of the model, such as Accuracy (classification) and Root Mean Squared Error (regression), are calculated using the true labels from the test dataset and the predictions made by the trained model on the test data.
There are many types of Cross-Validation techniques, and in this post I will talk about three of them: Holdout, K-Fold and Leave-One-Out.
Holdout Cross-Validation
Probably the most famous Cross-Validation technique is the Holdout. It consists of splitting the whole dataset into two non-overlapping groups: a training set and a test set. The split can be made by shuffling the data first or by keeping its original order, depending on the project.
It is common to see a 70/30 split in projects and studies, with 70% of the data being used to train the model and the remaining 30% being used to test and evaluate it. However, this ratio is not a rule and may vary depending on the specifics of the project.
In Python, the Holdout Cross-Validation is easily done using the train_test_split function from the scikit-learn library.
Using the Breast Cancer Dataset and a 70/30 split:
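A minimal sketch of what this might look like (the variable names and random seed here are illustrative, not taken from the original code):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset as feature matrix X and label vector y
X, y = load_breast_cancer(return_X_y=True)

# Holdout split: 70% of the samples for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

print(X_train.shape, X_test.shape)  # roughly 70% / 30% of the 569 samples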
K-Fold Cross-Validation
Before training, K-Fold Cross-Validation splits the whole dataset into K subsets (folds) of approximately equal size. Each fold is then used once as the test set, while the remaining K-1 folds form the training set.
In practice, this technique generates K different models with K different results. The final result of the K-Fold Cross-Validation is the average of the individual metrics obtained on each fold.
It is important to note that, since K-Fold divides the original data into smaller subsets, the size of the dataset and the number of folds K must be taken into account. If the dataset is small or K is too large, the resulting folds may become very small.
This may leave only a small amount of data to train the models, resulting in poor performance, since the algorithm cannot learn the patterns in the data from so little information.
Python also offers an easy way to perform the K-Fold split using the KFold class from the scikit-learn library.
Using the same dataset as before, with K = 3 (refer to the GitHub link at the end of this post for the full code):
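A sketch of how the 3-fold loop could look, assuming a Decision Tree Classifier as the model (the same classifier used in the comparison at the end of this post); the exact code lives in the linked repository:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Split the data into K = 3 folds; each fold serves as the test set once
kf = KFold(n_splits=3, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train a fresh model on the K-1 training folds
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate on the held-out fold
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The final K-Fold result is the average of the K individual scores
print(np.mean(scores))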
Fundamentally, Holdout Cross-Validation can be seen as K-Fold with a single split: only one train/test partition is created and evaluated.
Leave-One-Out Cross-Validation
Leave-One-Out Cross-Validation consists of creating multiple training and test sets, where each test set contains only one sample of the original data and the training set consists of all the other samples. This process is repeated for every sample in the original dataset.
This type of validation is usually very time-consuming, because if the data contains n samples, the algorithm has to train (using n-1 samples) and evaluate the model n times.
On the positive side, of all the techniques covered in this post, this is the one in which each model is trained on the largest number of samples, which may result in better models. Also, there is no need to shuffle the data, since all possible train/test combinations will be generated.
Leave-One-Out Cross-Validation is also available in the scikit-learn library, through the LeaveOneOut class.
Using the Breast Cancer Dataset, we have:
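A compact sketch, using scikit-learn's cross_val_score helper for brevity (an assumption on my part; the original code may loop over the splits explicitly):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the n samples becomes the test set exactly once,
# so the model is trained and evaluated n times
loo = LeaveOneOut()
model = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(model, X, y, cv=loo)
print(scores.mean())  # average accuracy over all n runs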
Similar to the Holdout, Leave-One-Out Cross-Validation is also a special case of K-Fold, where K equals the number of samples in the dataset.
Performance comparison
To show the difference in performance for each type of Cross-Validation, the three techniques will be applied with a simple Decision Tree Classifier to predict whether a patient in the Breast Cancer dataset has a benign (class 1) or malignant (class 0) tumor. For this comparison, a Holdout with a 70/30 split, a 3-Fold and the Leave-One-Out will be used, as sketched below.
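A condensed sketch of how such a comparison could be set up (timings and accuracies will differ from the original results, and the random seeds are illustrative):

import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# Holdout with a 70/30 split
start = time.perf_counter()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
print(f"Holdout:       acc={acc:.3f}  time={time.perf_counter() - start:.2f}s")

# 3-Fold
start = time.perf_counter()
acc = cross_val_score(model, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=42)).mean()
print(f"3-Fold:        acc={acc:.3f}  time={time.perf_counter() - start:.2f}s")

# Leave-One-Out
start = time.perf_counter()
acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
print(f"Leave-One-Out: acc={acc:.3f}  time={time.perf_counter() - start:.2f}s")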
The code used can be found on my GitHub page: https://github.com/Imtiaz-Storyteller/Cross_Validation
The results obtained are shown in the table below:
As expected, the run time for the Leave-One-Out was much greater than that of the other two techniques, and although it used more data to train each model, it did not achieve the best performance overall.
The best technique for this specific problem was the Holdout Cross-Validation with 70% of the data used for training and 30% used for testing the model.
Thank you for reading; I hope this was helpful to you.
Any comments and suggestions are more than welcome.
Feel free to reach me on LinkedIn and to check out my GitHub.