Multicollinearity in Machine Learning

In a machine learning problem we have two types of features: independent features (X), also called predictors, and the dependent feature (the target, y). When one independent feature (x1) is highly correlated with another independent feature (x2), the situation is called multicollinearity. Here, correlation means that a change in one feature is accompanied by a change in the other. Correlation can be positive or negative: if x2 increases as x1 increases, the correlation is positive, whereas if x2 decreases as x1 increases, the correlation is negative.
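
A minimal sketch of both cases with synthetic data (the arrays x1, x2 and x3 are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # moves with x1 -> positive correlation
x3 = -x1 + rng.normal(scale=0.1, size=100)      # moves against x1 -> negative correlation

print(np.corrcoef(x1, x2)[0, 1])  # close to +1
print(np.corrcoef(x1, x3)[0, 1])  # close to -1
```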

Now the question arises: why is multicollinearity bad?

Let me explain with a linear regression example. The equation for linear regression is y = a*x1 + b*x2 + c, where y is the target feature, x1 and x2 are independent features, a and b are the coefficients of x1 and x2 respectively, and c is the constant/bias. Linear regression works on the assumption that a is the amount of change in y when x1 changes by one unit, holding all other variables constant. In the same way, b is the amount of change in y when x2 changes by one unit, keeping all other variables constant. If there is multicollinearity between x1 and x2, a change in one is tied to a change in the other, which violates this assumption. As a result, the coefficient estimates become unstable, the model will not behave as expected, and we can't tell which feature is contributing more to the prediction. To make it clearer: suppose you take coaching for 5 different subjects from 5 different teachers. After your final result, based on your marks in each subject, you can tell each teacher's contribution to getting you passed (keeping your self-study constant). But if you take Math coaching from two different teachers, your final marks can't tell you which teacher contributed more.
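
Here is a minimal sketch of that instability, on synthetic data where y truly depends only on x1; the near-duplicate feature x2 is an assumption made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 3*x1 + noise, but x2 is a near-copy of x1 (multicollinearity).
for seed in (1, 2):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)   # highly correlated with x1
    y = 3 * x1 + rng.normal(scale=0.5, size=50)
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    print(model.coef_)  # a and b swing wildly between samples; only their sum stays near 3
```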

One more question arises: is multicollinearity always bad?

The answer is: no! If your aim is only prediction, you can live with multicollinearity. But if your aim is to know which feature contributes more, or which feature is more important, then multicollinearity is a problem and you need to remove it.

Even so, it is generally good practice to remove multicollinearity.

Different ways to detect multicollinearity in your data

  1. Domain knowledge
  2. Scatter plot between two features
  3. Correlation matrix / heatmap
  4. VIF (Variance Inflation Factor), the most used method; see the sketch after this list
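
A short sketch of methods 3 and 4 on synthetic data, using pandas for the correlation matrix and statsmodels' variance_inflation_factor for VIF (a common rule of thumb flags features with VIF above 5 or 10):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=100),  # correlated with x1
    "x3": rng.normal(size=100),                      # unrelated feature
})

print(X.corr())  # method 3: the correlation-matrix view

# Method 4: VIF. Add a constant column first, the standard practice for VIF.
Xc = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 get very large VIFs; x3 stays near 1
```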

How to remove multicollinearity?

  1. Remove one of the two correlated features. Most used method (works for both regression and classification problems).
  2. Perform Ridge/Lasso regression (mainly for regression problems).
  3. Use PCR (Principal Component Regression). Options 2 and 3 are sketched below.
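
A minimal sketch of options 2 and 3 on the same kind of synthetic data (alpha=1.0 and the single principal component are illustrative choices, not tuned values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# Option 2: Ridge shrinks the correlated coefficients toward a stable split.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)   # roughly equal coefficients instead of wild opposite-signed values

# Option 3: PCR, i.e. regress on principal components instead of raw features.
pcr = make_pipeline(PCA(n_components=1), LinearRegression()).fit(X, y)
print(pcr.score(X, y))  # one component captures the shared direction of x1 and x2
```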

Does multicollinearity affect all ML algorithms?

It does not affect the predictions of tree-based algorithms (decision trees, random forests, gradient boosting), since a split can simply pick either of the correlated features. It does affect parametric algorithms such as linear regression and logistic regression, whose coefficient estimates become unstable, and Naive Bayes, which assumes the features are independent of each other.
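
A rough sketch of this difference: on synthetic data, adding a near-duplicate feature barely changes a random forest's cross-validated score, while (as shown earlier) it destabilizes a linear model's coefficients.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y = 3 * x1 + rng.normal(scale=0.5, size=200)

X_single = x1.reshape(-1, 1)                                          # x1 alone
X_dup = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=200)])  # plus a near-duplicate

for X in (X_single, X_dup):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    print(cross_val_score(rf, X, y, cv=5).mean())  # scores barely change
```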
