Multicollinearity in Machine Learning

In a machine learning problem we have two types of features: independent features (X), also called predictors, and the dependent feature (the target, y). When one independent feature (x1) is highly correlated with another independent feature (x2), the situation is called multicollinearity. Here, correlation means that a change in one feature is accompanied by a change in the other. Correlation can be positive or negative: if x2 increases as x1 increases, the correlation is positive, whereas if x2 decreases as x1 increases, the correlation is negative.
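
A minimal sketch of both cases with synthetic data (the arrays x1, x2 and x3 are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # moves with x1 -> positive correlation
x3 = -x1 + rng.normal(scale=0.1, size=100)      # moves against x1 -> negative correlation

print(np.corrcoef(x1, x2)[0, 1])  # close to +1
print(np.corrcoef(x1, x3)[0, 1])  # close to -1
```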

Now the question arises: why is multicollinearity bad?

Let me explain with a linear regression example. The equation for linear regression is y = a*x1 + b*x2 + c, where y is the target feature, x1 and x2 are independent features, a and b are the coefficients of x1 and x2 respectively, and c is the constant/bias. Linear regression works on the assumption that a is the amount of change in y when x1 changes by one unit, holding all other variables constant. In the same way, b is the amount of change in y when x2 changes by one unit, keeping all other variables constant. If there is multicollinearity between x1 and x2, a change in one is tied to a change in the other, which violates this assumption. As a result, the coefficient estimates become unstable, the model will not behave as expected, and we can't tell which feature is contributing more to the prediction. To make it clearer: suppose you take coaching for 5 different subjects from 5 different teachers. After your final result, based on your marks in each subject, you can tell each teacher's contribution to getting you passed (keeping your self-study constant). But if you take Math coaching from two different teachers, your final marks can't tell you which teacher contributed more.
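
Here is a minimal sketch of that instability, on synthetic data where y truly depends only on x1; the near-duplicate feature x2 is an assumption made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 3*x1 + noise, but x2 is a near-copy of x1 (multicollinearity).
for seed in (1, 2):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)   # highly correlated with x1
    y = 3 * x1 + rng.normal(scale=0.5, size=50)
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    print(model.coef_)  # a and b swing wildly between samples; only their sum stays near 3
```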

One more question arises: is multicollinearity always bad?

The answer is: no! If your aim is only prediction, you can live with multicollinearity. But if your aim is to know which feature contributes more, or which feature is more important, then multicollinearity is a problem and you need to remove it.

Even so, it is generally good practice to remove multicollinearity.

Different ways to detect multicollinearity in your data

  1. Domain knowledge
  2. Scatter plot between two features
  3. Correlation matrix / heatmap
  4. VIF (Variance Inflation Factor), the most used method; see the sketch after this list
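
A short sketch of methods 3 and 4 on synthetic data, using pandas for the correlation matrix and statsmodels' variance_inflation_factor for VIF (a common rule of thumb flags features with VIF above 5 or 10):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=100),  # correlated with x1
    "x3": rng.normal(size=100),                      # unrelated feature
})

print(X.corr())  # method 3: the correlation-matrix view

# Method 4: VIF. Add a constant column first, the standard practice for VIF.
Xc = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 get very large VIFs; x3 stays near 1
```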

How to remove multicollinearity?

  1. Remove one of the two correlated features. Most used method (works for both regression and classification problems).
  2. Perform Ridge/Lasso regression (mainly for regression problems).
  3. Use PCR (Principal Component Regression). Options 2 and 3 are sketched below.
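
A minimal sketch of options 2 and 3 on the same kind of synthetic data (alpha=1.0 and the single principal component are illustrative choices, not tuned values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# Option 2: Ridge shrinks the correlated coefficients toward a stable split.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)   # roughly equal coefficients instead of wild opposite-signed values

# Option 3: PCR, i.e. regress on principal components instead of raw features.
pcr = make_pipeline(PCA(n_components=1), LinearRegression()).fit(X, y)
print(pcr.score(X, y))  # one component captures the shared direction of x1 and x2
```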

Does multicollinearity affect all ML algorithms?

It does not affect the predictions of tree-based algorithms (decision trees, random forests, gradient boosting), since a split can simply pick either of the correlated features. It does affect parametric algorithms such as linear regression and logistic regression, whose coefficient estimates become unstable, and Naive Bayes, which assumes the features are independent of each other.
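
A rough sketch of this difference: on synthetic data, adding a near-duplicate feature barely changes a random forest's cross-validated score, while (as shown earlier) it destabilizes a linear model's coefficients.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y = 3 * x1 + rng.normal(scale=0.5, size=200)

X_single = x1.reshape(-1, 1)                                          # x1 alone
X_dup = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=200)])  # plus a near-duplicate

for X in (X_single, X_dup):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    print(cross_val_score(rf, X, y, cv=5).mean())  # scores barely change
```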
