Multicollinearity in Machine Learning
In a Machine Learning problem we have two types of features: independent features (X), also called predictors, and the dependent feature (Target/Y). When one independent feature (X1) is highly correlated with another independent feature (X2), that situation is called Multicollinearity. Here correlation means that a change in one feature is accompanied by a change in the other. Correlation can be positive as well as negative: when X1 increases and X2 also increases, the correlation is positive, whereas when X1 increases and X2 decreases, it is a case of negative correlation.
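To make the positive/negative distinction concrete, here is a minimal sketch with made-up numbers (the arrays and values are my own illustration, not from the article), using numpy's correlation coefficient:

```python
import numpy as np

# Hypothetical data: x2 moves with x1 (positive correlation),
# x3 moves against x1 (negative correlation).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # x2 = 2 * x1
x3 = np.array([10.0, 8.0, 6.0, 4.0, 2.0])   # x3 = 12 - 2 * x1

print(np.corrcoef(x1, x2)[0, 1])  # ~  1.0 (perfect positive correlation)
print(np.corrcoef(x1, x3)[0, 1])  # ~ -1.0 (perfect negative correlation)
```

A correlation near +1 or -1 between two *independent* features is exactly the multicollinearity situation described above.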
Now the question arises: why is Multicollinearity bad?
Let me explain with an example from Linear Regression. The equation for linear regression is y = a·x1 + b·x2 + c, where y is the target feature, x1 and x2 are independent features, a and b are the coefficients of x1 and x2 respectively, and c is the constant/bias. Linear Regression interprets a as the amount of change in y when x1 changes by 1 unit, with all other variables held constant. In the same way, b is the amount of change in y when x2 changes by 1 unit, keeping all other variables constant. If there is multicollinearity between x1 and x2, a change in one affects the other, so this "holding everything else constant" interpretation breaks down: the coefficient estimates become unstable, and we can't tell which variable/feature is contributing more to the prediction. To make it clearer: suppose you take coaching in 5 different subjects from 5 different teachers. After your final result, your marks in each subject tell you each teacher's contribution to getting you passed (keeping your self-study constant). But if you take two Maths coachings from two different teachers, your final marks cannot tell you which teacher contributed more.
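The coefficient argument above can be sketched numerically. In this minimal synthetic example (the data, seed, and coefficient values 3 and 2 are my own assumptions for illustration), x2 is nearly identical to x1, so the individual coefficients a and b are poorly determined even though their sum, and hence the prediction, is recovered well:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # x2 almost duplicates x1
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=n)

# Ordinary least squares via numpy (columns: x1, x2, intercept).
X = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The individual coefficients can land far from (3, 2), but their
# sum -- the joint effect of the correlated pair -- stays near 5.
# We can still predict, but we cannot attribute credit.
print(coef[0], coef[1], coef[0] + coef[1])
```

This is the teacher analogy in code: the two "Maths teachers" x1 and x2 together explain y well, but the data cannot say how much each one contributed.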
One more question arises: is Multicollinearity always bad?
The answer is: No! If your aim is only to make predictions, you can live with multicollinearity. But if your aim is to know which feature contributes more, or which feature is more important, then multicollinearity is a problem and you need to remove it.
That said, it is generally good practice to remove multicollinearity.
How can you detect Multicollinearity in your data?
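One standard detector (my suggestion here, not spelled out in the article) is the variance inflation factor (VIF): for each feature, regress it on all the other features and compute VIF = 1 / (1 - R²). A VIF above roughly 5-10 is commonly read as a sign of problematic multicollinearity. A minimal numpy-only sketch:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the remaining columns (plus an intercept).
    """
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        pred = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - pred) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(out)

# Synthetic demo: b nearly duplicates a, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = a + rng.normal(scale=0.1, size=100)
c = rng.normal(size=100)
X = np.column_stack([a, b, c])
print(vif(X))  # first two VIFs are large, the third is near 1
```

A simpler first check is the pairwise correlation matrix (`np.corrcoef(X, rowvar=False)`), but VIF also catches the case where a feature is predictable from a *combination* of the others.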
How do you remove Multicollinearity?
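Common remedies (again my additions, since the article does not enumerate them) are to drop one feature from each highly correlated pair, combine correlated features into one, or use regularization such as Ridge regression. The simplest of these — greedily dropping near-duplicate columns by a correlation threshold — can be sketched as follows; the function name and threshold are illustrative choices:

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return the column indices to keep: greedily drop any column
    whose absolute correlation with an already-kept column exceeds
    the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic demo: column 1 nearly duplicates column 0.
rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = a + rng.normal(scale=0.05, size=100)
c = rng.normal(size=100)
X = np.column_stack([a, b, c])
print(drop_correlated(X))  # keeps columns 0 and 2, drops column 1
```

Which column of a correlated pair you drop is a modeling choice; domain knowledge (which feature is cheaper to collect, or easier to explain) usually decides it.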
Does Multicollinearity affect all ML algorithms?
It does not affect tree-based algorithms, but it does affect parametric algorithms such as Linear Regression, Logistic Regression, Naive Bayes, etc.