How to avoid multicollinearity in data science?

In most datasets and data science problems, multicollinearity is an almost inevitable issue, since many independent (predictor) variables are correlated with one another.

Avoiding it is a prerequisite for building many common predictive models such as Linear Regression, Logistic Regression and the Naïve Bayes classifier. Let us quickly review the assumptions of these models to understand where each one is usable.

Linear Regression:

  • Linearity between the predictor variables Xi and the target variable Y
  • Normality of the Xi variables
  • Homoscedasticity of the error term
  • Normality of the error term
  • No multicollinearity

Logistic Regression:

  • Linearity between the logit of the outcome probability and its Xi predictors
  • Independence of observations
  • A large sample size
  • No multicollinearity

Naïve Bayes classifier:

  • The Xi are assumed to be conditionally independent of one another given the class, which implies no multicollinearity

So, let us come back to our problem statement: avoiding multicollinearity in real-world case studies.

  • We can switch to other classification techniques such as Decision Trees (although overfitting is a major issue, they can be more useful for analysing historical data than as predictive models), Random Forests (which overcome overfitting but suffer from poor interpretability), and K-Nearest Neighbors, which makes no distributional assumptions about its variables and is relatively robust to outliers. A minimal fitting sketch follows this list.
  • If we are using a Linear Regression model, we may use a scatterplot/pairplot to spot and remove unwanted variables. More precisely, we can first apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). These can surface the important features, so we can eliminate the unimportant Xi variables that add nothing to the regression model and therefore exert no influence on Y. Even though these techniques are themselves most effective when the variables are free of multicollinearity, they can provide a breakthrough by removing some of the Xi variables (see the second sketch below).
  • If we are using a Logistic Regression model, then besides PCA we can use the chi-square test to flag associated categorical predictors (sketched below).
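
To make the first option concrete, here is a minimal sketch using scikit-learn on a synthetic dataset; the data and all parameter choices below are illustrative assumptions, not from any particular case study:

```python
# Minimal sketch: fitting the three assumption-light classifiers
# on synthetic data with deliberately redundant (correlated) features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # a shallow tree curbs the overfitting a single tree is prone to
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    # an ensemble overcomes overfitting at the cost of interpretability
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # distance-based, no distributional assumptions on the Xi
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", round(model.score(X_test, y_test), 3))
```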
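For the pairplot-plus-PCA route, another minimal sketch, again on assumed synthetic predictors; the DataFrame df and its column names are hypothetical:

```python
# Sketch: visual correlation check, then decorrelation via PCA.
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed synthetic predictors with built-in redundancy
X, _ = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=3, random_state=0)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(1, 7)])

sns.pairplot(df)            # scatterplots of every Xi pair expose correlated variables
print(df.corr().round(2))   # the correlation matrix tells the same story

# Standardize first so large-scale variables do not dominate the components
X_scaled = StandardScaler().fit_transform(df)

# Keep enough principal components to explain ~95% of the variance;
# the retained components are mutually uncorrelated by construction
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```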
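And for the chi-square route, a sketch that tests whether two categorical predictors are associated, so the redundant one can be dropped before fitting the logistic model; the DataFrame and the column names "region" and "segment" are hypothetical:

```python
# Sketch: chi-square test of independence between two categorical predictors.
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical predictors (not from any case study)
df = pd.DataFrame({
    "region":  ["north", "south", "north", "east", "south", "east"] * 50,
    "segment": ["a", "b", "a", "c", "b", "c"] * 50,
})

# Build the contingency table and run the test
contingency = pd.crosstab(df["region"], df["segment"])
chi2, p_value, dof, expected = chi2_contingency(contingency)

if p_value < 0.05:
    print("predictors are associated; consider dropping one before the logit model")
else:
    print("no significant association detected")
```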
