Indicators of Multicollinearity
As always, the immediate audience for this article is myself. There is no shortage of information on this topic, but I believe reinventing the wheel is still a good learning experience and improves understanding for the time invested.
Goal: Effects and Identification of Multicollinearity.
Strong correlation among independent variables is called multicollinearity.
Strong correlations among the predictor variables can cause problems in multiple regression analysis because they make it difficult to identify the unique relationship between each predictor variable and the dependent variable.
For example -
House prices in a resort area, when regressed on each variable separately, have negative coefficients (-0.53 and -0.63) for the variables Miles to Resort and Miles to Base, which is understandable. However, these two variables are strongly correlated with each other (0.948). This makes it difficult to see each independent variable's relative importance in explaining the variance in the dependent variable.
The regression models for Miles to Resort and Miles to Base, each fitted without regard to the other, are highly significant with p < 0.001.
However, a combined regression (combining the influence of both variables) does not give us an R-Square of 0.60; instead it gives us an R-Square of 0.43.
In the combined regression, the coefficient of Miles to Resort becomes positive. How can there be a positive slope in this case, indicating that price increases with distance from the resort? In addition, this variable becomes statistically insignificant. Its VIF (variance inflation factor) is 9.875; as a rule of thumb, a VIF greater than 5 often indicates a collinearity problem.
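As a quick sanity check on that VIF: the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors, that R² is just the squared correlation between them. The sketch below plugs in the 0.948 correlation quoted earlier; the function name is my own, not something from JMP.

```python
def vif_two_predictors(r: float) -> float:
    """VIF when there are exactly two predictors with correlation r.

    General definition: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared
    from regressing predictor j on all the other predictors. With two
    predictors, R_j^2 reduces to r^2.
    """
    return 1.0 / (1.0 - r ** 2)


r = 0.948  # correlation between Miles to Resort and Miles to Base (from the article)
vif = vif_two_predictors(r)
print(f"VIF = {vif:.2f}")  # about 9.87
```

This lands very close to the 9.875 reported above; the small gap is presumably because the correlation is rounded to three decimals.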
Hence, when creating a multiple regression model, identify multicollinearity first; eliminating one of the collinear variables allows the other to remain statistically significant.
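To convince myself of this, here is a minimal sketch on made-up data (not the JMP data set used above). I assume price really depends only on Miles to Base, and that Miles to Resort is just Miles to Base plus some noise. All numbers, variable names and the ols helper are my own assumptions for illustration.

```python
import numpy as np


def ols(y, *predictors):
    """Least-squares fit with an intercept; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof               # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the coefficients
    return beta, np.sqrt(np.diag(cov))


# Synthetic data (assumption): two distance variables that move together.
rng = np.random.default_rng(0)
n = 200
miles_to_base = rng.uniform(1, 30, n)
miles_to_resort = miles_to_base + rng.normal(0, 3, n)
price = 100 - 0.6 * miles_to_base + rng.normal(0, 5, n)

r = np.corrcoef(miles_to_resort, miles_to_base)[0, 1]

# Each predictor on its own, then both together.
beta_resort, se_resort = ols(price, miles_to_resort)
beta_base, se_base = ols(price, miles_to_base)
beta_both, se_both = ols(price, miles_to_resort, miles_to_base)

print(f"correlation between predictors: {r:.3f}")
print(f"resort alone : slope {beta_resort[1]:+.3f} (SE {se_resort[1]:.3f})")
print(f"base alone   : slope {beta_base[1]:+.3f} (SE {se_base[1]:.3f})")
print(f"combined     : resort {beta_both[1]:+.3f} (SE {se_both[1]:.3f}), "
      f"base {beta_both[2]:+.3f} (SE {se_both[2]:.3f})")
```

On their own, both slopes should come out clearly negative. In the combined fit, the Miles to Resort coefficient shrinks toward zero with a much larger standard error, so its sign can land on either side of zero depending on the sample. Dropping it takes us back to the "base alone" fit, where the remaining slope is estimated precisely.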
Data Source and Analytic Tool: JMP
I'm guessing Miles to Base has more explanatory power 🤔🤔🤔