Uses of Partial Correlation
There are many ways to arrive at the simplest predictive model. A common and easy approach is to remove all non-significant predictors (those with high p-values) from the full model while watching for an increasing adjusted R-square and a decreasing RMSE (Root Mean Square Error). However, a problem arises when the significance of one predictor depends on the other predictors in the model. Highly correlated variables make it difficult to determine which predictor should be removed. Shedding light on this topic is the goal of this article. As always, the immediate audience is me.
This is where, in our story, we call upon the powers of our protagonist, the Partial Correlation. Partial Correlation is a measure of the association between two continuous variables while controlling for the effect of one or more covariates.
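For a single covariate, the definition above reduces to the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The following is a minimal sketch of that formula in plain Python. The function names and the numbers are made up for illustration and have nothing to do with the housing data in the exhibits.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for a single covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up numbers: x and y both follow z, plus opposite-signed noise.
z = [1, 2, 3, 4, 5, 6]
noise = [1, -1, 1, -1, 1, -1]
x = [zi + e for zi, e in zip(z, noise)]
y = [zi - e for zi, e in zip(z, noise)]

raw = pearson(x, y)          # positive, because both variables track z
adj = partial_corr(x, y, z)  # negative once z is controlled for
print(round(raw, 3), round(adj, 3))
```

In this toy data the raw correlation between x and y is positive only because both follow z. Once z is controlled for, the association flips sign, which is exactly the kind of effect that a plain correlation table hides.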
For Example -
Exhibit 1 -
Partial Correlation allows us to see the correlation between each predictor and the response after adjusting for the other predictors. Here, for each predictor, we are controlling for the effect of all the other predictors in the model. We see that Baths, Square Feet, Miles to Resort, and Acres have a higher partial correlation with Price (Exhibit 1 - Partial Corr), while DoM, Beds, Cars, and Years Old have a lower partial correlation with the response. We also see that DoM has the highest p-value (0.9249), as shown in Exhibit 2.
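Outside JMP, one way to get the same kind of table is the residual method: regress a predictor on all the other predictors, regress the response on those same predictors, and correlate the two sets of residuals. The sketch below assumes NumPy is available. The function name, the variable names, and the synthetic data are all hypothetical stand-ins and not the housing data behind the exhibits.

```python
import numpy as np

def partial_corr_with_response(X, y):
    """Partial correlation of each column of X with y, adjusting for the other columns."""
    n, p = X.shape
    result = np.empty(p)
    for j in range(p):
        # Design matrix of the *other* predictors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        # Residuals of predictor j and of the response after regressing on the others.
        res_x = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        res_y = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
        result[j] = np.corrcoef(res_x, res_y)[0, 1]
    return result

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
sqft = rng.normal(size=200)
baths = 0.8 * sqft + 0.6 * rng.normal(size=200)   # correlated with sqft
unrelated = rng.normal(size=200)                  # has no effect on price
price = 3.0 * sqft + 2.0 * baths + rng.normal(size=200)

X = np.column_stack([sqft, baths, unrelated])
pc = partial_corr_with_response(X, price)
for name, value in zip(["sqft", "baths", "unrelated"], pc):
    print(f"{name:10s} {value: .3f}")
```

Predictors that genuinely drive the response keep a large partial correlation even when they are correlated with each other, while the unrelated column stays near zero.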
Exhibit 2 -
Let's begin by removing the DoM variable from our model because of its high p-value and low partial correlation with Price.
Exhibit 3 -
As we can see, upon removing DoM from our model, the RMSE drops from 64.52 to 63.65 (Exhibit 3).
We will now remove the variable with the next-highest p-value, Beds (p-value 0.5622), followed by Cars and Years Old. It's important to note that non-significant variables need to be removed one at a time, because each removal affects the significance of the remaining variables. After the non-significant variables are removed, the p-value of Square Feet falls to 0.0240, a statistically significant value. We are now left with the variables that have the highest partial correlation with Price: Baths, Square Feet, Miles to Resort, and Acres.
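The one-at-a-time procedure can be sketched as a loop: fit the model, find the predictor with the highest p-value, drop it if it is above a threshold, and refit. The code below is an illustrative sketch that assumes NumPy and SciPy are available. It is not how JMP does it internally; the function names, the 0.05 threshold, and the synthetic data are assumptions. RMSE here is the degrees-of-freedom-adjusted version, sqrt(SSE / (n - p - 1)), which is why it can fall when a useless predictor is dropped.

```python
import numpy as np
from scipy import stats

def ols_pvalues_and_rmse(X, y):
    """Fit OLS with an intercept; return per-predictor p-values and RMSE."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                      # error degrees of freedom
    mse = resid @ resid / dof
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), dof)
    return pvals[1:], np.sqrt(mse)            # drop the intercept's p-value

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor, refit, and repeat."""
    names = list(names)
    while names:
        pvals, rmse = ols_pvalues_and_rmse(X, y)
        worst = int(np.argmax(pvals))
        print(f"RMSE={rmse:.3f}  kept={names}")
        if pvals[worst] <= alpha:
            break                             # everything left is significant
        print(f"  removing {names[worst]} (p={pvals[worst]:.4f})")
        X = np.delete(X, worst, axis=1)       # remove exactly one column per pass
        del names[worst]
    return names

# Synthetic stand-in data: a and b matter, c and d do not.
rng = np.random.default_rng(1)
n = 150
a, b, c, d = (rng.normal(size=n) for _ in range(4))
y = 2.0 * a - 1.5 * b + rng.normal(size=n)

kept = backward_eliminate(np.column_stack([a, b, c, d]), y, ["a", "b", "c", "d"])
print("final model:", kept)
```

Refitting inside the loop is the important design choice: p-values are recomputed after every removal, so a variable that looked non-significant next to a correlated neighbor gets a fresh chance once that neighbor is gone.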
This approach to removing variables gives structure and reasoning to the process of simplifying the model. It also gives better visibility into how to modify a model when the significance of one predictor depends on the other predictors in the model.
Analytics tool and Data Source: JMP