Multiple Regression in R: Multiple Variables, Interactions, Graphing, and Assessment

(Models image taken from http://www.model-jobs.net/casting-call-audition/ethnic-models-wanted/)

"If we knew what we were doing, it would not be called research, would it?"

- Albert Einstein

Regression is one of my favorite statistical tests. Sure, I love a good chi-square, ANOVA, or t-test just as much as the next researcher, but regressions are my friends. I have run so many regressions during my past four and a half years in graduate school that I occasionally even dream about them. What I most love about running regressions in R is that running regressions with interactions is so easy.

This post focuses on regressions with more than one predictor. My post on bootstrapping (which you can view here: 1st Post) included instructions on running regressions with a single predictor, as well as the code for the regression checks. As that post was intended for those who have never used R before, it also included instructions on installing R and importing or adding your own data for analysis. As this post builds on that knowledge, I recommend you read it over if you have not already done so.

Regression Recap

For those who need a quick recap on regression, the next few paragraphs are for you. A linear regression is a statistical test that is suitable for testing whether one continuous variable—such as age—can predict the outcome of another—such as IQ. In linear regression we examine the relationship between our hypothesized predictive variables (and possibly control variables—covariates) and our outcome. The assumption is that the relationship between these variables is a straight line.

The regression determines the best-fitting line: the one that minimizes the difference between the actual outcome's values—the y values—and where we would predict those y values to be based on the line's formula (technically, the line minimizes the squared distances between the actual y values and the predicted ones, but that is close enough for the purposes of this post).

The regression tests whether the slope of that line is significantly different from 0. If it is not, there is no relationship. For example, if I poll 10 people of different ages asking them how much they would enjoy working an all-nighter on a scale of 1 (Not at all) to 10 (Very much), I might get the following results:

Age: [22, 25, 34, 40, 45, 46, 54, 56, 61, 65]

and

All-nighter love: [1, 2, 3, 1, 1, 1, 1, 1, 1, 1]

If we plot age on the x-axis and love on the y-axis, we might not get a perfectly straight line (not everyone chose 1), but it would probably be close enough that the slope would not differ from zero.
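If you wanted to check this yourself, a quick sketch in R would look like the following (using the single-predictor syntax covered in the bootstrapping post; "love" is just an illustrative variable name):

age <- c(22, 25, 34, 40, 45, 46, 54, 56, 61, 65)

love <- c(1, 2, 3, 1, 1, 1, 1, 1, 1, 1)

summary(lm(love ~ age)) # the slope for age should not differ significantly from 0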

If the line has a slope that significantly differs from 0, then the outcome can be predicted from the predictor variable, which is also known as an independent variable. The slope (a.k.a. beta) quantifies the relationship between the two.

Simple Regression Example

To illustrate all this, and provide a quick refresher of a linear regression with one predictor, let’s imagine we have 16 participants who rate how fun our game is, and we want to see whether younger participants rate it as more fun.

First, we input the data by creating two variables, “fun” and “age” (remember that in R capitalization matters!), and bind them into a dataframe:

fun <- c(7,3,6,5,10,6,5,7,4,6,6,9,2,5,7,7)

age <- c(18,54,25,43,20,23,18,25,34,29,31,23,41,37,22,18)

df <- data.frame(age, fun)

Next, we run the regression with the “lm” command, saving the regression with the name “reg1”. The predicted variable goes to the left of the tilde, and the predictor to the right of it. We end by asking R to output the information stored in “reg1” with the summary command (all these steps are described in great detail in the bootstrapping blog post):

reg1 <- lm(fun ~ age, data = df)

summary(reg1)

R provides the following output:

To help interpret these results, here is a scatterplot with the regression line graphed on it:

Underlined in yellow is the beta value, -0.13. The first thing to note about this value is that it is negative. This means that as the value of x—age—increases, y—fun ratings—decreases. This is evident from the graph as well.

The second thing to note is the value itself. The beta value of -0.13 means that every time the value of age increases by 1, the predicted value of fun decreases by 0.13. As delineated above, if the value were positive, an increase in age would predict an increase in fun.

The t-value associated with this beta is underlined in blue, while the t’s p-value is underlined in green. As p = .001, in a world where there is no relationship between age and fun, we would expect results at least this extreme only about once in every 1000 runs of this study. Since this is so unlikely, we would conclude that these results are statistically significant.
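If you would like to recreate a similar scatterplot yourself, here is a minimal sketch using base R graphics (the axis labels are just suggestions):

plot(df$age, df$fun, xlab = "Age", ylab = "Fun rating") # scatterplot of the raw data

abline(reg1) # overlay the fitted regression line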

Causation?

One important thing to remember is that just because two continuous variables produce a statistically significant p-value when put into a regression equation, it does not mean that one causes the other. It is possible that one may be causing the other. Alternatively, the two may just be related—and you may only be able to predict one from the other because the two happened to co-occur. For example, there is a significant relationship between the number of pirates and global warming (this is a fact: https://goo.gl/QzvruV and https://goo.gl/ekBiXE). However, becoming a pirate probably is not the solution to global warming.

Multiple Regression

Up until now we have only looked at very simple regressions. Sometimes we have a few predictors, and we would like to see which of them predict the outcome. This is called multiple regression. To run a multiple regression in R, simply add more predictor variables to the right of the tilde in the “lm” model, connecting the variables with a “+” sign.
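In schematic form, with purely illustrative names, that looks like this:

reg <- lm(outcome ~ predictor1 + predictor2 + predictor3, data = df)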

To further clarify the issue, let’s create a dataset to play with. As in the bootstrapping post, we will first create our variables, and then bind them into a dataframe. We will create three predictors with made up data—age, karaoke (how much someone loves karaoke), and gender (male = 0)—as well as an outcome variable—socialness (how social a person is):

age <- c(27, 12, 13, 14, 15, 26, 17, 18, 9, 20, 21, 32, 23, 44, 25, 26, 27, 38, 29, 50)

karaoke <- c(77, 32, 17, 17, 62, 21, 33, 82, 17, 14, 17, 75, 74, 56, 55, 68, 37, 89, 37, 63)

gender <- c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1)

socialness <- c(22, 16, 23, 21, 23, 27, 21, 22, 22, 28, 28, 27, 27, 27, 25, 26, 27, 32, 33, 29)

df <- data.frame(socialness,age,karaoke,gender)

Now that we have our data, we create a linear model and call it “reg1”, and then ask for its output with the “summary” command:

reg1 <- lm(socialness ~ age + karaoke + gender, data = df)

summary(reg1)

We can see in the output that each predictor is listed with its own beta, t-value, and p-value. Both age and gender are significant at the p < .001 level. As the beta for gender is negative, and men were coded as 0, higher values of gender (women, coded as 1) predict lower socialness scores: in this made-up data, women are less social and men are more. As the beta for age is positive, being older predicts being more social. The p-value associated with karaoke is non-significant at .52.
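As a side note, if you ever want just the coefficient table from the output, you can extract it directly (a small convenience, not required for anything below):

coef(summary(reg1)) # matrix of estimates, standard errors, t-values, and p-values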

Before quickly running through the regression checks that are described in the bootstrapping post, we will extract and save the residuals from the regression model and order the dataset. However, unlike the example in the bootstrapping post, which included a single predictor, in a multiple regression we order the data by the fitted values. Therefore, just as we extract and save the residuals from the model, we extract and save the fitted values, and then order the dataframe according to those. As I previously described, I prefer doing this with a copy of my dataframe.

First, I will create a duplicate of the dataframe. I will then extract the residuals (which I will call “resid”) and fitted values (which I will call “datanewvar”), and save them. Finally, I will order the duplicate dataframe according to the fitted values (for more information on these steps please see the previous post):

df1 <- df # work on a copy of the dataframe

df1$resid <- reg1$residuals # save the residuals

df1$datanewvar <- reg1$fitted.values # save the fitted values

df1 <- df1[order(df1$datanewvar), ] # order the copy by the fitted values

We can now assess normality with the “qqPlot” command (remember to load the “car” library first). Next, we will assess autocorrelation/independence with a Durbin-Watson test and “acf”, and heteroskedasticity (non-constant variance in the residuals) with a Breusch-Pagan test (requiring the “lmtest” library):

library(car) # provides qqPlot and durbinWatsonTest

qqPlot(reg1) # normality of the residuals

durbinWatsonTest(reg1) # autocorrelation/independence

acf(reg1$residuals) # autocorrelation of the residuals at further lags

library(lmtest) # provides bptest

bptest(reg1) # Breusch-Pagan test for heteroskedasticity

Since I describe assessing the checks in detail in the previous post, I will skip over them here for brevity’s sake. However, if you run them, you will see that the model passes all the tests.

Interactions

The final type of regression I will address in this post is a regression with an interaction in it. An interaction occurs when the combined effect of two (or more) variables on the outcome is more than the simple additive or subtractive effect of each of them separately. For example, my life can be pretty stressful. On a day like today, I would say that on a 1 to 10-point scale, my mellowness is at a 2. Another fun fact about me is that I really like both ice cream and gummies (A LOT). Give me either of those, and my mellowness increases (albeit temporarily) to about a 7. Give me both of them at once, and my mellowness is at a 9.

In this example we have four possible conditions: no ice cream, no gummies; no ice cream, yes gummies; yes ice cream, no gummies; and yes ice cream, yes gummies. Since either ice cream or gummies can increase my mellowness, we would expect more mellowness when I have both, and less when I have neither. If the effects were purely additive, each treat would contribute the same boost regardless of the other: going from neither treat (2) to one treat (7) adds 5 points, so adding the second treat should bring me to 12. Instead, it only brings me to 9. Because the effect of one treat depends on whether I already have the other, we have an interaction.

For our interaction data, we will examine whether the amounts of ice cream (measured in cups) and gummies that people had over the course of a month interact to predict how mellow they are. We will also control for whether they had sprinkles, as that can make all the difference:

gummies <- c(11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)

icecream <- c(77, 32, 17, 17, 62, 21, 33, 82, 17, 14, 17, 75, 74, 56, 55, 68, 37, 89, 37, 63)

sprinkles <- c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1)

mellow <- c(22, 16, 23, 21, 23, 27, 21, 22, 22, 28, 28, 27, 27, 27, 25, 26, 27, 32, 33, 29)

df <- data.frame(mellow,gummies,icecream,sprinkles)

To test an interaction, we need more than just the cross-product of the two variables (the variables multiplied together) in the model as a predictor. An interaction means that the cross-product term predicts the outcome beyond each of the variables on its own. Therefore, we must also put each of the variables in the model on its own. These are referred to as the main effects.

To enter a cross-product term into a linear model in R, simply connect the variables with a “:” instead of a “+”. To add a control—a covariate—add the variable to the model, connecting it with a “+”. Therefore, if we wanted to compute a model that contained the interaction of ice cream and gummies while controlling for sprinkles, and retrieve the results, we would input the following:

reg1 <- lm(mellow ~ gummies : icecream + gummies + icecream + sprinkles, data = df)

summary(reg1)

We could abbreviate this model by replacing the “:” with a “*”. Using a “*” in a linear model instructs R to add the main effects to the model automatically. This might not save a lot of time when you have a 2-way interaction with only two variables. However, for a 3-way interaction you would need to enter all the 2-way interactions as well when computing the model, and by having R do it for you automatically, you can save yourself a small headache.

Therefore, an equivalent model to the one above would be:

reg1 <- lm(mellow ~ gummies * icecream + sprinkles, data = df)

summary(reg1)

If you run them both you should obtain the same exact result both times.
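As an aside, the “*” shorthand expands the same way for higher-order interactions. A hypothetical 3-way model (with purely illustrative variable names) written as

reg2 <- lm(y ~ a * b * c, data = df)

is equivalent to spelling out every main effect and lower-order interaction by hand:

reg2 <- lm(y ~ a + b + c + a:b + a:c + b:c + a:b:c, data = df)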

Studying the output of this model, we see that the interaction of gummies and ice cream is significant at p = .02 (about .018). Quickly run the regression checks, and confirm that the model passes them all (it does).

You might be tempted to interpret the main effects with the interaction term in the model. Resist the temptation and DO NOT DO IT! If you would like to interpret the significance of the main effects, you need to rerun the model without the interaction. Remember: if a variable is part of an interaction, do not consider its main effect unless you re-run the model where it is not part of an interaction.
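For our example, that follow-up model would simply drop the interaction term (“reg_main” is just an illustrative name):

reg_main <- lm(mellow ~ gummies + icecream + sprinkles, data = df)

summary(reg_main)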

Graphing an Interaction

To understand the nature of the interaction, you need to graph it. When graphing a 2-way interaction in a linear model, I like to use the visreg package. Let’s start by installing and loading the package:

install.packages("visreg")

library(visreg)

The beauty of visreg is that it graphs your continuous 2-way interaction in a single line of code. Even more impressive, it throws in confidence intervals for fun. The command to create the graph is “visreg”. Much like the regression code, the details of the command go in the parentheses.

The first spot is the model name. Following the model, the two variables that make up the interaction are listed. The first predictor listed (in quotes) will be placed on the x-axis. The second variable, which is listed following a “by=” switch (also in quotes), will be graphed as the lines in the graph. To determine the levels of the second variable to graph, insert the values you are interested in into the “breaks” switch.

In psychology, we frequently plot one standard deviation above and below the mean. These values can be obtained by asking R for the variable’s mean (after specifying the dataframe it is located in), and then adding one SD to the mean and subtracting one SD from it. Here, we will graph ice cream on the x-axis, and gummies as the lines. Therefore, we compute the break values for gummies:

mean(df$gummies) + sd(df$gummies)

mean(df$gummies) - sd(df$gummies)
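If you would rather not copy the resulting numbers by hand, you can save them first (a small convenience; “low” and “high” are just illustrative names) and later pass breaks=c(low, high) to visreg:

low <- mean(df$gummies) - sd(df$gummies)

high <- mean(df$gummies) + sd(df$gummies)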

The “overlay” switch should be set to TRUE, and the “partial” switch should be set to FALSE. Together, the plotting code will appear as follows:

visreg(reg1, "icecream", by="gummies", breaks=c(14.58392,26.41608), overlay=TRUE, partial=FALSE)

This code produces the following graph:

To interpret the graph, let’s begin by focusing on what happens when someone has relatively little ice cream—the left end of the graph. Here, there is a large discrepancy between the people who have a small number of gummies (represented by the red line), and those with a large amount (the blue line). When ice cream amounts are low, and gummies amounts are low, mellowness is very low—much like me! Conversely, when ice cream is low and gummies are high, mellowness is pretty high.

Focusing on the right-hand part of the graph—where people had a lot of ice cream—we see a different pattern of results. Here, gummies matter much less: regardless of gummies amounts, mellowness is pretty high.

Taken together, these results tell me that if you have only a little bit of ice cream, having gummies can make all the difference for mellowness. On the other hand, if you have a lot of ice cream, gummies will not make that much of a difference. Having a different pattern of results from one side of the graph to the other is the meaning of an interaction. It means that the effect of one variable on the outcome depends on the level of the second variable.

Evaluating Models

The last topic I would like to touch upon in this blog post is alternatives to null hypothesis testing (a.k.a. the p-value) of predictor coefficients for evaluating their importance in the model. To illustrate the possibilities, we will focus on the first multiple regression model we constructed, where we tried to predict socialness:

age <- c(27, 12, 13, 14, 15, 26, 17, 18, 9, 20, 21, 32, 23, 44, 25, 26, 27, 38, 29, 50)

karaoke <- c(77, 32, 17, 17, 62, 21, 33, 82, 17, 14, 17, 75, 74, 56, 55, 68, 37, 89, 37, 63)

gender <- c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1)

socialness <- c(22, 16, 23, 21, 23, 27, 21, 22, 22, 28, 28, 27, 27, 27, 25, 26, 27, 32, 33, 29)

df <- data.frame(socialness,age,karaoke,gender)

reg1 <- lm(socialness ~ age + karaoke + gender, data = df)

summary(reg1)

Let’s focus on the gender variable. One way to assess its importance as a predictor is to examine its p-value, which at < .001 is highly significant. Another way to decide whether to keep it in the model as a predictor is to examine the Akaike Information Criterion, also known as the AIC.

The AIC does not evaluate significance thresholds. Rather, it balances how well the model fits the data against how many parameters it uses: the formula is AIC = 2k - 2*ln(L), where k is the number of estimated parameters and L is the maximized likelihood, so lower values are better. To investigate the value of a variable in the model, run the model with the variable in question, and calculate the full model’s AIC with the “AIC” command. Then, run the model without the variable, and calculate that model’s AIC. Whichever model has the lower AIC score is the better model:

reg_full <- lm(socialness ~ age + karaoke + gender, data = df)

AIC(reg_full)

reg_part <- lm(socialness ~ age + karaoke, data = df)

AIC(reg_part)

The AIC for the “reg_full” model is 99.61487, while the AIC for the “reg_part” model is 106.8578. Since the AIC for the full model is lower than the partial one, the additional variable in the model—gender—appears to be important. 

When the number of predictors in a model is roughly more than 30% of the sample size, a corrected version of the AIC—the AICc—should be employed instead. Here, our sample size is 20 and we only have 3 predictors, so there is no need to use the AICc. However, if we did need to use it, we could do so easily by installing and loading the MuMIn library and then employing the “AICc” command instead of the “AIC” one.
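For completeness, that would look something like this (a minimal sketch; note the capitalization of the package name):

install.packages("MuMIn")

library(MuMIn)

AICc(reg_full) # interpreted the same way as AIC: lower is better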

An additional way to investigate the value of the variable is to compare the full and partial models directly. This comparison produces a p-value describing whether the two models significantly differ from each other. If the value is significant, then the additional variable in the full model is an important addition. The comparison uses the “anova” command (for linear models, this performs an F-test on the nested models, which plays the same role as a likelihood ratio test). After typing “anova”, put the full model, a comma, and the partial model in the parentheses:

anova(reg_full, reg_part)
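As an aside, if you want a literal likelihood ratio test rather than the F-test, the “lmtest” library we loaded earlier for the Breusch-Pagan test provides one; it should lead to the same conclusion here:

lrtest(reg_full, reg_part)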

With a p-value of .007, the “anova” comparison agrees with all the other approaches, and suggests that gender is an important part of this model. Sometimes, some of these measures point in one direction more strongly than the others. At the end of the day, your decisions should be data-guided, and not mindlessly data-driven. To some extent, statistics is an art form driven by creativity and good sense.
