Bootstrapping (and simple regression) in R | Part Two: How do I bootstrap in R?
“Essentially, all models are wrong, but some are useful”
- George Box
We now know why bootstrapping is awesome (and if you would like to read further about it, I recommend the first few pages of this article), but the question of how to perform a bootstrap remains. Although there are packages that let you perform a bootstrap with minimal code, in this blog I focus on bootstrapping the longer way.
One reason I prefer the method that requires writing your own function, and inputting more information, is that viewing and manipulating the components of bootstrapping deepens an understanding of the process (learning by doing is always best). The second reason is more pragmatic: if you know how to bootstrap manually, you can do so with almost any model or metric. However, if you rely on a more restricted package to bootstrap, you are limited by what it allows you to do. You might want to bootstrap a median, a robust regression, or a moderated mediation, and for that you may have to understand how to bootstrap in a more manual way. Knowledge is power, and all that power makes you a numbers bada$$! Go you!
Since this post is intended for those who may have no experience with R, I begin by explaining how to install the program and how to enter (or import) your data, before moving on to explaining what a function is, and how to bootstrap.
To download R, download the installer on this page: http://cran.stat.ucla.edu/ , and then run it and follow the instructions. After installing R, I recommend you install RStudio as well. RStudio is a free program that runs R with a better interface. To install RStudio, download the installer here: https://www.rstudio.com/products/rstudio/ , and then run it and follow its instructions. You are now halfway to becoming an R ninja!
RStudio looks like this:
You can write your code either in the top left panel or in the bottom left panel. Additionally, the bottom left panel is where non-graphic outputs appear (graphics will appear on the bottom right side). Once a command is typed into the bottom left panel and executed, the output (if any) will appear below it. Unlike code written in the top left panel, which is saved a lot like a document (allowing easy modifications), what you type in the bottom left panel is mostly lost after running it. In short, let's stick to writing code in the top left panel.
Importing data
Now that R is installed, we need some data to bootstrap. In R, you can enter your own data, or import it. To import data stored in a commonly used CSV type file, two lines of code must be entered; the first sets the location of your file, and the second imports the file into R.
Setting file location is done with the “setwd” command, with the location path inside quotation marks in the ():
setwd("/Users/leahafredman/stats/Regression Stats")
Importing the CSV file is done similarly with the “read.csv” command. However, to work with the data, we assign the dataset a name, in this case “df”, with a “<-” pointing towards the name from the command importing the data. The file name is surrounded by quotation marks:
df <- read.csv("datafile_age_fun.csv")
To execute the code, situate your blinking cursor on the line you would like to execute, and click the “run” button in the top right corner of that pane.
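If you want to double check that both lines did what you expected, two quick base R commands (an extra sanity check of my own, which you can run once in the console rather than adding to your script) confirm the working directory and the size of the imported data:

getwd()   # prints the current working directory, so you can confirm setwd() took effect
dim(df)   # prints the number of rows and columns that were read in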
Entering data
To manually enter data, you need to first choose a variable name. In this example, I will pretend I examined how much fun 16 people thought my new game “Jury Giraffe” was, and whether their rating could be predicted from their age. Thus, I will label my two variables “fun” and “age”. Note that capitalization matters, and “fun” is not the same as “Fun”.
In the top left pane, I type the following:
fun <- c(7,3,6,5,10,6,5,10,1,6,6,9,2,9,7,4)
age <- c(18,54,25,43,20,23,18,25,34,29,31,23,41,24,22,18)
You can think of the small "c" (R's combine function) as creating a column with all those numbers in it. We assign each column of numbers to a variable name utilizing a "<-". To confirm that your variables were saved, type the variable names, and examine the output.
Code:
fun
age
Output:
> fun
[1] 7 3 6 5 10 6 5 10 1 6 6 9 2 9 7 4
> age
[1] 18 54 25 43 20 23 18 25 34 29 31 23 41 24 22 18
Next, bind the vectors into a dataframe utilizing the data.frame command with the following code:
df <- data.frame(fun, age)
The name you wish to assign the dataframe goes on the left, and is followed by the assignment arrow, and the command. In this case, we utilized the data.frame command to bind the two variables inside the () into a single dataframe. To examine your dataframe type:
View(df)
Note that the “View” command is with an uppercase “V”, and the dataframe name goes in the (). Executing this code results in opening a new window with the dataframe. Simply close it to return to your coding frame.
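If you would rather stay in the console instead of opening a new window, two base R commands show the same information there (these are my own side suggestion, not needed for anything that follows):

head(df)   # prints the first six rows of the dataframe in the console
str(df)    # lists each variable along with its type and first few values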
Creating a regression model
Now that the two variables are created, let’s run an ordinary least squares (OLS) regression with them. Here, too, you must assign your model a name to extract information from it later. Let’s name the model “reg1”:
reg1 <- lm(fun ~ age, data = df)
The regression is run utilizing the "linear model" command, "lm", with the model variables placed in the () that follow the command. The predicted variable goes on the left, followed by a tilde sign and the independent variables, then a comma, and "data = " with the name of the dataframe the variables are stored in.
To examine the output utilize the “summary” command, typing the model name in the ():
summary(reg1)
The coefficient for the age predictor is -0.14500, with a t-value of -2.517, which corresponds to a p-value of .0247. The F-statistic is as follows: 6.334 on 1 and 14 DF, p-value: 0.02466
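As a point of comparison for the bootstrapped confidence intervals we will compute later, base R can also report the usual parametric confidence intervals for this model. This line is an aside of my own; you can run it once in the console rather than adding it to your script:

confint(reg1, level = 0.95)   # parametric 95% confidence intervals for the intercept and the age slope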
An equivalent way to run the model would be to extract the variable from the dataframe, rather than listing the variable names, and then the dataframe in which they are located. To specify a variable in a specific dataframe, first type the dataframe name, then a dollar sign, followed by the variable name:
reg1A <- lm(df$fun ~ df$age)
If you then execute the code summary(reg1A), you should receive the same exact output as above.
The model is statistically significant at p = .02, and since the beta is negative, the older a person is, the less fun they thought the game was. Since this model has only one predictor, the predictor's p-value and the model's p-value are identical. This would likely not be the case if the model had more than one predictor.
A quick recap of null hypothesis statistical testing (NHST) for those of us who have not taken a statistics course recently, and do not quite remember what the p-value means. In NHST we examine the probability that we would have obtained the results that we did if the null hypothesis were true. That is, NHST looks at whether we should reject the null hypothesis. The null hypothesis is the counterpart of the hypothesis we are testing; typically, it states that there is no effect or no relationship.
In this regression example, the experimental hypothesis is that age can predict the fun score, and the null hypothesis is that age cannot predict the fun score. Therefore, a p-value of .02 means that if age really does not predict fun ratings, we would still get a result at least this strong, saying that age does predict fun ratings, 2 times out of 100. Another way to state that is to say that if we ran this study 100 times, and age really does not predict fun ratings, only 2 of those times would we get a result saying that age does predict fun ratings. That means that while it is possible that the relationship that we found between age and fun is not a real one, it is pretty unlikely that we would have found it if that were the case. NOTE that this does not mean that the null hypothesis is definitely false, because this could be one of those 2 in 100 cases. The only thing we can say is that it is unlikely that we would have gotten this result if the null were true, because this would happen only 2% of the time. Thus, because we are talking about probabilities, our finding supports our research hypothesis, but does NOT prove it is true.
The final block of code you created in RStudio should look like this:
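fun <- c(7,3,6,5,10,6,5,10,1,6,6,9,2,9,7,4)
age <- c(18,54,25,43,20,23,18,25,34,29,31,23,41,24,22,18)
df <- data.frame(fun, age)
View(df)
reg1 <- lm(fun ~ age, data = df)
summary(reg1)
reg1A <- lm(df$fun ~ df$age)
summary(reg1A)

If you imported your data with read.csv rather than typing it in, the first three lines would instead be the setwd and read.csv lines from the importing section.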
Assessing regression assumptions
Normality assumption
Next, let’s run regression checks. Without knowing whether our data meet the regression assumptions we cannot know whether our model may be inaccurate, and therefore a bad predictor. Some of the checks require that we save the residuals from the model. Similar to above where we utilized “age” and “fun” in a regression by listing the dataframe name with a dollar sign before the variable name, we can access different model-related objects by listing the model name with a dollar sign before the object name. Thus, to extract the residuals from our model we type “reg1$residuals”. We save the residuals to the dataframe by indicating the dataframe name, then a $, and the name we would like to save the residuals as:
df$resid <- reg1$residuals
As some checks require the data be ordered a certain way, and we only have a single predictor, we will order it by that predictor. I like keeping my original dataframe intact, copying it, and then ordering the copy. To copy the dataframe, select a new name such as "df1". Put the name on the left side and have an arrow pointing from the original dataframe name to the new one, much like creating anything in R.
Then, utilize the "order" command to order the dataframe by the predictor. Within the square brackets, what comes before the comma selects (and here, reorders) the rows, and what comes after the comma selects the columns. Here there is only a space to the right of the comma, because we want to keep all of the columns. You will also need to specify the dataframe the predictor is coming from:
df1 <- df
df1 <- df1[order(df1$age), ]
The first assumption is that the model's residuals (a.k.a. the error terms) are normally distributed. While there are many ways to perform most things in R, I like assessing normality with a QQ plot that provides confidence intervals. This QQ plot is part of the "car" library. A library is a package of compiled code and functions that we can use by typing in the appropriate commands.
To use a library in R you must first download the package for it, and then load the library (don't worry—we are talking about two short lines of code here). While the package only needs to be downloaded once, the library must be loaded every time you start R. To download a package utilize the "install.packages" command, putting the package name in quotation marks inside the ():
install.packages("car")
The "library" command loads the library, and does not require quotation marks inside the ():
library(car)
To generate the QQ plot, utilize the code “qqPlot”, putting the model name (“reg1”, in this example) in the ():
qqPlot(reg1)
The output appears in the bottom right pane. As you can see, we meet the normality assumption as all the points are within the confidence intervals:
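If you prefer a formal test to eyeballing the plot, base R's Shapiro-Wilk test on the saved residuals is a common companion check (an extra of my own, not required for the bootstrap):

shapiro.test(reg1$residuals)   # null hypothesis: the residuals are normally distributed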
Independence assumption (a.k.a. Autocorrelation)
The second assumption is the independence of residuals. Here, we assume that we cannot predict one residual from its neighbor; that is, that they are not correlated. We will utilize the Durbin-Watson test to examine independence. This test produces a statistic between 0 and 4. The closer the statistic is to 2, the more likely it is that the independence assumption is met. This test provides a p-value as well. Note that since the null hypothesis for this test is that independence exists, you want the test to come back non-significant; you want a large p-value, which would allow retaining the null hypothesis.
Since you are looking for a non-significant result, I would look for a p-value much larger than .05. You want to obtain a value that you are comfortable judging as not even trending toward significance. A value in the .2 area is a good rule of thumb, and I would be wary of values lower than .1. The specific Durbin-Watson test that we will use here bases its p-value on a bootstrapped simulation. This means that every time you run this test on the same model you will get a slightly different result. However, all values should be in the same neighborhood.
To run the test type the code:
durbinWatsonTest(reg1)
I ran the test 5 times, and received p-values between .106 and .142. Since I prefer to see p-values closer to .2, and running a second independence test is fast and easy, I will do so. We run the autocorrelation function with the command "acf" on the residuals that we saved earlier:
acf(df1$resid)
The acf produces a graph. We ignore the first line (lag 0, which is always exactly 1), and look to see whether all the other lines fall within the two horizontal dashed confidence bounds. In this graph we see that they do, which means that we meet the assumption of independence:
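A side note on the Durbin-Watson step above: because its p-value is simulated, you can ask the car package's durbinWatsonTest for more replications to stabilize it. The reps argument below is the command's option for this, and 5000 is simply a number I chose for illustration:

durbinWatsonTest(reg1, reps = 5000)   # more simulation replications gives a more stable p-value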
Homoskedasticity assumption (a.k.a. testing for heteroskedasticity)
Finally, we test whether the variance in residuals is constant utilizing a Breusch-Pagan test in the “lmtest” library. First, we install the package:
install.packages("lmtest")
Then, load the library with the "library" command (again, no quotation marks needed inside the ()):
library(lmtest)
The Breusch-Pagan test is run with the command "bptest" and the model name. Like with the Durbin-Watson test, you want to observe a large p-value, which allows you to retain the null hypothesis that the variance is constant:
bptest(reg1)
As the output includes a p-value of over .5, we can assume we meet this assumption as well.
If you would like to examine linearity, plot the residuals against the predicted values, and look to see whether the plot looks like a random splatter, or whether there is some sort of pattern (indicating non-linearity). Like the residuals, the predicted values are easily extracted from the regression model, and can be plotted with the plot function:
plot(df$resid ~ reg1$fitted.values)
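Note that we use df (the original, unsorted dataframe) here, so that each residual is paired with its own fitted value. If the splatter is hard to judge, adding a horizontal reference line at zero with base R's abline (an extra line of my own) can make any pattern easier to spot:

plot(df$resid ~ reg1$fitted.values)
abline(h = 0, lty = 2)   # dashed horizontal reference line at zero residual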
In a future post discussing regressions, I will explain how to run regressions with multiple predictors and interactions, and my favorite ways to graph the results. It is wickedly simple to plot interactions of continuous variables in R! However, in this post we are focusing on bootstrapping.
Bootstrapping (finally!)
In order to bootstrap in R we utilize the “boot” library. First, install the package. Following the installation, load the library:
install.packages("boot")
library(boot)
Bootstrapping begins with a function. In short, a function is a rule, or a mini program, describing how to manipulate input, and what output to return. Similar to the way we assign names to all the variables and models we want to utilize in R, we assign functions names as well. Here, I will use the name "bootfunc". Predictably, the code to create a function is "function" followed by a (). Inside the () I have placed two names: "dataset" and "random". These names are not anything we have previously created or imported, and you can change them to be anything you would like (I can see the creative gears turning in your head!). However, the order of the names is important. The first name is the placeholder for your dataframe, and the second is the placeholder for the indices (basically your data) that you want to sample. The only real implication of this is that if you choose to alter these names to anything other than "dataset" and "random", make sure you alter them in the other places they appear in the code as well:
bootfunc <- function(dataset, random)
The function also has a set of curly brackets. These begin right after the (), like so:
bootfunc <- function(dataset, random) {
Next, you define what parts of the dataframe you want to sample. You can sample by row or column. Since the “random” is listed in the left part of the square brackets, and there is nothing after the comma to the right of it, R will sample the dataframe by row. We store this information in a variable we are naming “d”. Like “dataset” and “random”, you can rename “d” to be whatever you wish. However, make sure that you consistently use the new name throughout the code:
d <- dataset[random, ]
After defining how you want the data to be sampled (these are the data you will run the model on), it is time to specify the model. Note how similar specifying the model here is to running the regression model above. However, rather than specifying the dataframe name after "data = ", we enter the name "d", which represents the sampled data (that we defined above). We then save the output of the regression with its own name:
reg <- lm(fun~age, data = d)
regsum <- summary(reg)
Finally, we save the statistic of interest. This is the statistic whose bootstrapped confidence interval we are interested in. In this example we are interested in the independent variable's coefficient. To obtain the coefficient we extract it from the summary, and save it. Since the model has only one predictor, and the first coefficient is the intercept, putting the number 2 in the square brackets extracts the coefficient of the predictor. Depending on how many predictors you have, and which one you are interested in extracting, the number entered into the square brackets will change. If you were interested in extracting a different statistic from the model, this is the line of code you would change. If you were interested in extracting both the coefficient of this variable and another statistic (or the coefficient of an additional, hypothetical predictor), add another line of code below this one specifying what you wish to extract, and assign it its own unique name (a sketch of both variations appears right after the function below):
funcoef <- regsum$coefficients[2]
We then save the statistic of interest (or statistics, depending on how many you are interested in bootstrapping) into a vector (very similar to the way we saved the variables "age" and "fun" above), and assign it the name "booted.paths" (although, as has been the theme, you can assign it a different name). Since we are booting a single statistic, this step is not strictly necessary, although it does no harm either. It is, however, necessary when you are interested in more than one statistic. If you keep this line, the following line of code has the vector name in the (). Otherwise, input the name of the statistic (in this example it would be "funcoef") into the (). We complete the function with code requesting the vector be returned, and a closing curly bracket:
booted.paths <- c(funcoef)
return(booted.paths)
}
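Two variations on the extraction step, sketched here rather than used in the rest of the post: the coefficient matrix returned by summary can also be indexed by row and column, which is less error prone when you have several predictors, and the returned vector can hold more than one statistic (below I add the model's R-squared purely as a hypothetical second statistic):

funcoef <- regsum$coefficients["age", "Estimate"]   # the same value, indexed by row and column name
funcoef <- regsum$coefficients[2, 1]                # the same value, row 2 (age), column 1 (Estimate)
rsq <- regsum$r.squared                             # a hypothetical second statistic of interest
booted.paths <- c(funcoef, rsq)                     # funcoef would have index 1, rsq index 2
return(booted.paths)

With two statistics in the vector, you would later call boot.ci twice, once with index = 1 and once with index = 2.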
After completing the function (which we named "bootfunc"), it is time to bootstrap utilizing the "boot" function. This includes specifying the dataframe name, the statistic we would like to bootstrap (which is what we wrote the function for, so here we supply the function name), as well as the number of bootstraps we would like to perform, as indicated by the argument R. I recommend playing with the number of bootstraps to see how many your computer can easily handle. As delineated in the previous post, you would ideally run as many bootstraps as it would take for the results to converge. However, as this is usually time consuming, you probably will not run that many. The 5000 bootstraps will be saved in "bootresults":
bootresults <- boot(data = df, statistic = bootfunc, R = 5000)
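For intuition about what boot is doing on each of those 5000 rounds, the following standalone sketch mimics a single replicate by hand. The sample call and the object names here are mine; boot generates the equivalent of "random" for you and passes it to bootfunc:

random <- sample(1:nrow(df), replace = TRUE)        # row numbers drawn with replacement
d <- df[random, ]                                   # one bootstrap resample of the data
summary(lm(fun ~ age, data = d))$coefficients[2]    # the age coefficient for this single resample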
To obtain the confidence intervals for the bootstrapped statistic, utilize the "boot.ci" function. In this function, first specify where you saved the bootstraps. Here we saved the bootstraps in "bootresults". The next part, specifying the index, is only strictly necessary if you are bootstrapping more than one statistic. In this example we only bootstrapped a single statistic, "funcoef", which is why the vector "booted.paths" only has a single statistic listed in it. Since it is the first statistic in the vector (and the only one at that), its index is 1. If the vector had more than one statistic, you would retrieve the confidence intervals for each of them by listing their index numbers, following the order they were listed in the vector. Next, list what confidence level you want. Here it is set to 95%, which means that we will cut off the top and bottom 2.5% of the distribution. Finally, list what type of confidence intervals you would like. The options are "norm", "basic", "stud" (i.e. studentized), "perc" (i.e. percentile), and "bca". The sixth option is "all", and will return all five confidence intervals. We will utilize the bca method (if you are interested in reading more about the confidence interval types I recommend this link: https://goo.gl/6xKvXX). After storing the confidence intervals in "funcoef", we then ask R to display what was stored in it:
funcoef <- boot.ci(bootresults, index = 1, conf = .95, type = "bca")
funcoef
To summarize, your code will appear as follows:
bootfunc <- function(dataset, random) {
d <- dataset[random, ]
reg <- lm(fun~age, data = d)
regsum <- summary(reg)
funcoef <- regsum$coefficients[2]
booted.paths <- c(funcoef)
return(booted.paths)
}
bootresults <- boot(data = df, statistic = bootfunc, R = 5000)
funcoef <- boot.ci(bootresults, index = 1, conf = .95, type = "bca")
funcoef
The following is the output I received. Your output will likely vary slightly from mine, as R is randomly sampling the data:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = bootresults, conf = 0.95, type = "bca", index = 1)
Intervals :
Level BCa
95% (-0.2973, -0.0527 )
Calculations and Intervals on Original Scale
To determine whether the statistic is significant, examine the confidence intervals, and see whether they include 0. Since they do not include 0 here, we can conclude that age predicts fun scores. Unsurprisingly, as our regression model met the assumptions, the significance findings from both methods agree.
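If you would like to look at the bootstrap distribution behind those intervals, the boot object stores all 5000 replicates. The two lines below are optional extras; plot is the boot package's built-in plot method for boot objects, and the replicates themselves live in bootresults$t:

plot(bootresults)            # histogram and QQ plot of the 5000 bootstrapped coefficients
hist(bootresults$t[, 1])     # the same distribution, plotted manually from the stored replicates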