Getting Started with Apache Spark in R

January 8, 2018

What is Spark?

If you have never heard of Spark, do not worry. Spark is a framework for working with large amounts of data. It is generally used on computing clusters to mine large datasets quickly and efficiently. The main programming languages for Spark are Scala and Java, but there are also APIs for Python and R. This tutorial will get you started with Spark so that you can become comfortable with a tool that is already used by many companies and will only become more common as data continues to grow.

The big advantage of Spark is that you do not have to load the data into your computer’s memory to perform your analysis, so you can mine huge datasets with minimal hardware. For this reason it is used by companies with massive amounts of data, often residing in a Hadoop distributed file system. Even without that massive infrastructure, Spark lets you perform analyses on your local machine that would otherwise be difficult and slow, if possible at all.

To get started with Spark you do not need a server cluster; all you need is a basic computer. The computer I am using as I write this is an HP Envy from 2011 with a quad-core 1.73 GHz processor and 8 GB of RAM. It is by no means a high-performance laptop, but it can still run Spark well enough to learn with, even on datasets more than twice the size of my available RAM.

Install sparklyr and Spark

This tutorial will walk through some basic uses of Apache Spark, using the sparklyr package developed by RStudio to make working with Spark from R easier. We will be using Spark version 2.0.0. The first step is to install the sparklyr package and use it to install Spark. This guide is not meant to be exhaustive, but it will get you started with the package. The DataCamp course for sparklyr is a very good resource for more in-depth information, and you can also download the sparklyr cheat sheet from RStudio here. The guide will cover importing data, reformatting it, preparing it for analysis, building some models, and comparing model fit.

There have recently been some changes to the sparklyr syntax, so this may look different if you have worked with sparklyr previously or look at other tutorials later. There are also many different types of systems out there with different specifications, and some of those differences will change how Spark operates on your machine. While writing this, all of the code ran and there was enough memory for Spark, but on your system that could be different. If you run into an 'out of memory' error it does not mean that Spark won't work, just that some settings need to be changed; configuration will be covered in a future post, but this is a small dataset so there should not be any problems with memory.
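
If you do hit memory limits, one knob you can turn is the configuration passed to spark_connect(). As a quick, hedged preview (connecting is covered below, and these values are examples rather than recommendations for your machine):

# sketch: pass a custom configuration when connecting to Spark
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"   # memory for the local driver JVM
config$spark.memory.fraction <- 0.8             # fraction of heap Spark may use
sc <- spark_connect(master = "local", config = config)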

# if you do not have the 'devtools' package install that first so that
# we can install sparklyr from Github to get the most current version
install.packages("devtools")

# once 'devtools' is installed we can install sparklyr from Github
devtools::install_github("rstudio/sparklyr")
library(sparklyr)

# install Spark from sparklyr
spark_install(version = "2.0.0")

# we are also going to need dplyr for working with sparklyr, readr for
# some transformation, and purrr for some functional programming so we
# will install the tidyverse to get these and other great packages
devtools::install_github("tidyverse/tidyverse")
library(tidyverse)

# one last package is caret, which we will be using for model comparison
devtools::install_github("topepo/caret")
library(caret)

Importing Data

In this example we will work with the HTRU2 pulsar candidate data available from the UCI Machine Learning Repository here. There you can find information about the data and descriptions of the column names.

# First we need to download the .zip file containing the dataset
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip", 
"/home/kevin/Downloads/htru2.zip")

# Unzip the file into your specified location
unzip(zipfile = "/home/kevin/Downloads/htru2.zip", 
      exdir = "/home/kevin/Downloads")

# Load in a few rows to check for a header row
DATA <- read_csv("/home/kevin/Downloads/HTRU_2.csv", n_max = 10)
# There is no header, so we will have to make note of that and add one later

# Connect to Spark
sc <- spark_connect(master = "local")


# Read the .csv file into Spark
DATA <- spark_read_csv(sc, "htru2", "/home/kevin/Downloads/HTRU_2.csv", 
  memory = FALSE, header = FALSE)
# Now that the CSV file is read in we could use it from here, but we are 
# going to convert it into parquet format for slimmer storage and 
# faster processing. The difference in processing time may not be 
# noticeable on a dataset this size, but with larger data the parquet 
# format is very beneficial.

# Let's see what the current column names are
colnames(DATA)

# Now we can rename the columns. These names are abbreviated versions 
# of what the columns contain; the descriptions are on the UCI page where
# the data is located: http://archive.ics.uci.edu/ml/datasets/HTRU2.
DATA <- DATA %>%
  rename("mean_profile" = V1, "sd_profile" = V2, "kurtosis_profile" = V3,
    "skewness_profile" = V4, "mean_curve" = V5, "sd_curve" = V6,
    "kurtosis_curve" = V7, "skewness_curve" = V8, "class" = V9)

# Here we will write the parquet file
spark_write_parquet(DATA, "/home/kevin/Downloads/htru2_parquet")

# We can remove the .csv table from the Spark environment now that we 
# are done with it
db_drop_table(sc, "htru2")

# Here we will read in the parquet file
DATA <- spark_read_parquet(sc, "htru2_p", 
   "/home/kevin/Downloads/htru2_parquet", memory = FALSE)

We have now successfully read in the parquet file and can begin to work with the data.
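
As a quick sanity check (just a sketch, nothing later depends on it), we can confirm the table is registered with Spark and get its dimensions without pulling any data into R:

# list the tables Spark knows about and check the size of ours
src_tbls(sc)    # should include "htru2_p"
sdf_nrow(DATA)  # row count, computed by Spark
sdf_ncol(DATA)  # column count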

Inspecting the Data

The first thing you should always do in an analysis is inspect your data. It is useful to have an idea of what you are working with in case you run into issues. In Spark this can be a little more complicated, because you cannot plot histograms of every column when they contain 10,000,000 records, but you still need to know whether something is out of place. The method used here is likely not the best way to go about this, but it is a start and will give you some information about your data before the analysis begins.
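
One approach that scales is to let Spark do the aggregation and only collect the small summaries into R. For example, a rough histogram of one column can be built by binning in Spark and collecting just the bin counts; this sketch uses an arbitrary bin width of 10 on the mean_profile column:

# bin a column in Spark and collect only the bin counts
DATA %>%
  mutate(profile_bin = floor(mean_profile / 10) * 10) %>%
  group_by(profile_bin) %>%
  summarise(count = n()) %>%
  arrange(profile_bin) %>%
  collect()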

# First, let's say that we are not sure what our target variable is 
# and would like to see the first few rows of the data. We can look at
# a specified number of rows using the print command.
DATA %>% print(n = 5, width = Inf) # 5 rows containing all columns

# The code above is equivalent to print(DATA, n = 5, 
# width = Inf), so you can see that the object before the pipe (%>%) 
# becomes the first argument of the function after the pipe.

# Next we will look at some summary statistics
# First we will look at the distribution of the response variable
DATA %>%
  select(class) %>%
  group_by(class) %>%
  summarise(count = n())
# We can see that there are far more of one class

# Now we will look at some summary statistics of each column
# first, create a vector of the summary statistics we would like to 
# calculate
summary_functions <- c("mean", "min", "max", "stddev", "skewness")

# next, initiate a matrix to hold the summary information
summary_stats <- matrix(NA, # we will initialize an NA matrix
  nrow = length(summary_functions), # as many rows as there are summary functions
  ncol = sdf_ncol(DATA), # as many columns as the data has columns
  dimnames = list(summary_functions, colnames(DATA)))

# then, map the summary functions to the columns of the data
stats_map <- map(summary_functions, 
    function(x) unlist(DATA %>% summarise_all(x) %>% collect()))
# finally, loop over the summary information and fill in the matrix
for (i in 1:length(stats_map)) {
  # replace each row of the summary statistics matrix with that statistic 
  # for each column
  summary_stats[i, ] <- stats_map[[i]]
}
# now we can view the information
round(summary_stats, 3) # rounded to 3 decimals for easier viewing


Now we have had a look at the data and can use the summary statistics table to look for anomalies in the dataset. When calculating the summary statistics you may notice that ‘stddev’ and ‘skewness’ are not functions found in base R or any of the packages we have loaded. The function names are being passed as character strings and translated into SQL by dplyr, so we have to use functions that are available in Spark SQL; a list of these functions can be found here.
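
Those same Spark SQL functions can also be called directly inside summarise(); dplyr passes along any function name it does not recognize, so the call below is translated into Spark's stddev and skewness. This is just a small illustrative sketch on one column:

# Spark SQL functions used directly in a dplyr pipeline
DATA %>%
  summarise(profile_mean = mean(mean_profile),
            profile_sd   = stddev(mean_profile),
            profile_skew = skewness(mean_profile)) %>%
  collect()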

Splitting the data

If you have not had a proper introduction to the basics of machine learning, that is okay; however, I would recommend getting on that. DataCamp, once again, has many great courses that can teach you about machine learning; I should mention that I am not affiliated with DataCamp, just a very satisfied user of their product. Another resource is the Stanford Lagunita Statistical Learning course taught by Trevor Hastie and Robert Tibshirani. If you prefer books, 'An Introduction to Statistical Learning' (ISL) is a good choice, and there is also a more in-depth version, 'The Elements of Statistical Learning' (ESL). ISL and ESL are written by the authors of the Stanford course, and ISL is the book taught in the course.

If you have learned about machine learning then you will know about splitting the data into training and test sets (they go by many names). If you haven't, the idea is to train your model on only part of the data so that you can judge its performance on the other part.

# here we will split the data into training and test sets
split_data <- DATA %>% 
   sdf_partition(training = 0.8, test = 0.2, seed = 865)

# the output of this code will be a list containing 2 connections to 
# randomly sampled parts of the data: 80% of the data in the training 
# sample and 20% in the test sample. These numbers are not a standard,
# just what I chose at the time. The seed is set for reproducibility
# so that the output will be the same each time the script is run.
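
If you want to reassure yourself that the split worked as expected, a quick sketch is to count the rows on each side (the counts are computed by Spark, so only two numbers come back to R):

# check the size of each partition
sdf_nrow(split_data$training)
sdf_nrow(split_data$test)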

Train a Model

We will now train a model using Spark. If you do not know much about fitting predictive models, I would strongly suggest reading up on them so that this code makes more sense and you can interpret the output. Since this data has a binary response variable, we will fit a logistic regression model and then look at how well it predicts on the test data.

# We can train a logistic model predicting class from all other 
# variables using the code below. The seed is set for reproducibility
Log_model <- split_data$training %>% 
  ml_logistic_regression(formula = class ~ ., seed = 865)

responses <- split_data$test %>%
  # select the response variable from the test data
  select(class) %>%
  # collect the results from Spark into R
  collect() %>%
  # make predictions on the test data and attach them
  mutate(predicted_class = predict(Log_model, split_data$test))

# We can take a look at the response data and notice that the 
# predictions are numeric
glimpse(responses)

# Let's change the predictions to integers
responses$predicted_class <- as.integer(responses$predicted_class)

# Now we can use the caret package to view some information about the 
# predictions; confusionMatrix() expects factors, with the predictions 
# as the first argument and the reference values as the second
confusionMatrix(factor(responses$predicted_class), factor(responses$class))


We have now successfully built a model and assessed its performance. We can see that this logistic regression model predicts better than the naive model. There are many other things we could do from here, such as training other models and comparing their performance, or building better models using cross-validation; these topics will be part of another tutorial to come, as will optimizing Spark on your machine through experimentation.
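
As a hedged sketch of where you might go next (this is not part of the original analysis), you could fit a second model with the same formula interface and compare test-set accuracy; ml_random_forest_classifier() is one option in sparklyr:

# fit a random forest on the same training data
rf_model <- split_data$training %>%
  ml_random_forest_classifier(formula = class ~ ., seed = 865)

# predict on the test data and compare simple accuracy with the logistic model
rf_pred <- predict(rf_model, split_data$test)
mean(as.integer(rf_pred) == responses$class)        # random forest accuracy
mean(responses$predicted_class == responses$class)  # logistic regression accuracy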

Disconnect Spark

The last thing you always need to do at the end of your Spark session is disconnect. When you disconnect, all of the data in the Spark environment is erased, so if you changed data and want to keep it, make sure you save it before disconnecting.
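
For example, if you created or modified a table that you want to keep, write it back out before disconnecting; this is just a sketch and the output path is an arbitrary example:

# persist a Spark table to disk before the session ends
spark_write_parquet(DATA, "/home/kevin/Downloads/htru2_final_parquet")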

# This will disconnect the spark instance
spark_disconnect(sc)


I hope this tutorial has been of use to you and let me know if you have any comments or questions. Do keep in mind that Spark and sparklyr are always changing and versioning can change the outcome of your results.

The .Rmd file created in R is available on my Github page.

Citations

Thank you to those who have made these resources, and many others, available to the community.

R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach," MNRAS, 2016.

R. J. Lyon, HTRU2, DOI: 10.6084/m9.figshare.3080389.v1.
