Data Science: Is Python better than R?
For years, researchers and developers have debated whether Python or R is the better language for data science and analytics. Data science has grown rapidly across a variety of industries, including biotechnology, finance and social media. Its importance is recognized not only by practitioners but also by academic institutions, which are now beginning to offer data science degrees. With open-source technologies rapidly displacing traditional closed-source commercial ones, Python and R have become extremely popular amongst data scientists and analysts.
“Data science job growth chart” — Indeed.com
A (Very) Short Introduction
Created by Guido van Rossum, Python was first released in 1991. Python 2.0 followed in 2000, and Python 3.0 arrived eight years later. Python 3.0 introduced major syntax revisions and is not backward-compatible with Python 2. However, tools such as 2to3 can automate the translation of Python 2 code to Python 3. Python 2 is currently scheduled for retirement in 2020.
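To illustrate the kind of incompatibility 2to3 handles, here is a small sketch of two behaviors that changed between the versions:

```python
# Python 2 used a print statement; Python 3 makes print a function:
#   print "Hello, world"     # valid Python 2, a SyntaxError in Python 3
print("Hello, world")

# Division of two integers also changed: 3 / 2 evaluated to 1 in Python 2
print(3 / 2)  # 1.5 in Python 3
```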
R was created in 1995 by Ross Ihaka and Robert Gentleman. It began as an implementation of the S programming language, which John Chambers invented in 1976. A stable beta version, 1.0.0, was released in 2000. R is currently maintained by the R Development Core Team, and the latest stable version is 3.5.1. Unlike Python, R has had no major changes in its history requiring syntax conversion.
Guido van Rossum (left) Ross Ihaka (middle) Robert Gentleman (right)
Both Python and R have large user communities and strong support. A 2017 survey done by Stack Overflow revealed that almost 45% of data scientists used Python as their main programming language. R, on the other hand, was used by 11.2% of data scientists.
“Developer Survey Results 2017” — Stack Overflow
It is important to note that Python, and specifically the Jupyter Notebook, has gained tremendous popularity in recent years. While Jupyter Notebook can be used with languages other than Python, it is mostly used to document and showcase Python programs in the browser, for example in data science competitions such as Kaggle. A survey done by Ben Frederickson revealed that Jupyter Notebook's percentage of Monthly Active Users (MAU) on GitHub has risen sharply since 2015.
“Ranking Programming Languages by GitHub Users” — Ben Frederickson
As Python has gained popularity in recent years, we observe a small decline in the percentage of GitHub MAU coding in R. Nevertheless, both languages remain incredibly popular amongst data scientists, engineers and analysts.
Availability
Initially used in research and academia, R has become more than just a statistical language. R can be easily downloaded from CRAN (The Comprehensive R Archive Network). CRAN also serves as a package repository, with more than 10,000 packages available for download. Popular open-source IDEs such as RStudio can be used to run R. As a statistics major, I can attest that R has a very strong user community on Stack Overflow: most questions I had about R during my undergraduate studies were answered in Stack Overflow's R-tagged Q&A. If you are just getting started with R, many MOOC providers such as Coursera also offer introductory R (and Python) classes.
It is just as easy to set up a Python engineering environment on your local machine. As a matter of fact, recent Macs come with Python 2.7 pre-installed along with several useful libraries. If you are an avid Mac user like I am, I recommend checking out Brian Torres-Gil's Definitive Guide to Python on Mac OSX for an even better Python setup. Open-source package management systems such as pip (backed by PyPI) and Anaconda are easily downloadable from their official sites. At this point, I should probably mention that Anaconda supports R as well, although most people prefer managing R packages directly through CRAN. PyPI, and Python in general, has significantly more packages than R; however, not all of its 100,000+ packages are applicable to statistical and data analysis.
Visualization
Both Python and R have excellent visualization libraries. Created by RStudio's chief scientist Hadley Wickham, ggplot2 is now one of the most popular data visualization packages in the history of R. I am totally in love with ggplot2's wide variety of functionalities and customizations. Compared to base R graphics, ggplot2 allows users to customize plot components at a much higher level of abstraction. ggplot2 offers more than 50 types of plots applicable to various industries. My favorite plots include calendar heatmaps, hierarchical dendrograms and clusters. Selva Prabhakaran has a wonderful tutorial on how to get started with ggplot2.
Calendar Heatmap (top left), Clusters (bottom left) and Hierarchical Dendrogram (right) in ggplot2
Python also has good libraries for data visualization. Matplotlib and its seaborn extension are very helpful for visualizing data and producing attractive statistical graphs. I highly recommend checking out George Seif's 5 Quick and Easy Data Visualizations in Python with Code for a better understanding of matplotlib. Similar to R's ggplot2, matplotlib can create a wide variety of plots, ranging from histograms to vector field stream plots and radar charts. Perhaps one of the coolest features of matplotlib is topographic hillshading, which in my opinion is more powerful than the hillShade() function in R's raster package.
Topographic hillshading using matplotlib
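As a minimal sketch of the matplotlib workflow (the simulated data and output filename here are made up purely for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# simulate some data to plot
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)

# draw a histogram and save it to disk
fig, ax = plt.subplots()
ax.hist(data, bins=30, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Histogram of simulated data")
fig.savefig("histogram.png")
```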
Both R and Python have wrappers for Leaflet.js, a beautiful interactive mapping library written in JavaScript. I wrote an article earlier on how to visualize property prices using Folium, a Python wrapper for Leaflet.js. Leaflet.js is one of the better open-source GIS technologies I've worked with, as it integrates seamlessly with OpenStreetMap and Google Maps. You can also easily create appealing bubble maps, heatmaps and choropleth maps with it. I definitely recommend checking out the Python and R wrappers for Leaflet.js, as their installation is much simpler than Basemap and other GIS libraries.
Alternatively, Plotly is an amazing graphing library common to both languages. Plotly (or Plot.ly) was built with Python, specifically the Django framework. Its front end is built in JavaScript, and it has integrations with Python, R, MATLAB, Perl, Julia, Arduino and REST. If you are trying to build a web app to showcase your visualizations, I would definitely recommend checking out Plotly, as it offers great interactive plots with sliders and buttons.
Plotly correlation plots of the Iris dataset
Predictive Analytics
As I mentioned before, both Python and R have powerful libraries for predictive analytics, and it is hard to compare the two's performance in predictive modeling at a high level. R was written specifically as a statistical language, so it is much easier to search for information pertaining to statistical modeling in R than in Python: a simple Google search of "logistic regression in R" returns 60 million results, 37 times the number you get from searching "logistic regression in Python". On the other hand, data scientists with software engineering backgrounds may find Python easier to use, simply because R was written by statisticians. That said, I found both R and Python equally easy to understand compared to other programming languages.
Kaggle user NanoMathias has done a very thorough investigation of whether Python or R is the better tool for predictive analytics. He concluded that amongst data scientists and analysts, the numbers of Python and R users are pretty equal. An interesting finding from his study is that people who have been coding for 12+ years tend to choose R over Python. This suggests that whether programmers choose R or Python for predictive analytics is nothing more than personal preference.
Linear Discriminant Analysis with embedded scalings, R and Python user analysis
Hmm… so the general consensus is that both languages are quite similar in their predictive abilities. That's kind of lame, isn't it? Let's use R and Python to fit logistic regression models to the Iris dataset and compare the accuracy of their predictions. I chose the Iris dataset for its small size and lack of missing data. No exploratory data analysis (EDA) or feature engineering was done; I simply did an 80-20 train-test split and fit a logistic regression model with one predictor.
library(datasets)
#load data
ir_data<- iris
head(ir_data)
#split data
ir_data<-ir_data[1:100,]
set.seed(100)
samp<-sample(1:100,80)
ir_train<-ir_data[samp,]
ir_test<-ir_data[-samp,]
#fit model
y<-ir_train$Species; x<-ir_train$Sepal.Length
glfit<-glm(y~x, family = 'binomial')
newdata<- data.frame(x=ir_test$Sepal.Length)
#prediction
predicted_val<-predict(glfit, newdata, type="response")
prediction<-data.frame(Sepal.Length=ir_test$Sepal.Length, Species=ir_test$Species, predicted_val, pred_class=ifelse(predicted_val>0.5,'versicolor','setosa'))
#accuracy
sum(prediction$Species==prediction$pred_class)/length(predicted_val)
95% accuracy achieved with R’s glm model. Not bad!
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
#load data
traindf = pd.read_csv("~/data_directory/ir_train")
testdf = pd.read_csv("~/data_directory/ir_test")
x = traindf['Sepal.Length'].values.reshape(-1,1)
y = traindf['Species']
x_test = testdf['Sepal.Length'].values.reshape(-1,1)
y_test = testdf['Species']
#fit model
classifier = LogisticRegression(random_state=0)
classifier.fit(x,y)
#prediction
y_pred = classifier.predict(x_test)
#confusion matrix (renamed so we don't shadow the imported function)
cm = confusion_matrix(y_test, y_pred)
print(cm)
#accuracy
print(classifier.score(x_test, y_test))
90% accuracy achieved with Python sklearn’s LogisticRegression model
Using the glm function from R's stats package and scikit-learn's LogisticRegression, I fit two logistic regression models to a randomized subset of the Iris dataset, using only one predictor, sepal length, to predict the species of the flowers. Both models achieved 90% or higher accuracy, with R giving slightly better predictions. This is, however, insufficient to prove that R has better predictive models than Python: logistic regression is only one of the many predictive models you can build with Python and R. One aspect where Python edges out R is its well-built deep learning modules. Popular Python deep learning libraries include TensorFlow, Theano and Keras. These libraries are well documented, and I am sure Siraj Raval has hundreds of YouTube tutorials on how to use them. To be completely honest, I'd rather spend an hour coding dCNNs (deep convolutional neural networks) in Keras than half a day figuring out how to implement them in R. Igor Bobriakov has made an excellent infographic depicting the numbers of commits and contributors to popular libraries in Python, Scala and R. I highly recommend reading his article (link provided below).
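As a sketch of why I find Keras so quick to work with, here is roughly what defining a small convolutional network looks like (the layer sizes are arbitrary illustrations, not a recommended architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

# a tiny convolutional network for 28x28 grayscale images (MNIST-sized input)
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

From here, training is a single model.fit call on your image tensors and labels.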
“Comparison of top data science libraries for Python, R and Scala [Infographic]” — Igor Bobriakov
Performance
Measuring the speed of a programming language is usually considered a biased task. Each language comes with built-ins optimized for specific tasks (the way R is optimized for statistical analyses), and performance testing with Python and R can be done in many different ways. I wrote two simple scripts in Python and R to compare load times for Yelp's academic user dataset, which is slightly over 2 gigabytes.
R
require(RJSONIO)
start_time <- Sys.time()
json_file <- fromJSON("~/desktop/medium/rpycomparison/yelp-dataset/yelp_academic_dataset_user.json")
json_file <- lapply(json_file, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
df<-as.data.frame(do.call("cbind", json_file))
end_time <- Sys.time()
end_time - start_time
#Time difference of 37.18632 secs
Python
import time
import pandas as pd
start = time.time()
y1 = pd.read_json('~/desktop/medium/rpycomparison/yelp-dataset/yelp_academic_dataset_user.json', lines = True)
end = time.time()
print("Time difference of " + str(end - start) + " seconds")
#Time difference of 169.13606596 seconds
Hmm… interesting. R loads the JSON file almost 5 times quicker than Python, which is surprising given that Python is generally known to have faster load times than R, as demonstrated by Brian Ray's tests. Let's see how both programs handle a large .csv file, since .csv is a commonly used data format. We slightly modify the code above to load the Seattle Library Inventory dataset, which is almost 4.5 gigabytes.
R
start_time <- Sys.time()
df <- read.csv("~/desktop/medium/library-collection-inventory.csv")
end_time <- Sys.time()
end_time - start_time
#Time difference of 3.317888 mins
Python
import time
import pandas as pd
start = time.time()
y1 = pd.read_csv('~/desktop/medium/library-collection-inventory.csv')
end = time.time()
print("Time difference of " + str(end - start) + " seconds")
#Time difference of 92.6236419678 seconds
Yikes! R took almost twice as long as Python's pandas (the Python library for data manipulation and analysis) to load the 4.5 gigabyte .csv file. It is important to know that while pandas is mostly written in Python, its more performance-critical parts are written in Cython or C. This may have a hidden impact on load times depending on the data format.
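When a .csv file is too large to load comfortably in one go, pandas can also stream it in chunks. A small sketch, with a tiny in-memory CSV standing in for a multi-gigabyte file:

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for a multi-gigabyte file on disk
csv_data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# read_csv(chunksize=...) returns an iterator of DataFrames,
# so only one chunk needs to be in memory at a time
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    total_rows += len(chunk)

print(total_rows)  # 10
```

With a real file you would pass the path instead of the StringIO buffer; everything else stays the same.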
Now let's do something a bit more interesting. Bootstrapping is a statistical method that repeatedly resamples from observed data to estimate the sampling distribution of a statistic. I have done enough bootstrapping to know that it is a time-consuming process, as the data must be resampled over many iterations. The following code tests the runtime of 100,000 bootstrap replications in R and Python respectively:
R
#generate data and set boostrap size
set.seed(999)
x <- 0:100
y <- 2*x + rnorm(101, 0, 10)
n <- 1e5
#model definition
fit.mod <- lm(y ~ x)
errors <- resid(fit.mod)
yhat <- fitted(fit.mod)
#bootstrap
boot <- function(n){
b1 <- numeric(n)
b1[1] <- coef(fit.mod)[2]
for(i in 2:n){
resid_boot <- sample(errors, replace=F)
yboot <- yhat + resid_boot
model_boot <- lm(yboot ~ x)
b1[i] <- coef(model_boot)[2]
}
return(b1)
}
start_time <- Sys.time()
boot(n)
end_time <- Sys.time()
#output time
end_time - start_time
#Time difference of 1.116677 mins
Python
import numpy as np
import statsmodels.api as sm
import time
#generate data and set bootstrap size
x = np.arange(0, 101)
y = 2*x + np.random.normal(0, 10, 101)
n = 100000
X = sm.add_constant(x, prepend=False)
#model definition
fitmod = sm.OLS(y, X)
results = fitmod.fit()
resid = results.resid
yhat = results.fittedvalues
#bootstrap
b1 = np.zeros((n))
b1[0] = results.params[0]
start = time.time()
for i in range(1, n):
resid_boot = np.random.permutation(resid)
yboot = yhat + resid_boot
model_boot = sm.OLS(yboot, X)
resultsboot = model_boot.fit()
b1[i] = resultsboot.params[0]
end = time.time()
#output time
print("Time difference of " + str(end - start) + " seconds")
#Time difference of 29.486082077 seconds
R took more than twice as long as Python to run the bootstrap. This is fairly surprising given that Python is generally perceived as a 'slow' programming language. I am slowly starting to regret running all my undergraduate statistics assignments in R instead of Python.
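For what it's worth, much of the Python loop's time goes into refitting an OLS model per replication. The same residual-permutation scheme can be vectorized in NumPy using the closed-form slope formula; a sketch with a smaller replication count for illustration:

```python
import numpy as np

rng = np.random.default_rng(999)
x = np.arange(0, 101)
y = 2 * x + rng.normal(0, 10, 101)

# fit the base model once with least squares
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
resid = y - yhat

# permute residuals for each replication, as in the scripts above
n = 1000  # fewer replications, for illustration
yboot = yhat + np.array([rng.permutation(resid) for _ in range(n)])

# slope of each refit via the closed form cov(x, y) / var(x),
# computed for all replications at once
xc = x - x.mean()
slopes = (yboot - yboot.mean(axis=1, keepdims=True)) @ xc / (xc @ xc)
print(slopes.shape)  # (1000,)
```

This avoids constructing a model object per replication, which is where most of the per-iteration overhead comes from.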
Conclusion
This article has only covered the fundamental differences between Python and R. Personally, I alternate between the two depending on the task at hand. Recently, data scientists have been pushing to use Python and R in conjunction. It is also quite possible that a third language will emerge in the near future and eventually edge out Python and R in popularity. As data scientists and engineers, it is our responsibility to keep up with the latest technologies and stay innovative. Finally, I highly recommend reading Karlijn Willems's Choosing R or Python for Data Analysis? An Infographic. It provides a great visual summary of what we've discussed in this article, along with additional information such as job trends and median salaries. Comment below and let me know which language you prefer!
Data Scientist: "Person who is better at statistics than any software engineer and better at software engineering than any statistician"