Bump Chart: R-Programming

SHASHANK SHINDHE Ph.D

Published May 24, 2020

A bump chart is a specialized line chart used to represent three variables (one of which contains a countable number of components or parts) in a two-dimensional layout. It is a special-purpose visual aid that helps in studying the relative positions of the different constructs of a variable, either chronologically or spatially. Bump charts find application in scenarios where you want to follow a leader or understand the popularity of consumer durables, automobiles and so on.

Bump charts are not a new invention. They have been around for years, but they were brought to prominence by the work of Matt Chambers, which showed the evolution of car colors in America from 2000 to 2015. This visualization gained immense popularity on Tableau Public and was nominated for the "Viz of the Year 2016" award.

Bump charts are fairly easy to understand and interpret. In essence, a bump chart is a collection of lines (one for each construct of a variable) placed on a plot, which change direction and slope according to the magnitude of the underlying variable. If any line in the plot crosses another line, it indicates a change in the relative positions (or ranks) of the corresponding constructs.
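
To make the idea of a crossing concrete, here is a minimal sketch in base R. The scores and the names A, B and C are made up for illustration and have nothing to do with the COVID-19 data used later.

```r
# Hypothetical scores for three constructs (A, B, C) at two time points
scores <- data.frame(
  item = c("A", "B", "C"),
  t1   = c(10, 7, 3),
  t2   = c(12, 15, 4)
)

# Rank within each time point; the largest score gets rank 1
scores$rank_t1 <- rank(-scores$t1, ties.method = "first")
scores$rank_t2 <- rank(-scores$t2, ties.method = "first")

# A construct whose rank differs between the two time points
# has crossed at least one other line in the bump chart
scores$changed <- scores$rank_t1 != scores$rank_t2
print(scores)
```

Here B overtakes A between the two time points, so their lines would cross, while C stays in third place.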

Now, let us represent the relative positions of the different states of India, according to the number of positive COVID-19 patients in each of them, using a bump chart. This piece deals with the creation of the chart using R. Apart from R, bump charts can also be created with Tableau, RAWGraphs (by DensityDesign Lab) and Python. The data used to construct this chart is sourced from covid19india.org. The cumulative number of reported COVID-19 cases in each state is considered, and the rankings of the states are represented as the components (lines) of the chart.

Data Wrangling

The data required for constructing the bump chart is a data frame containing all components (cases in each state) of the variable (cases in India), ranked at each point in time. To get there, we first read the data using the readr package. Along with it, we load dplyr (for data manipulation), reshape2 (for restructuring the data), ggplot2 (for data visualization) and directlabels (to make plots more legible).

After reading the datasets, which are available at the source in .csv format, we extract the required information from them and merge it into a single data frame.

>library(ggplot2)
>library(dplyr)
>library(readr)
>library(reshape2)
>library(directlabels)


>raw_data1<-read_csv("https://api.covid19india.org/csv/latest/raw_data1.csv")
>raw_data2<-read.csv("https://api.covid19india.org/csv/latest/raw_data2.csv")
>raw_data3<-read_csv("https://api.covid19india.org/csv/latest/raw_data3.csv")
>raw_data4<-read_csv("https://api.covid19india.org/csv/latest/raw_data4.csv")


>df1<-data.frame(raw_data1$`Date Announced`,raw_data1$`State code`)
>colnames(df1)<-c("date_detected", "state_detected")
>df2<-data.frame(raw_data2$Date.Announced, raw_data2$State.code)
>colnames(df2)<-c("date_detected", "state_detected")
>df3<-data.frame(raw_data3$`Date Announced`, raw_data3$`State code`)
>colnames(df3)<-c("date_detected", "state_detected")
>df4<-data.frame(raw_data4$`Date Announced`, raw_data4$`State code`)
>colnames(df4)<-c("date_detected", "state_detected")
>df.all<-bind_rows(df1, df2, df3, df4)
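
The same extract-and-merge step can also be written with a small helper instead of repeating it for every file. The sketch below uses base R only; the data frames raw_a and raw_b and the helper pick_columns are hypothetical stand-ins with made-up values, not the actual raw files.

```r
# Toy stand-ins for two of the raw files (values are made up)
raw_a <- data.frame(`Date Announced` = c("30/01/2020", "02/02/2020"),
                    `State code` = c("KL", "KL"),
                    check.names = FALSE, stringsAsFactors = FALSE)
raw_b <- data.frame(`Date Announced` = c("02/03/2020"),
                    `State code` = c("DL"),
                    check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: keep the two columns of interest and rename them
pick_columns <- function(raw) {
  data.frame(date_detected  = raw[["Date Announced"]],
             state_detected = raw[["State code"]],
             stringsAsFactors = FALSE)
}

# Apply the helper to every file and stack the results row-wise
df_toy <- do.call(rbind, lapply(list(raw_a, raw_b), pick_columns))
print(df_toy)
```

The helper assumes every file uses the same two column names, which is why a file read with read.csv (where spaces in names become dots) would need its own handling.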

After merging the data sets into one data frame, it looks like this:

>head(df.all) 

  date_detected state_detected
1    30/01/2020             KL
2    02/02/2020             KL
3    03/02/2020             KL
4    02/03/2020             DL
5    02/03/2020             TG
6    03/03/2020             RJ

Next, we create a cross tabulation of dates and states, which gives a data frame (crossdf) with one row per date and one column per state.

>df.all$date_detected<-as.Date(df.all$date_detected, "%d/%m/%Y")
>cross.table<-table(df.all$date_detected, df.all$state_detected)
>crossdf<-as.data.frame.matrix(cross.table)

>head(crossdf)
           AN AP AR AS BR CH CT DL DN GA GJ HP HR JH JK KA KL LA MH ML MN MP MZ
2020-01-30  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
2020-02-02  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
2020-02-03  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
2020-03-02  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
2020-03-03  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
2020-03-04  0  0  0  0  0  0  0  0  0  0  0  0 14  0  0  0  0  0  0  0  0  0  0
           OR PB PY RJ SK TG TN TR UN UP UT WB
2020-01-30  0  0  0  0  0  0  0  0  0  0  0  0
2020-02-02  0  0  0  0  0  0  0  0  0  0  0  0
2020-02-03  0  0  0  0  0  0  0  0  0  0  0  0
2020-03-02  0  0  0  0  0  1  0  0  0  0  0  0
2020-03-03  0  0  0  1  0  0  0  0  0  0  0  0
2020-03-04  0  0  0  1  0  0  0  0  0  7  0  0
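
As a small illustration of what the cross tabulation does, consider the sketch below. The five records are made up; only table() and as.data.frame.matrix() are taken from the code above.

```r
# Made-up records: one row per reported case
toy <- data.frame(
  date_detected  = as.Date(c("2020-03-02", "2020-03-02", "2020-03-03",
                             "2020-03-03", "2020-03-03")),
  state_detected = c("DL", "TG", "RJ", "DL", "DL")
)

# table() counts the rows for every date/state combination
toy_table <- table(toy$date_detected, toy$state_detected)

# as.data.frame.matrix() keeps the wide layout:
# one row per date, one column per state
toy_cross <- as.data.frame.matrix(toy_table)
print(toy_cross)
```

Note that a date on which no case was reported anywhere does not get a row of its own, which is why the dates in crossdf jump from 2020-01-30 to 2020-02-02.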

After obtaining the cross tabulation, we compute the cumulative sum of the number of cases in each state and melt the data frame, so that the cumulative cases of the states are stacked on top of each other (long format) rather than placed side by side as in crossdf.

>df.new<-as.data.frame(apply(crossdf,2,cumsum))
># melt() on a matrix returns the columns Var1 (row names), Var2 (column names) and value
>crossdf.new<-melt(as.matrix(df.new))
>colnames(crossdf.new)<-c("Day","States","Total Cases")

The melted data frame will have the following structure.

>head(crossdf.new, 20)

       Day   States Total Cases
1  2020-01-30     AN           0
2  2020-02-02     AN           0
3  2020-02-03     AN           0
4  2020-03-02     AN           0
5  2020-03-03     AN           0
6  2020-03-04     AN           0
7  2020-03-05     AN           0
8  2020-03-06     AN           0
9  2020-03-07     AN           0
10 2020-03-08     AN           0
11 2020-03-09     AN           0
12 2020-03-10     AN           0
13 2020-03-11     AN           0
14 2020-03-12     AN           0
15 2020-03-13     AN           0
16 2020-03-14     AN           0
17 2020-03-15     AN           0
18 2020-03-16     AN           0
19 2020-03-17     AN           0
20 2020-03-18     AN           0
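
The cumulative sum and the wide-to-long reshaping can be tried out on a tiny example. The sketch below uses made-up counts and, as an assumption for the sake of a dependency-free example, base R's as.table() in place of reshape2's melt(); the resulting shape is comparable.

```r
# Made-up daily counts: rows are dates, columns are states
daily <- matrix(c(1, 0, 2,
                  0, 3, 1),
                nrow = 3,
                dimnames = list(c("2020-03-02", "2020-03-03", "2020-03-04"),
                                c("DL", "TG")))

# Running total down each column (margin 2 = columns)
running <- apply(daily, 2, cumsum)

# Wide to long: one row per date/state pair
long <- as.data.frame(as.table(running), stringsAsFactors = FALSE)
colnames(long) <- c("Day", "States", "Total Cases")
print(long)
```

All dates of the first state are listed before the second state starts, which matches the structure of crossdf.new shown above.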

After melting the data frame, we rank the states for each date based on the cumulative number of COVID-19 cases up to that date. In case of ties in the number of cases, states are ordered alphabetically. For simplicity of presentation, and because hardly any cases were recorded in the early period, the data before 2 March 2020 (the 33rd day, counting the date of the first case in India, 30 January 2020, as day 1) is omitted.

>crossdf.rank<-crossdf.new %>%
  group_by(Day) %>%
  arrange(Day, desc(`Total Cases`), States) %>%
  mutate(ranking=row_number(), day = as.numeric(as.Date(Day))-18290) %>%
  as.data.frame()
>crossdf.rank$Day<-as.Date(crossdf.rank$Day, "%Y-%m-%d")
>sub_crossdf.rank<-crossdf.rank[crossdf.rank$Day>=as.Date("2020-03-02", "%Y-%m-%d"),]


> head(sub_crossdf.rank,20)
           Day States Total Cases ranking day
106 2020-03-02     KL           3       1  33
107 2020-03-02     DL           1       2  33
108 2020-03-02     TG           1       3  33
109 2020-03-02     AN           0       4  33
110 2020-03-02     AP           0       5  33
111 2020-03-02     AR           0       6  33
112 2020-03-02     AS           0       7  33
113 2020-03-02     BR           0       8  33
114 2020-03-02     CH           0       9  33
115 2020-03-02     CT           0      10  33
116 2020-03-02     DN           0      11  33
117 2020-03-02     GA           0      12  33
118 2020-03-02     GJ           0      13  33
119 2020-03-02     HP           0      14  33
120 2020-03-02     HR           0      15  33
121 2020-03-02     JH           0      16  33
122 2020-03-02     JK           0      17  33
123 2020-03-02     KA           0      18  33
124 2020-03-02     LA           0      19  33
125 2020-03-02     MH           0      20  33
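
Two details of the ranking step are easy to check on a small example: the alphabetical tie-break and the constant 18290 used for the day counter. The sketch below uses base R and made-up counts for a single day; it illustrates the logic and is not the code used for the chart.

```r
# Made-up cumulative counts for one day; AP and AS are tied at zero
one_day <- data.frame(
  States = c("AS", "KL", "AP", "DL"),
  total  = c(0, 3, 0, 1),
  stringsAsFactors = FALSE
)

# Sort by descending count, then alphabetically to break ties
one_day <- one_day[order(-one_day$total, one_day$States), ]
one_day$ranking <- seq_len(nrow(one_day))
print(one_day)

# Day counter: 18290 is the numeric value of 2020-01-29,
# so 30 January 2020 becomes day 1 and 2 March 2020 becomes day 33
day_of_first_case <- as.numeric(as.Date("2020-01-30")) - 18290
day_of_cutoff     <- as.numeric(as.Date("2020-03-02")) - 18290
print(c(day_of_first_case, day_of_cutoff))
```

Because ties are broken by position rather than shared, every rank from 1 to the number of states is used exactly once per day, so no two lines overlap in the chart.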

The data frame in the above format is ready to be plotted. So, without further ado, we start preparing the plot.

Plotting the data frame

The basic version of the plot is obtained by using a simple ggplot function as given below.

>ggplot(data = sub_crossdf.rank, aes(x=day, y=ranking, group=States))+
   geom_line(aes(color=States), alpha=1, size=2)+
   geom_point(aes(color=States), alpha=1, size=2)+
   # one tick per rank; rank 1 is drawn at the top
   scale_y_reverse(breaks = 1:max(sub_crossdf.rank$ranking))+
   # one tick per day
   scale_x_continuous(breaks = unique(sub_crossdf.rank$day))+
   labs(x = "Number of days", y = "Ranks", title = "Number of COVID-19 cases in India",
     subtitle = "States ranked based on the number of active cases")

The above graph is a primitive one and needs to be formatted and polished. First of all, plotting all 35 states (including union territories) is not necessary, so we limit the number of lines to the top-ranking states (here, the top 15). We can add labels at the beginning and at the end of each line to identify the state it corresponds to, so there is no need for a legend. Finally, the X-axis ticks are too many and overlap; this can be rectified by rotating the tick labels by a certain angle. Also, in order to make the plot clearer, we use a custom styling function as given by Dominik Koch.

theme_custom<- function ()
{
  color.background = "white"
  color.text = "#22211d"
  
  #Construction of chart
  theme_bw(base_size = 15)+
    theme(panel.background = element_rect(fill = color.background, color = color.background))+
    theme(plot.background = element_rect(fill = color.background, color = color.background))+
    theme(panel.border = element_rect(color = color.background))+
    theme(strip.background = element_rect(fill = color.background, color = color.background))+
    #formatting the grid
    theme(panel.grid.major.y = element_blank())+
    theme(panel.grid.minor.y = element_blank())+
    theme(axis.ticks = element_blank())+
    #Formatting the legend
    theme(legend.position = "none")+
    #Formatting title and axis labels
    theme(plot.title = element_text(color = color.text, size = 20, face = "bold"))+
    theme(axis.title.x = element_text(size = 14, color = "black", face = "bold"))+
    theme(axis.title.y = element_text(size = 14, colour = "black", face = "bold", vjust = 1.25))+
    theme(axis.text.x = element_text(size = 10, angle = 90, color = color.text, vjust = 0.5, hjust = 0.5))+
    theme(axis.text.y = element_text(size = 10, colour = color.text))+
    theme(strip.text = element_text(face = "bold"))+
    #formatting plot margins
    theme(plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"))
}


ggplot(data = sub_crossdf.rank, aes(x=day, y=ranking, group=States))+
  geom_line(aes(color=States), alpha=1, size=2)+
  # points: white base, colored disc on top, small white dot in the center
  geom_point(color = "#FFFFFF", size = 4)+
  geom_point(aes(color=States), alpha=1, size=4)+
  geom_point(color = "#FFFFFF", size = 1)+
  # show only ranks 1 to 15, with rank 1 at the top
  ylim(rev(c(1,15)))+
  # label each line at its first and last point instead of using a legend
  geom_dl(aes(label=States), method=list(dl.combine("first.points","last.points"), cex=1))+
  scale_x_continuous(breaks = unique(sub_crossdf.rank$day))+
  labs(x = "Number of days", y = "Ranks", title = "Number of COVID-19 cases in India", subtitle = "States ranked based on the number of active cases")+
  # the palette is truncated in the source; one color per state is needed
  scale_color_manual(values = c("#00641c","#d941c1","#60d954","#894ed5"...))+
  theme_custom()

Finally, we obtain the following plot. Its appearance may be improved by considering fewer observations (or ticks) along the horizontal axis and/or increasing the plotting area. With this plot, the most severely hit states can be tracked, and the recovery or escalation of the situation can be studied over time.
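
One way to declutter the horizontal axis is to draw a tick only every few days. The sketch below shows the idea in base R; the vector day is a made-up stand-in for sub_crossdf.rank$day, and the 5-day step is an arbitrary choice.

```r
# Made-up day counter, standing in for sub_crossdf.rank$day
day <- 33:60

# Keep one tick every 5 days instead of one per day
sparse_breaks <- seq(min(day), max(day), by = 5)
print(sparse_breaks)
```

The resulting vector could then be passed to the plot as scale_x_continuous(breaks = sparse_breaks) in place of one break per day.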

However, it should be kept in mind that the plot only represents the relative positions of the states and does not reflect the conditions in them or the measures taken by their governments to control the situation, because the proliferation of COVID-19 depends on many other factors.

I hope you enjoy creating a bump chart of your own. I certainly enjoyed writing this. Please like, share or criticize. Happy bumping!
