Data Analytics of NSE 2000 – 2021
Part I - Using R and R Studio
Introduction
In this analytics work, we are going to analyse the National Stock Exchange of India (referred to as NSE or NIFTY), based on the published data of NSE. NSE publishes this data as frequently as a month, and in some cases in weeks as well. This data is available as a Comma Separated Value (.csv) file, and we are going to use this file as the first step.
Pre-Requisites
Those who are intending to use these analytics must know the basics of R, including R studio. Those who are not conversant with R can still be able to follow, if they are aware of contemporary languages and frameworks such as Java, Python, Julia etc. The user’s system must be installed with R and RStudio, preferably latest version. I am using R version 4.0.4, and RStudio version 1.4.1106, and any latter versions should be reproducible for the users.
What are covered
Data analytics is based on the underlying data. All those top-level analytics are covered in this work, which are based on NSE data, as on April 2022. Some of the significant analytic steps include the following:
Data Gathering
Go to the web site www.nseindia.com, and from the top menu bar, choose MARKET DATA, as shown in Screen 1.
Under this, choose Securities Available for Trading from the list. This will take you to Screen 2, as shown under. Under the Capital Markets header, choose Securities available for Equity segment option.
This will download a .csv file, which is usually named as EQUITY_L.csv[1]. Just to make sure that the downloaded file is alright, inspect this file using any spread-sheet software application that you are comfortable with.
Data Parsing and Understanding
In this step, we are going to load the data into the R environment. As a part of this, fire up your RStudio app and ensure that you have started with a clean slate[2].
In the console, find out which is the current working directory, using the following command:
> getwd()
This will show your current working directory, as say, in my case:
[1] "/Users/kumarbv"
For you guys, it could be different. For windows, it will be different from that of Mac or Linux platforms. Now, for grabbing the Comma Separated Values file, you need to go to the downloaded directory. It’s usually the Downloads folder under your current working directory. Shift yourselves to that directory using the following command on the console:
> setwd(‘Downloads’)
This is reflected on the bar of the console window of the RSrudio. Now, start a new R Script file from the File menu (which is usually given a name Untitled1) and change its name to, say, NSE Data Analysis. This will be saving the script file as NSE Data Analysis.r
In the script file, let’s now load the CSV file using read.csv() command and store it as a data frame object called nse, as follows[3]:
nse <- read.csv(‘EQUITY_L.csv’)
While the cursor is on this line, click on Run button on the RStudio bar as shown[4]:
In the environment panel, you can see the nse object and its details, reflecting that there are 1181 objects with 8 variables. This are the dimensions of the nse data frame. To view this, click on that object, and in the main pane, you will see the completely loaded data file[5].
You can see that the nse data frame has 1181 rows and 8 columns. You can scroll up and down, horizontally and inspect the data. Now, to get the summary of a better understanding the data, in the console, provide the following command:
>summary(nse)
The result will be as shown:
In the above analysis bit, you can see that column-wise analysis is provided for the nse data by R. The columns are SYMBOL, NAME.OF.COMPANY, SERIES, DATE.OF.LISTING, PAID.UP.VALUE, MARKET.LOT, ISIN.NUMBER, and FACE.VALUE. Among these variables, PAID.UP.VALUE, MARKET.LOT, and FACE.VALUE are understood as integer or numeric data, while the rest of the columns (variables) are reflected as character. Also, for each of the non-numeric data, the length is mentioned as 1181, which corresponds to the number of rows in the .csv file.
Data Wrangling
There are few problems in the data, and this we have to rectify before the next steps. This is Data Wrangling. In the existing data, the variables SYMBOL, NAME.OF.COMPANY, and ISIN.NUMBER, are really characters (or strings) and we are okay with that. However, SERIES is a factor data type, and DATE.OF.LISTING is of Date type. We have to enforce that on this nse data first.
In the R Script file, enter the following line and run it:
nse$SERIES <-as.factor(nse$SERIES)
In the console, you can run the following command to see the existing levels of SERIES factor in nse data:
levels(nse$SERIES)
The console shows the following output:
[1] "BE" "BZ" "EQ"
This means that our nse data’s SERIES column is now converted into factor datatype, and in the data, there are three types of SERIES – BE, BZ and EQ.
Next, we will convert the DATE.OF.LISTING variable to date type. To do that, we need to use the lubridate package[6]. Install the package and load it with library command in console:
>library(lubridate)
And run that line by clicking on the Run button. After this, now let’s convert the column the DATE.OF.LISTING variable into datatype by running the following command in the script file:
nse$DATE.OF.LISTING = lubridate::dmy(nse$DATE.OF.LISTING)
The above command is specifically chosen, as the date format of the data file is of DD-MMM-YYYY. Now, click on the Run button, while the cursor is on this line.
Now hit on the run button, and then run the summary(nse) command on the console.
Now, observe the SERIES and DATE.OF.LISTING summary. They are no longer of Class type character.
For series we have the statistics of three different types of SERIES – BE, BZ and EQ. Further, the summary of the SERIES lists that there are 160 BEs, 26 BZs and 1625 EQs. Together, there are 1811, which are the number of rows in the data!
For DATE.OF.LISTING, we have a minimum date, maximum date and other date related analysis statistic summary is generated.
Now, let’s really check if our nse base data’s metadata is already changed. To do this, let’s list all those shares that were listed since beginning of 2020-June, using the following command in the console:
nse[nse$DATE.OF.LISTING >= '2022-06-01', ]
You will get the following output in the console:
Recommended by LinkedIn
Our analysis shows that there were only 2 companies that were listed in the month of June 2020[7].
Data Visualisation and Analysis
Now, let’s dive deep into the Analysis and Visualisation part of the exercise. Here, let’s try to find out how many companies are getting listed each year, and once we get the data for, say, 2000 to 2021, we can plot and see the behaviour or trend. Before we get this data, we will run[8] the following command in the console first:
y2020 = count(nse[nse$DATE.OF.LISTING >= '2020-01-01' & nse$DATE.OF.LISTING <= '2021-12-31', ])[[1]]
The above command gets the number of listed companies in the year 2020, and stores it in a variable called y2020. Run this command in the console and see what is the value of y2020.
As you can see there are 237 companies that were listed in NSE in the calendar year 2020. Now, for the decade covering 2000 to 2020, and including the year 2021, we need to gather this data. Two ways to do this – Hard way - by running the above command and capture the counts 11 times or do the same Easy Way - creating a for loop. We go by Easy way, as shown below:
Before the for loop, we will create a data frame called df, which will store the year and count for that year. Next, we will create two local variables and initialise them – i and totalCount. The variable i is initialised to 1, while the totalCount is initialised to 0.
The for loop will run from range 2000 to 2021, and the variable year is assigned this number. Inside the loop, we will convert this number to string representation of the year beginning (called yb) and year end (called ye). Then we will create two date objects – startDate from yb and endDate from ye. Then we will compute the number of companies using the count() method as was shown in console little earlier. Then, we will populate the df data frame and enhance the local variables i and totalCount suitably.
To run this code, click Run on line 6, 7 and 8 (see above figure) once each. You should see these variables appearing on the Environment tab of right side of the RStudio IDE. Next, click on line 9 once, the entire loop (lines 9 – 19) should run 9 times and the results are generated. The next screen shows the Screen capturing the data frame df which stores the number of companies listed each year from 2000 to 2021.
On the console, you can type df and get the same data as well. By this simple method, we generated data of number of companies listed in NSE from 2000 through 2021. Now, let’s plot the same and visualise it graphically. In the console, use the plot command as follows:
> plot(df)
As you can see that R quickly realises the straight-forward data frame and provides a scatter plot of the same. To have a better visualisation, let’s use the barplot command as shown below:
barplot(df[["count"]], names.arg = df[['year']], horiz = F)
Now, you will get the bar plot as shown below:
You can extend this to plot for 20 – 35 earlier years and have the data and visualisation of the same.
Before ending part, this, let me pick up one more analytic, based on the SERIES part of the data. As you know, NSE offers 3 different types of SERIES – BE, BZ and EQ. Let’s see, based on all the listed companies, what their ratios are. Through the summary() command, we have already seen their numbers, as below:
The numbers of BE, BZ and EQ are, respectively, 160, 26 and 1625, from the above. Let’s capture these numbers, and enter them in the script file and Run them individually as follows:
beCount = count(nse[nse$SERIES == 'BE',])[[1]]
bzCount = count(nse[nse$SERIES == 'BZ',])[[1]]
eqCount = count(nse[nse$SERIES == 'EQ',])[[1]]
Now, let’s create two lists based on these counts, in the script file, as follows:
series = c(‘BE’, ‘BZ’, ‘EQ’)
values = c(beCount, bzCount, eqCount)
Now, in the console, run the pie chart command as follows:
>pie(values, labels = series)
You will get the following plot:
However, we didn’t get the ratios (or percentages) of the SERIES Components, as, in R, they need to be individually computed and pasted. Let’s do this, as follows:
pctBE = paste('BE', round(beCount/sum(values)*100))
pctBZ = paste('BZ', round(bzCount/sum(values)*100))
pctEQ = paste('EQ', round(eqCount/sum(values)*100))
pctSeries = c(pctBE, pctBZ, pctEQ)
This time, when we plot the pie chart, we will use the pctSeries instead of series variable, as follows:
pie(values, labels = pctSeries)
The following would be the resulting plot with percentages of BE, BZ and EQ. Accordingly, we see that BE is 9%, while BZ is just 1% and EQ has the lion’s share of 90%. The graph is as shown:
Data Tweaking for Further Analysis
So far, we just played around with just 2 or 3 variables – DATE.OF.LISTING, SERIES and count of the listed companies in the last 11 years. There are other variables which you might want to analyses and report. For example, PAID.UP.VALUE, FACE.VALUE or MARKET.LOT. Perhaps, if you might want to how many (or what percentage of) EQ series are there with a FACE.VALUE of 5. Or, you might want to how many (or what percentage of) PAID.UP.VALUE of 5 is there in EQ series etc. You may need to tweak the data to calculate, plot and visualise as we did in the last section. I suggest the users to try and see if you could extend the analytics based on the suggested data, and post it on the linked in.
Reporting
Reporting is often ignored, but happens to be an important part of the Data Analytics step. However, it is very easy and sometimes more straight-forward, when you use tools such as JupyterNotebooks, wherein your Analytics can be mixed with the report directly. Reporting, however, should contain an abstract, introduction, and summary, along with detailed analysis, including plots, tables etc. However, much of the reporting should contain the analytics and its results, and some conclusion. For example, in the above analysis, you can conclude that … EQ series happens to be the series where maximum number of shares are listed and it is roughly about 90%... This will give weightage to your analysis, as some important conclusions could be drawn.
Conclusion
In this short analytic exercise, we have used the data published by NSE (India) and analysed different aspects of Data Analytics. The exercise contained all aspects of Data Analytics, including Data Gathering, Parsing and Understanding, Wrangling, Visualisation and, finally, analysis and results.
Foot Notes:
[1] Sometimes, the file name could be different. Please see what file is downloaded.
[2] If you see some variables, history, connections etc., please use Session -> Clear Workspace… option to keep the space clean, and to clear the environment, use the rm(list = ls()) command in the console.
[3] You can also use = instead of <-, but what is preferred is <-. Please stick to that.
[4] Note that we keep using this way to partially run the script file, time and again in this analytics.
[5] Alternatively, on the console, you can type the command >View(nse) to load the file for inspection.
[6] Normally lubridate packages are not loaded, by default. You can do that using the install.packages() command or click on the check-box of lubridate package in the packages tab.
[7] This could be different, as more companies could be listed between today (08-June-2022) and 30-June-2022.
[8] Please note that libraries plyr and dplyr should be loaded.