The Data Analysis Process

Do you know how the data analysis process is carried out? In today’s blog, we are going to learn about the various steps involved in the data analysis process in detail. First, let us see what data analysis is.

According to Wikipedia, data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. There are mainly 5 steps involved in the data analysis process.

  1. Asking Questions
  2. Data Wrangling/Data Munging
  3. Exploratory Data Analysis (EDA)
  4. Drawing Conclusions
  5. Communicating Results/Data Storytelling

We will understand each step with the help of the Titanic data set.

Many of you must have watched the movie Titanic, released in 1997. For those of you who don’t know about the Titanic, it was a British luxury liner that sank in the early hours of April 15, 1912, after striking an iceberg. There were an estimated 2224 passengers and crew on board, and more than 1500 died, making it one of the worst passenger ship disasters in history. So let’s take a closer look at the people who were on board and find out their likelihood of survival.

The part of the Titanic data set used here consists of 891 rows and 12 columns.

Stage 1:

Asking the right questions is an essential part of the data analysis process. Questions should be measurable, clear and concise. To find out the likelihood of survival of the people on board, you can ask questions such as:

  • Which columns are not necessary for our analysis?
  • Does gender affect the likelihood of survival?
  • Does passenger class affect the likelihood of survival?
  • Does age affect the likelihood of survival?
  • What is the correlation between the various columns?

Stage 2:

Data wrangling (or data munging) is the process of transforming and mapping data from one raw form into another format so that it is more convenient to consume and organize. In simple terms, data is sometimes not in the right form for analysis and must first be converted to an appropriate form.

Data wrangling can be further divided into 3 substeps:

1. Gathering Data:

Sometimes the company provides you with the required data, but that is not always the case. If you get a .csv file, you can start right away. Otherwise, you may have to fetch data from APIs or through web scraping. Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in a table (spreadsheet) format.

For example, I downloaded train.csv from Kaggle for my Titanic survivor predictor model.
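
As a minimal sketch, assuming the data is read with pandas, the snippet below loads a CSV into a DataFrame. The three inline rows are made-up stand-ins so the example runs without the file, and load_dataset is a hypothetical helper name. With the real Kaggle file you would pass the path "train.csv" instead.

```python
import io

import pandas as pd

# Made-up rows standing in for Kaggle's train.csv (illustration only,
# not real passenger records).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.92,,Q
"""


def load_dataset(source):
    """Read a CSV file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real file: df = load_dataset("train.csv")
df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df.head())
```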

2. Assessing Data:

Once you have the data, you have to assess it. This can be done by

  • checking the shape of the data (number of rows and columns)
  • checking data types of various columns
  • checking for missing values
  • checking for duplicate data
  • checking the memory occupied by the dataset
  • high-level mathematical overview of the dataset

In our Titanic survivor model, you can find the shape with the dataset_name.shape attribute. For this data set it returns (891, 12), that is, 891 rows and 12 columns.

Then you can call dataset_name.info() to get the data types of the various columns, along with their non-null counts and the memory usage.

You can also get a high-level mathematical overview of the data set with dataset_name.describe().
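
The assessment checks listed above can be sketched in one place as follows, assuming a pandas DataFrame. The five rows are made up for illustration and are not real Titanic records.

```python
import pandas as pd

# Made-up rows for illustration only (not real Titanic records).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, 22.0],
    "Embarked": ["S", "C", "Q", "S", "S"],
})

n_rows, n_cols = df.shape                        # number of rows and columns
dtypes = df.dtypes                               # data type of each column
missing = df.isnull().sum()                      # missing values per column
n_duplicates = df.duplicated().sum()             # rows identical to an earlier row
memory_bytes = df.memory_usage(deep=True).sum()  # bytes, counting string contents
summary = df.describe()                          # count, mean, std, min, quartiles, max

print(n_rows, n_cols)
print(dtypes)
print(missing)
print(n_duplicates)
print(summary)
```

Note that describe() only summarizes the numeric columns by default, so 'Sex' and 'Embarked' do not appear in its output here.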

3. Cleaning Data:

Now you have enough information about the features of the data set, so the next step is to clean the data. First, you can delete the columns that are not necessary for the analysis. For example, I dropped Name, PassengerId, Ticket, and Cabin because they are unlikely to help explain whether a person survived. Then I replaced ‘male’ with 0 and ‘female’ with 1, since most numerical analysis cannot be carried out on strings. I also removed the rows in which the age of the person is not known and the rows in which ‘Embarked’ is ‘Q’ (as they are few in number). Finally, I dropped the rows that have null values in any of the remaining columns. After all these operations, my data set consists of 684 rows and 8 columns (down from 891 rows and 12 columns).
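
A minimal sketch of these cleaning steps, assuming pandas and the Kaggle column names. The function name clean and the five input rows are hypothetical and made up for illustration, so the resulting row counts differ from the real data set.

```python
import pandas as pd

# Made-up rows for illustration only (not real Titanic records).
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 27.0],
    "Ticket": ["T1", "T2", "T3", "T4", "T5"],
    "Cabin": [None, "C85", None, None, "E46"],
    "Embarked": ["S", "C", "S", "Q", None],
})


def clean(df):
    """Apply the cleaning steps described above and return a new DataFrame."""
    # Drop the columns judged unnecessary for the analysis.
    out = df.drop(columns=["Name", "PassengerId", "Ticket", "Cabin"])
    # Encode gender as numbers: male -> 0, female -> 1.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Remove rows with unknown age.
    out = out.dropna(subset=["Age"])
    # Remove rows where the port of embarkation is 'Q'.
    out = out[out["Embarked"] != "Q"]
    # Remove rows with a null value in any remaining column.
    return out.dropna()


cleaned = clean(raw)
print(raw.shape, "->", cleaned.shape)
```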

Memory usage dropped from 68.1 kB to 48.1 kB. Cleaning the data reduced the size of the data set and hence the amount of memory it occupies.
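
One way to check this kind of reduction yourself, assuming pandas, is to compare memory_usage before and after cleaning. The helper memory_kb and the four rows are hypothetical. Passing deep=True counts the actual string contents, so the numbers will not match the figures reported by info() above.

```python
import pandas as pd


def memory_kb(df):
    """Total memory used by a DataFrame in kB, counting string contents."""
    return df.memory_usage(deep=True).sum() / 1024


# Made-up rows for illustration only (not real Titanic records).
before = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, 35.0],
    "Survived": [0, 1, 1, 0],
})

# Drop an unneeded column and the row with a missing age.
after = before.drop(columns=["Name"]).dropna()

print(f"{memory_kb(before):.2f} kB -> {memory_kb(after):.2f} kB")
```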

Stage 3:

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations. For example, to answer the questions in stage 1, you can plot graphs to find out the factors that increase the likelihood of survival.

[Figure: survival by gender]

From the above graph, you can see that gender played an important role in the survival of a person. Females had a greater chance of surviving than males.

[Figure: survival by passenger class]

‘Money can’t buy everything.’ But from the above graph, you can see that people from the 1st class had a greater chance of surviving than the rest.
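
The comparisons behind graphs like these can also be computed directly. The sketch below, assuming pandas, takes the mean of 'Survived' per group, which is the survival rate for that group. The eight rows are made up for illustration, so the rates are not the real Titanic figures.

```python
import pandas as pd

# Made-up rows for illustration only (not real Titanic records).
df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Sex": ["female", "female", "female", "male",
            "male", "male", "male", "female"],
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 1],
})

# 'Survived' is 0 or 1, so its mean per group is the survival rate.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Calling rate_by_sex.plot(kind="bar") would turn the same numbers into a bar chart, provided matplotlib is installed.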

[Figure: correlation table of the columns]

If you look at the correlation between the various columns, you can see that ‘Sex’ has the strongest correlation with ‘Survived’.
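
A correlation table like this can be sketched with DataFrame.corr(), assuming pandas and that every column is already numeric (for example, after encoding 'Sex' as 0 and 1). The rows are made up for illustration, so the values are not the real Titanic correlations.

```python
import pandas as pd

# Made-up, already numeric rows for illustration only.
df = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0],
    "Sex": [1, 1, 0, 0, 1, 0, 0, 1],   # 0 = male, 1 = female
    "Age": [30, 25, 40, 35, 28, 50, 45, 20],
})

# Pairwise Pearson correlation between all columns.
corr = df.corr()

# Rank the other columns by the strength (absolute value) of their
# correlation with 'Survived'.
strength = corr["Survived"].drop("Survived").abs().sort_values(ascending=False)

print(corr)
print(strength)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.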

Stage 4:

Now that you have explored the data, you can draw some conclusions. You can see that gender played a very important role in the survival of an individual: during the shipwreck, females were given priority. You can also see that passengers in class 1 had a greater priority than the others. Thus, you can say that a female passenger from the 1st class would have had priority over the others. On the other hand, the correlation table shows that age did not play a major role in the survival of a passenger.

Stage 5:

Communicating results is the final stage of the data analysis process. Now that you are done with your conclusions, you have to communicate these results to your team members. This can be done through blogs, PowerPoint presentations, data visualizations such as graphs, and so on. Communication skills are also important in this step: you need to be clear and confident to be a good data storyteller.

A point to be noted!!

Note that these 5 stages don’t need to be followed in a linear order. The questions raised in stage 5 may lead you back to stage 1. Likewise, exploratory data analysis may leave you with new questions that also lead you back to stage 1. So there is no hard and fast rule that these stages must be followed in order.

To Conclude…

From small businesses to global enterprises, the amount of data businesses generate today is simply staggering.

However, without data analysis, this mountain of data hardly does much other than clog up cloud storage and databases. To uncover a variety of insights that sit within your systems, data analysis is important.

More articles by Purbita Sur
