The Data Analysis Process

Purbita Sur

Published Aug 7, 2019

Do you know how the data analysis process is carried out? In today’s blog, we are going to go through the steps involved in the data analysis process in detail. First, what is data analysis?

According to Wikipedia, data analysis is a process of inspecting, cleansing, transforming and modeling data to discover useful information, inform conclusions and support decision-making. The data analysis process involves five main steps:

1. Asking Questions
2. Data Wrangling/Data Munging
3. Exploratory Data Analysis (EDA)
4. Drawing Conclusions
5. Communicating Results/Data Storytelling

We will understand each step with the help of the Titanic data set.

Many of you must have watched the movie Titanic, released in 1997. For those who don’t know, the Titanic was a luxury British ship that sank in the early hours of April 15, 1912, after striking an iceberg. There were an estimated 2224 passengers and crew on board, and more than 1500 died, making it one of the worst passenger ship disasters in history. So let’s take a closer look at the people who were on board and find out their likelihood of survival.

The part of the Titanic data set used here consists of 891 rows and 12 columns.

Stage 1:

Asking the right questions is an essential part of the data analysis process. Questions should be measurable, clear and concise. To find out the likelihood of survival of the people on board, you can ask questions like:

Which columns are not necessary for our analysis?
Does gender affect the likelihood of survival?
Does passenger class affect the likelihood of survival?
Does age affect the likelihood of survival?
How strongly are the various columns correlated with each other?

Stage 2:

Data wrangling/data munging is the process of converting or mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. In simple terms, raw data is often not in the right form for analysis and must first be converted to an appropriate form.

Data wrangling can be further divided into 3 substeps:

1. Gathering Data:

Sometimes the company provides you with the required data, but that is not always the case. If you get a .csv file, you can start right away. Otherwise, you may have to fetch data from APIs or through web scraping. Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in a table (spreadsheet) format.

For example, I downloaded train.csv from Kaggle for my Titanic survival prediction model.
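As a rough sketch of this step, a CSV file can be read into a pandas DataFrame. The helper name load_passengers and the three sample rows below are made up for illustration; the rows only mimic the column layout of Kaggle's train.csv and are not real passenger records.

```python
import io

import pandas as pd

# Invented sample rows that only mimic the column layout of Kaggle's train.csv.
# They are NOT real passenger records.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T-001,7.25,,S
2,1,1,Passenger B,female,38,1,0,T-002,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T-003,7.92,,Q
"""


def load_passengers(source):
    """Read a CSV (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# With the real file you would call load_passengers("train.csv") instead.
df = load_passengers(io.StringIO(SAMPLE_CSV))
print(df.head())
```

Empty fields in the file (such as the missing Age and Cabin values above) are read in as missing values, which matters for the assessing and cleaning steps.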

2. Assessing Data:

Once you have the data, you need to assess it. This can be done by:

checking the shape of the data (number of rows and columns)
checking data types of various columns
checking for missing values
checking for duplicate data
checking the memory occupied by the dataset
getting a high-level statistical overview of the dataset

In our Titanic survival model, you can find the shape with the dataset_name.shape attribute. For this data set it gives (891, 12), that is, 891 rows and 12 columns.

Then you can call dataset_name.info() to get the data types of the various columns, along with their non-null counts and the memory usage.

You can also get a high-level statistical overview of the data set by calling dataset_name.describe().
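The checks listed in this step could look roughly like the following in pandas. The small DataFrame is invented for illustration (it is not taken from the Titanic data), so the numbers it produces say nothing about the real data set.

```python
import pandas as pd

# Invented toy rows for illustration only, not real passenger records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, 22.0],
})

print(df.shape)        # (number of rows, number of columns)
df.info()              # data types, non-null counts and memory usage
print(df.describe())   # summary statistics of the numeric columns

missing = df.isnull().sum()                       # missing values per column
duplicates = df.duplicated().sum()                # rows that repeat an earlier row
memory_bytes = df.memory_usage(deep=True).sum()   # total memory in bytes

print(missing)
print("duplicate rows:", duplicates)
print("memory (bytes):", memory_bytes)
```

Passing deep=True makes pandas count the actual size of string values, which gives a more realistic memory figure for text columns.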

3. Cleaning Data:

Now you have enough information about the features of the data set, so the next step is to clean it. First, you can delete the columns that are not necessary for the analysis. For example, I dropped Name, PassengerId, Ticket, and Cabin, as I assumed survival does not depend on them. Then I replaced ‘male’ with 0 and ‘female’ with 1, since most numerical analysis cannot be carried out directly on strings. I also removed the rows in which the age of the person is not known and the rows in which ‘Embarked’ is ‘Q’ (as there are relatively few of them). Finally, I dropped the rows that have null values in any of the remaining columns. After all these operations, the data set consists of 684 rows and 8 columns (down from 891 rows and 12 columns).

Memory usage also drops from 68.1 kB to 48.1 kB. Cleaning the data reduces the size of the data set and hence the amount of memory it occupies.
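The cleaning steps described above might be sketched as a single function. This is one possible way to write them, not necessarily the exact code used for the article; the function name clean_passengers and the toy rows are invented for illustration.

```python
import pandas as pd


def clean_passengers(df):
    """Apply the cleaning steps from the text and return a new DataFrame."""
    # Drop the columns assumed to be unnecessary for the analysis.
    # Cabin is dropped before dropna() so its many gaps do not remove rows.
    cleaned = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    # Encode gender as numbers: male -> 0, female -> 1.
    cleaned = cleaned.assign(Sex=cleaned["Sex"].map({"male": 0, "female": 1}))
    # Remove the rows where 'Embarked' is 'Q'.
    cleaned = cleaned[cleaned["Embarked"] != "Q"]
    # Remove rows with a missing value in any remaining column (e.g. Age).
    cleaned = cleaned.dropna()
    return cleaned.reset_index(drop=True)


# Invented toy rows for illustration only, not real passenger records.
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "Ticket": ["T-001", "T-002", "T-003", "T-004"],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

cleaned = clean_passengers(raw)
print(cleaned)
```

The function returns a new DataFrame and leaves the raw one untouched, so you can always compare the data before and after cleaning.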

Stage 3:

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations. For example, to answer the questions in stage 1, you can plot graphs to find out the factors that increase the likelihood of survival.

Plotting survival against gender shows that gender played an important role in the survival of a person. Females had a greater chance of surviving than males.

‘Money can’t buy everything.’ But plotting survival against passenger class shows that people from the 1st class had a greater chance of surviving than the rest.
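The numbers behind such graphs can be computed with a simple group-by. Below is a minimal sketch, assuming the cleaned data has Survived, Sex and Pclass columns; the helper name survival_rate_by and the toy rows are invented, so the rates printed here do not describe the real Titanic data.

```python
import pandas as pd

# Invented toy rows for illustration only, not real passenger records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": [0, 1, 1, 0, 1, 0],       # 0 = male, 1 = female
    "Pclass": [3, 1, 2, 3, 1, 2],
})


def survival_rate_by(data, column):
    """Share of survivors within each group of the given column."""
    # Survived is 0/1, so its mean per group is the survival rate.
    return data.groupby(column)["Survived"].mean()


by_sex = survival_rate_by(df, "Sex")
by_class = survival_rate_by(df, "Pclass")
print(by_sex)
print(by_class)
```

If matplotlib is installed, calling by_sex.plot(kind="bar") or by_class.plot(kind="bar") turns either result into a bar chart.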

Looking at the correlation between the various columns, you can see that ‘Sex’ has the strongest correlation with ‘Survived’.
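A correlation table like this can be produced with the DataFrame's corr() method. The sketch below uses invented toy rows, so the values it prints only show the mechanics and are not the correlations of the real data set.

```python
import pandas as pd

# Invented toy rows for illustration only, not real passenger records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": [0, 1, 1, 0, 1, 1],       # 0 = male, 1 = female
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Pclass": [3, 1, 3, 1, 1, 3],
})

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()

# Correlation of each column with 'Survived', strongest (by absolute value) first.
with_survived = corr["Survived"].drop("Survived")
ranked = with_survived.reindex(with_survived.abs().sort_values(ascending=False).index)

print(corr)
print(ranked)
```

Ranking by absolute value matters because a strong negative correlation is just as informative as a strong positive one.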

Stage 4:

Now that you have explored the data, you can draw some conclusions. You can see that gender played a very important role in the survival of an individual: during the shipwreck, females were given priority. You can also see that people from passenger class 1 had a greater priority than the others. Thus, you can say that a female passenger from the 1st class would have had priority over others. On the other hand, from the correlation table, you can see that age did not play a major role in the survival of a passenger during the wreck.

Stage 5:

Communicating results is the final stage of the data analysis process. Now that you are done with your conclusions, you have to communicate these results to your team members. This can be done through blogs, PowerPoint presentations, data visualization techniques like graphs, and so on. Communication skills are essential in this step: you need to be clear and confident to be a good data storyteller.

A point to be noted!!

It should be noted that these 5 stages don’t need to be followed in a linear order. Questions raised in Stage 5 might lead you back to Stage 1. Likewise, exploratory data analysis might leave you with new questions that also take you back to Stage 1.

To Conclude…

From small businesses to global enterprises, the amount of data businesses generate today is simply staggering.

However, without data analysis, this mountain of data does little more than clog up cloud storage and databases. Data analysis is what uncovers the insights that sit within your systems.
