EDA (Exploratory Data Analysis)

Abhimanyu Yadav

Published Dec 2, 2021

Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

It is good practice to understand the data first and to gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it.

Several steps come before making any assumptions about the data. EDA can help identify obvious errors, detect outliers, and find interesting relationships between variables/features as well as patterns hidden inside the data.

EDA techniques allow us to manipulate data effectively to check for anomalies, verify assumptions or test hypotheses. In short, EDA helps us summarize the main characteristics of a dataset, often using data visualization libraries such as Matplotlib and seaborn.

Understanding EDA with an example

The first step in EDA is to load the data, either from the local system or from an online website or repository. Here, I'll be using a telecom users dataset as an example. First, we will import the required libraries.

Importing libraries and Data
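The original code for this step is not preserved in the text, so what follows is only a minimal sketch of how it might look. It assumes pandas, NumPy, Matplotlib and seaborn are installed. The file name `telecom_users.csv`, the column names and the sample rows are all hypothetical; a small made-up inline sample stands in for the real file so that the snippet runs on its own.

```python
import io

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows standing in for the telecom users file (assumed column names).
SAMPLE_CSV = """customerID,gender,SeniorCitizen,Dependents,tenure,PhoneService,InternetService,MonthlyCharges,TotalCharges,Churn
0001-AAAAA,Male,0,Yes,72,Yes,No,24.10,1734.65,No
0002-BBBBB,Female,0,No,44,Yes,Fiber optic,88.15,3973.20,No
0003-CCCCC,Female,1,No,38,Yes,Fiber optic,74.95,2869.85,Yes
0004-DDDDD,Male,0,No,4,Yes,DSL,55.90,,No
0005-EEEEE,Male,0,No,2,Yes,DSL,53.45,119.50,No
"""

# With a real file on disk this would be, for example:
#     df = pd.read_csv("telecom_users.csv")   # hypothetical file name
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)
```

`pd.read_csv` accepts a local path, a URL or a file-like object, so the same call covers both the local and the online case mentioned above.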

You can run the above lines of code in your Jupyter Notebook.

Checking the general characteristics

After importing the data, we will first check its general characteristics, i.e. shape, features, data types, etc.

Here, we are checking the head of our dataset, i.e. its top five rows. At the bottom-left corner of the output, the size is shown as '5 x 22', which means the preview has 5 rows and our dataset has 22 columns.

Here, we are checking the shape of our dataset. The shape is '5986 x 22', which means we have 5986 rows and 22 columns.

Here, we are checking the names of our columns. The names appear in order, and they can be used to access individual columns by indexing.
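The three checks described above (head, shape and column names) can be sketched as follows. The small DataFrame is made-up sample data, not the real telecom dataset, and its column names are assumptions.

```python
import pandas as pd

# Made-up sample rows standing in for the telecom users data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "tenure": [72, 44, 38, 4, 2, 70],
    "MonthlyCharges": [24.10, 88.15, 74.95, 55.90, 53.45, 49.85],
    "Churn": ["No", "No", "Yes", "No", "No", "No"],
})

first_rows = df.head()           # first five rows by default
n_rows, n_cols = df.shape        # (number of rows, number of columns)
column_names = list(df.columns)  # column names in their original order

print(first_rows)
print(n_rows, n_cols)
print(column_names)

# Column names can be used to select a single column or a subset of columns.
monthly = df["MonthlyCharges"]
subset = df[["gender", "Churn"]]
```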

After checking the names and number of columns, we now check the total number of null values present in each column.

As we can see, there are only 10 null values in the column 'TotalCharges', so we are now going to fill the missing values with either the mean, mode or median, depending on the situation.

Here, we filled the missing values with the median (since there are only 10 missing values, it won't affect our data very much) and then checked again for null values.
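A minimal sketch of the null check and the median fill described above, using made-up numbers in place of the real 'TotalCharges' column:

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing values in 'TotalCharges'.
df = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 48],
    "TotalCharges": [20.0, np.nan, 1500.0, 2500.0, np.nan],
})

# Count the null values in each column.
nulls_before = df.isnull().sum()
print(nulls_before)

# The median ignores missing values by default.
median_total = df["TotalCharges"].median()

# Replace the missing entries with the median, then check again.
df["TotalCharges"] = df["TotalCharges"].fillna(median_total)
nulls_after = df.isnull().sum()
print(nulls_after)
```

The median is a reasonable default for a skewed numeric column because it is less sensitive to extreme values than the mean. For a categorical column the mode would be the usual choice.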

After filling the null values and dropping the columns we do not need (they would contribute little to the final ML model), we are now going to check the number of unique values in each column.

As we can see, the column 'gender' has two unique values, i.e. 'Male' and 'Female'. Similarly, we can check the values of other columns.

After checking the unique values, we look at the data type of each column. Next, we are going to draw some further insights from the data.
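The unique-value and data-type checks above could look like the sketch below. The sample data is made up, and the columns other than 'gender' and 'MonthlyCharges' are assumptions.

```python
import pandas as pd

# Made-up sample rows standing in for the telecom users data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "SeniorCitizen": [0, 0, 1, 0],
    "MonthlyCharges": [24.10, 88.15, 74.95, 55.90],
})

# Number of distinct values in each column.
unique_counts = df.nunique()
print(unique_counts)

# The distinct values of one column.
gender_values = sorted(df["gender"].unique())
print(gender_values)

# Data type of each column.
column_types = df.dtypes
print(column_types)
```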

As we can see, there is not much difference between the genders in the mean values of 'MonthlyCharges' and 'TotalCharges'.

As we can see, there is not much difference between the genders in the total count of users with internet service.
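Comparisons like the two above can be made with a group-by and a cross-tabulation. The sketch below uses made-up numbers, and the column name 'InternetService' is an assumption, so its output says nothing about the real dataset.

```python
import pandas as pd

# Made-up sample rows standing in for the telecom users data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "InternetService": ["DSL", "Fiber optic", "No", "Fiber optic", "DSL", "DSL"],
    "MonthlyCharges": [50.0, 90.0, 20.0, 80.0, 50.0, 40.0],
    "TotalCharges": [500.0, 2700.0, 400.0, 1600.0, 1200.0, 800.0],
})

# Mean charges for each gender.
mean_charges = df.groupby("gender")[["MonthlyCharges", "TotalCharges"]].mean()
print(mean_charges)

# Number of users of each gender per internet service type.
internet_counts = pd.crosstab(df["gender"], df["InternetService"])
print(internet_counts)
```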

Visualization of data

After inspecting our dataset, we are now going to visualize some of the interesting relationships between the features.

Here, we use subplots together with the seaborn library to plot multiple figures at once, which gives the following output.

The plots show how the users are divided by age, gender and dependents.

Again, we used subplots with the seaborn library to draw multiple plots at once, which gives the following output.

Here, we can see churn with respect to 'PhoneService' users, followed by box plots of tenure and total charges.

Using subplots, we are now going to plot churn based on monthly charges and tenure.

As we can see, the density of churned customers varies with the value of monthly charges as well as tenure.
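A figure like the one described above could be produced with a sketch along these lines. It assumes the subplots come from Matplotlib's `plt.subplots` and that seaborn draws into each axis. The sample data is made up, and 'SeniorCitizen' is an assumed stand-in for the age feature.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sample rows standing in for the telecom users data.
df = pd.DataFrame({
    "SeniorCitizen": [0, 0, 1, 0, 1, 0],
    "gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Dependents": ["Yes", "No", "No", "No", "Yes", "No"],
})

columns_to_plot = ["SeniorCitizen", "gender", "Dependents"]

# One row of axes, one count plot per column.
fig, axes = plt.subplots(1, len(columns_to_plot), figsize=(15, 4))
for ax, column in zip(axes, columns_to_plot):
    sns.countplot(data=df, x=column, ax=ax)
    ax.set_title(f"Users by {column}")

fig.tight_layout()
```

In a Jupyter Notebook the figure is displayed at the end of the cell. In a plain script, `plt.show()` or `fig.savefig(...)` would be needed.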
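The churn count and the two box plots described above might be drawn as in the sketch below. The sample data is made up and the exact column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sample rows standing in for the telecom users data.
df = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Churn": ["No", "Yes", "No", "No", "Yes", "No"],
    "tenure": [72, 4, 38, 60, 2, 45],
    "TotalCharges": [1734.0, 220.0, 2869.0, 4100.0, 119.0, 3050.0],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Churned vs. retained users, split by phone service.
sns.countplot(data=df, x="PhoneService", hue="Churn", ax=axes[0])
axes[0].set_title("Churn by PhoneService")

# Box plots summarise the spread and reveal potential outliers.
sns.boxplot(data=df, y="tenure", ax=axes[1])
axes[1].set_title("tenure")

sns.boxplot(data=df, y="TotalCharges", ax=axes[2])
axes[2].set_title("TotalCharges")

fig.tight_layout()
```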

Steps after EDA

After EDA is performed, there are further steps such as resampling, feature engineering, normalization and rescaling, which are carried out before the data is ready for an ML model. All of these steps build on EDA, because it gives us insight into our data.
