EDA (Exploratory Data Analysis)

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

It is good practice to understand the data first and gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it.

There are many steps involved before making any assumptions. EDA can help identify obvious errors, detect outliers, and find interesting relationships between variables/features, as well as patterns hidden inside the data.

EDA techniques allow us to manipulate data effectively in order to check for anomalies, validate assumptions or test hypotheses. In short, EDA helps us summarize the main characteristics of a dataset, often using data visualization libraries such as Matplotlib and seaborn.

Understanding EDA with an example

The first step in EDA is to load the data, either from the local system or from an online website or repository. Here, I'll use a telecom users dataset as an example. First, we import the required libraries.

  • Importing libraries and Data

The code for this step can be run in a Jupyter Notebook.
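As a rough sketch of this step, the snippet below imports pandas and reads a CSV into a DataFrame. The real file is not named in this article, so a three-row inline CSV stands in for it; the rows and the small subset of columns are invented for illustration.

```python
import io

import pandas as pd
# Plotting libraries such as Matplotlib and seaborn would be imported here too.

# Hypothetical stand-in for the telecom CSV. With a real file you would pass
# its path to pd.read_csv instead of this in-memory text.
csv_text = """gender,tenure,MonthlyCharges,TotalCharges,Churn
Male,1,29.85,29.85,No
Female,34,56.95,1889.50,No
Male,2,53.85,108.15,Yes
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```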

  • Checking the general characteristics

After importing the data, we first check its general characteristics, i.e. shape, features, data types etc.

Here, we check the head of our dataset, i.e. its top five rows. The notebook output reports the size as '5 x 22' at its bottom-left corner, which means there are 22 columns in our dataset.
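A minimal sketch of this check, assuming the data lives in a pandas DataFrame called `df`. The toy frame here has only three columns and invented values, so it does not reproduce the 22 columns of the real dataset.

```python
import pandas as pd

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 2, 45, 8, 22, 10],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10, 29.75],
})

first_rows = df.head()  # head() returns the first five rows by default
print(first_rows)
```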

Here, we check the shape of our dataset. The shape is '5986 x 22', which means we have 5986 rows and 22 columns.
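The shape check might look like the sketch below. The toy frame is an assumption made so the snippet runs on its own; on the real data the same attribute would report 5986 rows and 22 columns.

```python
import pandas as pd

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")
```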

Here, we check the names of our columns. The column names appear in order, and they can be used to access individual columns by slicing/indexing.
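To illustrate, here is a small sketch of listing the column names and then using them to pick out columns. The toy frame and its three columns are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

column_names = list(df.columns)  # names in their original order
print(column_names)

by_name = df["gender"]           # access one column by its name
by_position = df[df.columns[0]]  # same column, looked up via its position
subset = df[column_names[1:]]    # a slice of names selects several columns
```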

After checking the names and number of columns, we check the total number of null values present in each column.

There are only 10 null values, all in the column 'TotalCharges', so we are now going to fill the missing values with the mean, mode or median, depending on the situation.
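One common way to count nulls per column in pandas is `isnull().sum()`; whether the original notebook used exactly this call is an assumption. The toy frame below has a single missing 'TotalCharges' value rather than the 10 found in the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing value in 'TotalCharges' (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
})

# isnull() marks missing cells as True; summing counts them per column
null_counts = df.isnull().sum()
print(null_counts)
```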

Here, we fill the missing values with the median (since there are only 10 missing values, this won't affect our data very much) and then check again for null values.
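A minimal sketch of median imputation followed by a re-check, using `fillna` on the 'TotalCharges' column. The toy values are invented, and the exact code in the original notebook may differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing value in 'TotalCharges' (values are invented).
df = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
})

# median() skips missing values by default
median_total = df["TotalCharges"].median()
df["TotalCharges"] = df["TotalCharges"].fillna(median_total)

# Re-check: every column should now report zero nulls
remaining_nulls = df.isnull().sum()
print(remaining_nulls)
```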

After filling the null values and dropping the columns we do not need (they would contribute little to the final ML model), we check the number of unique values in each column.
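The article does not say which columns were dropped, so the sketch below uses a hypothetical identifier column, 'customerID', purely as an example of a column with little predictive value. It then counts unique values per column with `nunique()`.

```python
import pandas as pd

# Toy stand-in; 'customerID' is a hypothetical column used only to show a drop.
df = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "gender": ["Male", "Female", "Male", "Female"],
    "Churn": ["No", "No", "Yes", "No"],
})

# Drop the column that is assumed not to help the model
df = df.drop(columns=["customerID"])

# Number of distinct values in each remaining column
unique_counts = df.nunique()
print(unique_counts)
```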

As we can see, the column 'gender' has two unique values, i.e. 'Male' and 'Female'. Similarly, we can check the values of the other columns.
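Listing the distinct values of a single column can be sketched with `unique()`, shown here on a toy 'gender' column with invented rows.

```python
import pandas as pd

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
})

# unique() returns the distinct values in order of first appearance
gender_values = df["gender"].unique()
print(gender_values)
```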

After checking the unique values, we look at the data type of each column. Next, we draw some further insights from the data.
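For the data-type check, a small sketch using the `dtypes` attribute follows. The toy frame is an assumption; the real dataset would list all 22 columns.

```python
import pandas as pd

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# dtypes lists the storage type pandas inferred for each column
column_types = df.dtypes
print(column_types)
```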

As we can see, there is not much difference between the mean values of 'MonthlyCharges' and 'TotalCharges' for the two genders.
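A comparison like this is typically done with a group-by. The sketch below assumes that approach and uses invented numbers, so the means it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "MonthlyCharges": [30.0, 40.0, 50.0, 60.0],
    "TotalCharges": [100.0, 200.0, 300.0, 400.0],
})

# Mean of both charge columns, computed separately for each gender
mean_charges = df.groupby("gender")[["MonthlyCharges", "TotalCharges"]].mean()
print(mean_charges)
```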

As we can see, there is not much difference between the total counts of the two genders using internet service.
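One possible way to get such counts is sketched below. The column name 'InternetService' and its values ('DSL', 'Fiber optic', 'No') are assumptions, as is the choice to filter out non-users before counting.

```python
import pandas as pd

# Toy stand-in; the 'InternetService' column and its values are assumed.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "InternetService": ["DSL", "Fiber optic", "No", "DSL", "Fiber optic", "No"],
})

# Keep only customers who have some internet service, then count per gender
has_internet = df[df["InternetService"] != "No"]
users_by_gender = has_internet["gender"].value_counts()
print(users_by_gender)
```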

  • Visualization of data

After inspecting our dataset, we now visualize some of the interesting relationships between the features.

Here, we use the subplot feature (provided by Matplotlib, on which seaborn is built) together with seaborn to draw multiple plots in a single figure.

The resulting plots show how the users are divided based on age, gender and dependents.
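A rough sketch of such a figure, assuming count plots drawn side by side with `plt.subplots` and `sns.countplot`. The column names 'SeniorCitizen' (standing in for age) and 'Dependents' are assumptions, and the rows are invented. In a Jupyter notebook the figure renders at the end of the cell.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in; 'SeniorCitizen' and 'Dependents' are assumed column names.
df = pd.DataFrame({
    "SeniorCitizen": [0, 0, 1, 0, 1, 0],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Dependents": ["No", "Yes", "No", "No", "Yes", "No"],
})

# One row of three axes; each count plot is drawn onto its own axis
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.countplot(x="SeniorCitizen", data=df, ax=axes[0])
sns.countplot(x="gender", data=df, ax=axes[1])
sns.countplot(x="Dependents", data=df, ax=axes[2])
fig.tight_layout()
```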

Running this code produces the next set of plots. Again, we used the subplot feature to draw several plots at once.

Here, we can see the churn with respect to 'PhoneService' users, followed by box plots of tenure and total charges.
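These plots could be approximated as below: a count plot of 'PhoneService' split by 'Churn', plus one box plot each for 'tenure' and 'TotalCharges'. The column names other than 'TotalCharges', the layout and the data are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the telecom data (values are invented).
df = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Churn": ["No", "Yes", "No", "No", "Yes", "No"],
    "tenure": [1, 34, 2, 45, 8, 22],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 820.50, 1949.40],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Churned vs retained customers within each PhoneService group
sns.countplot(x="PhoneService", hue="Churn", data=df, ax=axes[0])
# Distribution summaries of the two numeric columns
sns.boxplot(y="tenure", data=df, ax=axes[1])
sns.boxplot(y="TotalCharges", data=df, ax=axes[2])
fig.tight_layout()
```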

Using the subplot feature, we now plot churn against monthly charges and tenure.

As we can see, the density of churned customers varies with the value of the monthly charges as well as with tenure.

  • Steps after EDA

After EDA is performed, further steps such as resampling, feature engineering, normalization and rescaling are carried out before the data is ready for an ML model. All of these steps build on EDA, because it gives us the insights about our data that they depend on.








