Exploratory Data Analysis
Overview
This is the continuation of my previous article on EDA.
Missing values treatment
Most real-world datasets are incomplete, so we have to deal with missing values. We commonly come across NaN, NA, and NULL. What is the difference between them? The pandas DataFrame was inspired by R's data frame, and R has both NA and NULL. But pandas is built on top of NumPy, which only has NaN, short for "Not a Number". So we can use NumPy's isnan() or pandas' isna() and isnull() functions to check for missing values; isnull() is an alias of isna().

In raw data, missing values can appear as arbitrary strings such as "NA", "n/a", "--", or "NAN". So, right at the start, we need to collect the list of these missing-value formats and pass it in while reading the data, so that the pandas DataFrame treats these strings as missing values.

There are two ways of handling missing values: removing and imputing. In the first case, if only a small portion of the data is missing, we can simply drop those rows. In the second case, we impute the missing data with a substitute value: typically the mean for numerical variables and the mode for categorical ones. We can also use backward-fill or forward-fill methods, which fill a missing value with the next or the previous value in the dataset, respectively.
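Here is a minimal sketch of these options in pandas. The file name data.csv and the columns age and gender are hypothetical, just for illustration:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
# Strings that should be read as missing values:
missing_formats = ["NA", "n/a", "--", "NAN"]
df = pd.read_csv("data.csv", na_values=missing_formats)

# Count missing values per column (isnull() is an alias of isna()).
print(df.isna().sum())

# Option 1: remove rows with missing values (fine when they are few).
df_dropped = df.dropna()

# Option 2: impute -- mean for a numerical column, mode for a categorical one.
df["age"] = df["age"].fillna(df["age"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Option 3: forward fill / backward fill from neighbouring rows.
df_ffill = df.ffill()  # fill with the previous value
df_bfill = df.bfill()  # fill with the next value
```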
Outlier treatment
What is an outlier? Something that lies outside? Outside of what? Generally, if we see an abnormal value in the data, we call it an outlier. There is another, similar term in data science: anomaly. What is the difference between an outlier and an anomaly? Data is generated by some data source. An outlier is an observation generated by the same data source as the rest of the data, but with an abnormal value. An anomaly, on the other hand, is an observation generated by a different data source altogether. For example, fraud detection involves anomaly detection algorithms.

How do we decide whether a value is an outlier? There are several methods. One of them is the standard deviation method. We know that about 99.7% of the data in a normal distribution lies within 3 standard deviations of the mean, so we can treat the data outside 3 standard deviations as outliers.
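A minimal sketch of the standard deviation method, on synthetic data made up for illustration:

```python
import numpy as np

# Synthetic data: a normal distribution plus two injected extreme values.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 1000), [120.0, -30.0]])

mean, std = data.mean(), data.std()

# Flag anything more than 3 standard deviations from the mean.
outliers = data[np.abs(data - mean) > 3 * std]
print(outliers)  # the injected values 120 and -30 show up here
```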
A second method uses box plots. Box plots visually represent the distribution of the data using its quartiles and interquartile range.

[Figure: a box plot annotated with Q1 (first quartile), Q2 (second quartile, the median), Q3 (third quartile), and the interquartile range (IQR).]

The diagram above represents a box plot, in which the interquartile range is Q3 − Q1. If we mark these quartiles along with standard deviations on a normal distribution, we get the diagram below.

[Figure: the quartiles of a box plot mapped onto a normal distribution with standard deviations.]
So outliers are the data points that lie below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
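A sketch of this IQR rule in NumPy, again on made-up numbers:

```python
import numpy as np

# Made-up sample with two obvious outliers.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102,
                 12, 14, 17, 19, 107, 10, 13, 12, 14, 12])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # extreme values such as 102 and 107 are flagged
```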
Variable transformation
The name itself says that we are transforming a variable. But why do we do that? Machine learning algorithms are based on statistics and mathematics, and if our features are on very different scales, they will not yield a good model. We use two methods for this: standardization and normalization. What is the difference between them? In standardization we transform the data in such a way that its mean is 0 and its standard deviation is 1. This transformation expresses each value as the number of standard deviations it lies away from the mean; the standardized value is called a Z-score. In normalization we transform the data in such a way that the values lie between 0 and 1.
Normalization formula: x' = (x − min(x)) / (max(x) − min(x))
Standardization formula: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation
So when do we use which? If we have many outliers, standardization is the safer choice, because normalization squeezes everything into the 0 to 1 range and a single extreme value can compress the rest of the data. If the distribution is not like a normal distribution, normalization is the usual choice.
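A minimal sketch of both transformations using scikit-learn's StandardScaler and MinMaxScaler on a made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column, for illustration only.
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Standardization: mean 0, standard deviation 1 (Z-scores).
z = StandardScaler().fit_transform(x)
print(z.ravel())  # roughly [-1.41 -0.71  0.    0.71  1.41]

# Normalization (min-max scaling): values between 0 and 1.
n = MinMaxScaler().fit_transform(x)
print(n.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```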
Variable creation
Why do we create new variables? The variable transformation step above solves the scaling problem for numerical variables, but categorical variables have problems of their own. Categorical variables are represented by text, and machine learning algorithms only know how to deal with numbers, so we need to convert our categorical variables into numbers. Say we have two categories (MALE, FEMALE) in a categorical variable GENDER. We can represent those values with 1 and 0 (say MALE = 1 and FEMALE = 0). That is fine so far. Now say we have a categorical variable called COUNTRY with values (SINGAPORE, JAPAN, and INDIA); we could represent them with 0, 1, and 2 (SINGAPORE = 0, JAPAN = 1, INDIA = 2). But here there is a problem. Algorithms are not as smart as you. They can mistakenly conclude that the target variable depends less on SINGAPORE because it is represented by 0, and more on INDIA because it is represented by 2. In other words, the algorithm treats the encoded number as a magnitude when deciding how strongly the category affects the target, which is wrong. So instead of representing these values with 0, 1, and 2, we create new variables: one column each for SINGAPORE, JAPAN, and INDIA, with 1 or 0 indicating presence. So far so good. But there is another problem.
If we represent the data like this, the correlation between these columns becomes high. Say our columns are SINGAPORE, JAPAN, INDIA. To represent a value for INDIA, the column values are 0, 0, 1; that is, if the first two column values are zero, the third column value must be 1. Correlation between a predictor and the target variable is good, because high dependency is useful for our model, but correlation between the predictors themselves (here, the dummy columns) is not. This is called the DUMMY VARIABLE TRAP, and to avoid it we need to drop one of the three columns.
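A minimal sketch with pandas' get_dummies, reusing the hypothetical COUNTRY example from above; drop_first=True drops one column to avoid the trap:

```python
import pandas as pd

# Hypothetical data, reusing the COUNTRY example from above.
df = pd.DataFrame({"COUNTRY": ["SINGAPORE", "JAPAN", "INDIA", "JAPAN"]})

# One column per category, with 1/0 indicating presence.
dummies = pd.get_dummies(df["COUNTRY"], dtype=int)
print(dummies)

# drop_first=True removes one column to avoid the dummy variable trap:
# a row of all zeros now unambiguously means the dropped category.
dummies_safe = pd.get_dummies(df["COUNTRY"], drop_first=True, dtype=int)
print(dummies_safe)
```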
So guys, that’s it for now. This is the end of my EXPLORATORY DATA ANALYSIS article.
Thank you.