Exploratory Data Analysis
Overview
Understanding Exploratory Data Analysis from the basics, using Python.
Introduction
Data science is the art and science of extracting actionable insights that support business decision making. Data analysis is a branch of data science: the process of inspecting, transforming and modeling data with the goal of discovering useful insights. Data analytics is, in turn, a branch of data analysis: the application of algorithmic or statistical processes to data in order to carry out that analysis. There are different methods for performing data analytics.
Any data analysis process goes through the following stages.
Problem: Define the business problem and what the business actually wants to achieve.
Data: Understand the data points and the constraints of the different data sources, then formulate a data strategy. What is a data strategy here? It is a plan that a company makes so that it focuses on the specific data that is really useful for achieving its business goals, instead of drowning in all the data available.
Model: The process of converting the strategic data into a format (physical, logical and conceptual) that can be used for analysis.
Analysis: The process of understanding and analyzing the data using statistics and math.
Conclusion: The process of deciding whether the experiment results support or contradict the hypothesis. What is a hypothesis here? Generally, when we take on a business problem, domain experts and statisticians form a hypothesis about it. It is a kind of assumption. Statisticians then evaluate it with hypothesis testing and produce statistics that either support or contradict the hypothesis.
Generally we use 3 methods for data analysis. They differ in the sequence of steps that we follow.
1. Classical method: Problem => Data => Model => Analysis => Conclusions
2. Exploratory data analysis (EDA): Problem => Data => Analysis => Model => Conclusions
3. Bayesian method: Problem => Data => Model => Prior distribution => Analysis => Conclusions
What is a prior distribution here? On top of the model, analysts incorporate scientific or engineering expertise by imposing a data-independent distribution on the parameters of the selected model.
So now we know where Exploratory Data Analysis (EDA) comes from, and we can move on to a detailed look at EDA itself. For this I am using a weather dataset. Below is the raw dataset.
We are going to use the pandas, NumPy, Matplotlib and Seaborn packages in Python. Load the dataset into a pandas DataFrame and clean the data.
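A minimal sketch of the load-and-clean step could look like the following. The real weather file is not reproduced here, so the rows, the RAIN column and the use of 9999.9 as a "missing" marker are all assumptions made up for illustration; only the TEMP, MAX and MIN column names come from the article.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather file (made-up rows, not the real data).
raw_csv = io.StringIO(
    "DATE,TEMP,MAX,MIN,RAIN\n"
    "2020-01-01,41.2,50.0,33.1,no\n"
    "2020-01-02,43.5,52.3,35.6,yes\n"
    "2020-01-02,43.5,52.3,35.6,yes\n"
    "2020-01-03,9999.9,55.0,36.0,no\n"
    "2020-01-04,47.8,,38.4,yes\n"
)

# Load the data into a DataFrame and parse the date column.
df = pd.read_csv(raw_csv, parse_dates=["DATE"])

# Treat the sentinel value 9999.9 as missing (an assumed convention for this sketch).
df = df.replace(9999.9, np.nan)

# Drop exact duplicate rows, then drop rows that still have a missing value.
clean = df.drop_duplicates().dropna().reset_index(drop=True)

print(clean)
```

Dropping rows is the bluntest way to clean. Whether to drop or fill missing values depends on the dataset.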
Data exploration means describing the data by means of statistical and visualization techniques. We explore the data to understand its features and to bring the important features into our model. What are features in data? Data is nothing but a collection of events, and we try to describe those events with certain features. They are properties of the data, and in a dataset they are the columns. The quality of the input decides the quality of the output, so our data quality decides our prediction quality. Generally it’s a cyclic process where we feed the prediction quality back into improving the data quality, and as we increase the data quality we should get better results. This is why exploration is often said to take up almost 70% of the total project. Below are the steps involved in the data exploration process.
Before going into data exploration, we need to understand what a normal distribution is. It’s a perfect bell curve of a distribution, and many quantities in nature roughly follow it. The Central Limit Theorem says that if you take almost any distribution (one with a finite variance), draw many sufficiently large samples from it and plot the means (averages) of those samples, the means form an approximately normal distribution. It looks as below.
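The Central Limit Theorem can be checked with a small simulation. This is only a sketch: the exponential population, the sample size of 50 and the 2,000 repetitions are arbitrary choices of mine, not values from the weather dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A clearly non-normal population: exponential with mean 2.0 (assumed for illustration).
population = rng.exponential(scale=2.0, size=100_000)

# Draw 2,000 samples of 50 observations each and keep the mean of every sample.
sample_size = 50
samples = rng.choice(population, size=(2_000, sample_size))
sample_means = samples.mean(axis=1)

# The sample means cluster around the population mean, and their spread
# is close to (population standard deviation) / sqrt(sample size).
print("population mean:        ", population.mean())
print("mean of sample means:   ", sample_means.mean())
print("std of sample means:    ", sample_means.std())
print("population std / sqrt(n):", population.std() / np.sqrt(sample_size))
```

A histogram of sample_means would look close to a bell curve even though the population itself is heavily skewed.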
PROPERTIES OF NORMAL DISTRIBUTION
The normal distribution curve is symmetrical. That means if you cut the distribution at its center, the two halves will be exact mirror images of each other.
The mean, median and mode are exactly the same (in an ideal situation). In practice, you should find them very close to each other.
The total area under the curve is equal to 1.
The normal distribution is completely described by two parameters, i.e. the mean and the standard deviation.
The tails of the curve never meet the x-axis.
Steps of Data Exploration
1. Variable Identification
2. Univariate Analysis
3. Bivariate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Variable Identification
Identify the predictor (input) and target (output) variables. Then identify each variable’s data type and category. Variable data types are things like character or numeric. Variable categories are of 2 types: numeric and categorical. Numeric variables take values from a large, ordered range of numbers. For example, student marks can be anything from 0 to 100. Categorical variables take values from a small finite set, like student gender.
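In pandas, variable identification could be sketched as below. The small DataFrame, the RAIN column and the choice of RAIN as the target are assumptions for illustration; only TEMP, MAX and MIN are names used in this article.

```python
import pandas as pd

# Hypothetical sample of the weather data (made-up values).
df = pd.DataFrame({
    "TEMP": [41.2, 43.5, 45.1, 47.8],
    "MAX": [50.0, 52.3, 55.0, 57.9],
    "MIN": [33.1, 35.6, 36.0, 38.4],
    "RAIN": ["no", "yes", "no", "yes"],
})

# Assumed split into target (output) and predictors (inputs).
target = "RAIN"
predictors = [col for col in df.columns if col != target]

# Data type of every column.
print(df.dtypes)

# Variable categories: numeric columns vs. everything else (treated as categorical).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```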
Univariate Analysis
It’s a descriptive analysis of a single variable. The purpose of univariate analysis is to summarize the characteristics of a single variable in a few understandable numbers.
We use the measures below to describe it.
1. Frequency distribution
2. Central tendency
3. Dispersion
Frequency distribution
It’s the number of times each value (or range of values) of a variable is observed in a sample. We use histograms and box plots for numeric variables and bar charts for categorical variables to see the frequency distribution. Below are histograms with distributions for our sample variables.
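The counts behind a histogram or a bar chart can be computed directly. This sketch uses made-up values and an assumed RAIN column, and it computes the numbers rather than drawing the charts.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({
    "TEMP": [41, 43, 45, 47, 44, 46, 52, 58],
    "RAIN": ["no", "yes", "no", "yes", "no", "no", "yes", "no"],
})

# Numeric variable: number of observations per bin (what a histogram draws).
counts, bin_edges = np.histogram(df["TEMP"], bins=3)
print("bin edges:", bin_edges)
print("counts:   ", counts)

# Categorical variable: number of observations per category (what a bar chart draws).
rain_counts = df["RAIN"].value_counts()
print(rain_counts)
```

With Matplotlib installed, df["TEMP"].plot.hist(bins=3) and rain_counts.plot.bar() draw the corresponding charts.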
Skewness
It’s the measure of asymmetry in the distribution. Skewness measures the relative size of the 2 tails. For a normal distribution the skewness is 0.
Kurtosis
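It’s the measure of peakedness or flatness of the distribution. For a normal distribution the kurtosis is 3, so the excess kurtosis (kurtosis minus 3), which is what many libraries report, is 0.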
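Both measures are available on a pandas Series. The sketch below compares simulated normal data with simulated right-skewed (exponential) data; the simulated series are my own stand-ins, not the weather variables. Note that Series.kurt() reports excess kurtosis, so a normal distribution gives a value near 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Simulated data (assumed for illustration): one symmetric, one right-skewed.
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=50_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=50_000))

# Normal data: skewness and excess kurtosis are both close to 0.
print("normal      skew:", symmetric.skew(), " excess kurtosis:", symmetric.kurt())

# Exponential data: long right tail, so positive skewness and a heavier tail.
print("exponential skew:", right_skewed.skew(), " excess kurtosis:", right_skewed.kurt())
```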
Central tendency
It’s a central or typical value of a distribution. We measure it with the following.
Mean: The arithmetic average of the distribution.
Mode: The most frequent value in the distribution.
Median: The middle value of the distribution when the values are sorted.
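A quick sketch of the three measures on a made-up series. I have included one extreme value (90) on purpose, to show that the mean is pulled toward it while the median and mode are not.

```python
import pandas as pd

# Made-up temperatures with one extreme value.
temps = pd.Series([41, 43, 43, 45, 47, 90])

mean = temps.mean()           # arithmetic average
median = temps.median()       # middle value of the sorted data
mode = temps.mode().tolist()  # most frequent value(s); there can be more than one

print("mean:", mean)      # 51.5
print("median:", median)  # 44.0
print("mode:", mode)      # [43]
```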
Dispersion
It’s the extent to which the distribution deviates from an appropriate measure of central tendency. We measure it with the following.
Range: The difference between the maximum and minimum values in the distribution.
Max & Min: The maximum and minimum values in the distribution.
Interquartile range: The difference between the values at the 75% and 25% points of the distribution.
Variance: The measure of spread of a sample. It’s the average of the squared deviations of the sample values from their mean.
Standard deviation: The square root of the variance. So what is the difference between variance and standard deviation? Both measure how far the observations are spread out from the mean. Variance is expressed in squared units of the data, while standard deviation is expressed in the same units as the data, which makes it easier to interpret.
For a normal distribution, about 68% of the distribution lies within 1 standard deviation of the mean, about 95% within 2 standard deviations and about 99.7% within 3 standard deviations.
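The dispersion measures, and the share of a normal distribution that falls within 1, 2 and 3 standard deviations, can be sketched as follows. The values are made up. I pass ddof=0 so that the variance is the plain average of squared deviations described above; the pandas default (ddof=1) divides by n - 1 instead.

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

value_range = values.max() - values.min()            # 9 - 2 = 7
iqr = values.quantile(0.75) - values.quantile(0.25)  # 5.5 - 4.0 = 1.5
variance = values.var(ddof=0)                        # average squared deviation = 4.0
std_dev = values.std(ddof=0)                         # square root of variance = 2.0

print(value_range, iqr, variance, std_dev)

# Share of simulated normal data within 1, 2 and 3 standard deviations of the mean.
rng = np.random.default_rng(seed=2)
z = rng.normal(loc=0.0, scale=1.0, size=100_000)
within = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
print(within)  # roughly 0.68, 0.95 and 0.997
```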
Bivariate Analysis
Bivariate analysis is used to find out the relation between two variables. There are 3 types, depending on the kinds of variables involved.
Continuous and Continuous
Generally we use scatter plots to show the relation between 2 continuous variables. But a scatter plot won’t tell you the strength of their association. In statistics we use correlation to quantify the linear relation between 2 variables. Correlation varies from -1 to 1.
-1: Perfect negative correlation
0: No linear correlation
1: Perfect positive correlation
So here we can see there is a strong positive correlation between TEMP and both MAX and MIN.
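A correlation matrix for TEMP, MAX and MIN could be computed as in this sketch. The numbers are made up to mimic the pattern described above, so the exact coefficients are not the ones from the real dataset.

```python
import pandas as pd

# Made-up values in which all three variables rise together.
df = pd.DataFrame({
    "TEMP": [41.2, 43.5, 45.1, 47.8, 50.3],
    "MAX": [50.0, 52.3, 55.0, 57.9, 60.1],
    "MIN": [33.1, 35.6, 36.0, 38.4, 41.0],
})

# Pairwise Pearson correlation (the default method of DataFrame.corr).
corr = df.corr()
print(corr)

# Correlation of TEMP with each of the other two variables.
print(corr.loc["TEMP", ["MAX", "MIN"]])
```

With Seaborn installed, seaborn.heatmap(corr, annot=True) is a common way to visualize such a matrix.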
Categorical and Categorical
We use a two-way table, a stacked column chart and the Chi-Square test for this analysis.
Chi-Square test
This test shows whether the evidence in the sample is strong enough to generalize the relationship to the larger population. Chi-square is based on the difference between the expected and observed frequencies in the categories of the two-way table. The test returns the probability (p-value) of the computed chi-square statistic for the given degrees of freedom.
Probability close to 0: It indicates that the two categorical variables are dependent.
Probability close to 1: It is consistent with the two variables being independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant at the 95% confidence level.
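The two-way table and the Chi-Square test can be sketched with pandas and SciPy. The SEASON and RAIN columns and their counts are hypothetical; they are not part of the weather dataset shown in this article.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data: 20 winter days and 20 summer days.
df = pd.DataFrame({
    "SEASON": ["winter"] * 20 + ["summer"] * 20,
    "RAIN": ["yes"] * 15 + ["no"] * 5 + ["yes"] * 4 + ["no"] * 16,
})

# Two-way table of observed frequencies.
table = pd.crosstab(df["SEASON"], df["RAIN"])
print(table)

# Chi-square test of independence; 'expected' holds the frequencies
# we would expect if SEASON and RAIN were independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print("chi2:", chi2, "p-value:", p_value, "degrees of freedom:", dof)

if p_value < 0.05:
    print("The relationship is significant at the 95% confidence level.")
```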
Continuous vs Categorical
While exploring the relation between categorical and continuous variables, we can draw box plots of the continuous variable for each level of the categorical variable. If the levels are few in number, the plots will not show statistical significance. To look at statistical significance we can perform a Z-test, T-test or ANOVA.
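As a sketch, ANOVA (for 3 or more levels) and the T-test (for 2 levels) are available in SciPy. The SEASON column and the temperatures are made up for illustration, and I use Welch's version of the T-test (equal_var=False), which does not assume equal variances in the two groups.

```python
import pandas as pd
from scipy.stats import f_oneway, ttest_ind

# Hypothetical data: temperature (continuous) by season (categorical).
df = pd.DataFrame({
    "SEASON": ["winter"] * 5 + ["spring"] * 5 + ["summer"] * 5,
    "TEMP": [30, 32, 35, 31, 33, 50, 52, 55, 51, 53, 70, 72, 75, 71, 73],
})

# One-way ANOVA: do the mean temperatures differ across the three seasons?
groups = [group["TEMP"].to_numpy() for _, group in df.groupby("SEASON")]
f_stat, anova_p = f_oneway(*groups)
print("ANOVA F:", f_stat, "p-value:", anova_p)

# T-test: compare just two levels, winter vs. summer.
winter = df.loc[df["SEASON"] == "winter", "TEMP"]
summer = df.loc[df["SEASON"] == "summer", "TEMP"]
t_stat, ttest_p = ttest_ind(winter, summer, equal_var=False)
print("t:", t_stat, "p-value:", ttest_p)
```

The per-level box plots themselves can be drawn with df.boxplot(column="TEMP", by="SEASON") when Matplotlib is installed.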
I will cover the remaining steps in my next article.
Thank you