Exploratory Data Analysis

Sthiti Mishra

Published Sep 25, 2018

A quick snapshot of Exploratory Data Analysis (EDA)

- EDA turns the available raw data into an informative summary in which the main features of the data are highlighted.

- When performing EDA, we take the following steps:

use visual displays along with numerical summaries
describe the overall pattern and mention any deviations from it
interpret the results in context

- For the distribution of a single variable, we distinguish between categorical and quantitative variables

Categorical Variables

- Display: pie-chart/bar-chart

- Numerical Summary: category group percentages
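As a minimal sketch of the numerical summary for a categorical variable, the snippet below computes category percentages with Python's standard library. The function name `category_percentages` and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter

def category_percentages(values):
    """Return {category: percentage of observations} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}

# Hypothetical sample: transport mode of 8 commuters
transport = ["bus", "car", "car", "bike", "car", "bus", "car", "bike"]
percentages = category_percentages(transport)

for category, pct in sorted(percentages.items()):
    print(f"{category}: {pct:.1f}%")
```

These percentages are the numbers a pie chart or bar chart of the same variable would display.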

Quantitative Variables

- Display: histogram (or stemplot - for small datasets)

- Overall Pattern - shape, center, spread

- Deviations from the Pattern - outliers

- Numerical Summaries: Descriptive Statistics

For a symmetric distribution with no outliers, use the mean and standard deviation (SD)
For all other distributions, use the median and Inter-Quartile Range (IQR)
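To see why the choice of summary matters, here is a small sketch that computes both pairs of measures with and without a single extreme value. The helper `center_and_spread` and the data are hypothetical. Quartile conventions differ between tools, so the IQR reported here (Python's default `statistics.quantiles` method) may not match other software exactly.

```python
import statistics

def center_and_spread(data):
    """Return mean/SD and median/IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default quartile convention
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),      # sample standard deviation
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }

# Hypothetical data: a roughly symmetric sample, then the same sample plus one outlier
clean = [48, 49, 50, 50, 50, 51, 52]
with_outlier = clean + [500]

summary_clean = center_and_spread(clean)
summary_outlier = center_and_spread(with_outlier)

print(summary_clean)
print(summary_outlier)
```

In this example the single outlier pulls the mean from 50 to 106.25 and inflates the SD, while the median stays at 50, which is why the median and IQR are preferred for skewed data or data with outliers.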

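- The five-number summary (Min, Q1, Median, Q3, Max) is what is needed to build a boxplot. The 1.5(IQR) criterion is used for detecting outliers: an observation is a suspected outlier if it lies below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).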
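The five-number summary and the 1.5(IQR) criterion can be sketched as follows. The function names and the sample scores are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is only one of several common conventions.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def iqr_outliers(data):
    """Return observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Hypothetical exam scores with one unusually high value
scores = [52, 55, 57, 58, 60, 61, 63, 65, 98]

print(five_number_summary(scores))
print(iqr_outliers(scores))
```

For these scores Q1 = 57 and Q3 = 63, so IQR = 6 and the fences are 48 and 72; the score 98 falls above the upper fence and is flagged.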

- A special case is the Normal distribution, where the Standard Deviation Rule applies. This rule tells us approximately what percentage of observations fall within 1, 2, or 3 standard deviations of the mean: about 68%, 95%, and 99.7%, respectively. For a Normal distribution, then, nearly all observations (99.7%) fall within 3 standard deviations of the mean.
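The percentages in the Standard Deviation Rule can be checked numerically. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a Normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

Because any Normal distribution is a shifted and rescaled version of the standard one, the same proportions hold whatever the mean and SD are.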

- When examining the relationship between two variables, we first classify the variables by role and type before choosing the right tool for summarizing the data. Let's denote a categorical variable as "C" and a quantitative variable as "Q". The response variable is the dependent (predicted) variable, and the explanatory variable is the independent (predictor) variable.

Case 1: Explanatory Variable: C & Response Variable Q:

Display: side-by-side boxplots.

Numerical summaries: descriptive statistics of the response variable, for each value (category) of the explanatory variable separately.
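A minimal sketch of Case 1: group the quantitative response by each category of the explanatory variable and summarize each group separately. The function `summarize_by_category` and the observations are hypothetical.

```python
import statistics
from collections import defaultdict

def summarize_by_category(pairs):
    """Summarize a quantitative response separately for each category.

    `pairs` is an iterable of (category, value) tuples.
    """
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }

# Hypothetical data: (teaching method, test score)
observations = [("A", 10), ("A", 12), ("A", 14), ("B", 20), ("B", 22), ("B", 30)]
summary = summarize_by_category(observations)

for category, stats in summary.items():
    print(category, stats)
```

Each per-category summary corresponds to one box in the side-by-side boxplots.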

Case 2: Explanatory Variable: C & Response Variable C:

Display: two-way table.

Numerical summaries: conditional percentages (of the response variable for each value (category) of the explanatory variable separately).
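For Case 2, the conditional percentages can be sketched by counting responses within each category of the explanatory variable and dividing by that category's total. The function `conditional_percentages` and the survey data are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_percentages(pairs):
    """Return {explanatory: {response: percentage within that explanatory category}}.

    `pairs` is an iterable of (explanatory, response) tuples.
    """
    counts = defaultdict(Counter)  # two-way table of raw counts
    for explanatory, response in pairs:
        counts[explanatory][response] += 1

    table = {}
    for explanatory, row in counts.items():
        row_total = sum(row.values())
        table[explanatory] = {resp: 100 * n / row_total for resp, n in row.items()}
    return table

# Hypothetical survey: (area, owns a car?)
survey = [
    ("urban", "yes"), ("urban", "yes"), ("urban", "yes"), ("urban", "no"),
    ("rural", "yes"), ("rural", "no"), ("rural", "no"), ("rural", "no"),
]
table = conditional_percentages(survey)

for area, row in table.items():
    print(area, row)
```

The percentages are computed within each row (each explanatory category), so every row sums to 100% and the rows can be compared directly even when group sizes differ.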

Case 3: Explanatory Variable: Q & Response Variable Q:

Display: scatterplot. When reading a scatterplot, consider the following:

Overall pattern → direction, form, strength; Deviations from the pattern → outliers

Numerical summaries: the correlation coefficient (r) measures the direction and strength of the linear relationship. r ranges from -1 to 1; the closer r is to 1 (or -1), the stronger the positive (or negative) linear relationship. r is unitless and is strongly influenced by outliers.
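The properties of r described above can be illustrated with a short sketch that computes the Pearson correlation coefficient directly from its definition. The function `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

r_clean = pearson_r(hours, score)
# Same data with the hours expressed in minutes: r should not change (unitless)
r_minutes = pearson_r([h * 60 for h in hours], score)
# One added outlier: 6 hours studied but a score of 20
r_with_outlier = pearson_r(hours + [6], score + [20])

print(round(r_clean, 3), round(r_minutes, 3), round(r_with_outlier, 3))
```

In this example r is above 0.99 for the original data, unchanged when the units change, and turns negative once the single outlier is added, which shows how sensitive r is to outliers.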
