Exploratory Data Analysis

A quick snapshot of Exploratory Data Analysis (EDA)

- EDA is done to turn the available raw data into useful information by bringing out the main features of the data.

- When performing EDA, we take the following steps:

  • use Visual Displays along with Numerical Summaries
  • describe the overall pattern and mention any deviations from it
  • interpret the results in context

- For the distribution of a single variable, we distinguish between categorical and quantitative variables

 Categorical Variable

 - Display: pie-chart/bar-chart

 - Numerical Summary: category group percentages
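The category percentages can be computed in a few lines. Below is a minimal sketch using only the Python standard library; the function name `category_percentages` and the `blood_types` sample are made up for illustration.

```python
from collections import Counter


def category_percentages(values):
    """Return {category: percentage of observations} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical sample data, for illustration only.
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]
percentages = category_percentages(blood_types)

for category, pct in percentages.items():
    print(f"{category}: {pct:.1f}%")
```

These percentages are the same numbers a pie chart or bar chart would display.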

 Quantitative Variables

 - Display: histogram (or stemplot - for small datasets)

 - Overall Pattern - shape, center, spread

 - Deviation from Pattern - outliers

 - Numerical Summaries: Descriptive Statistics

  • For a Symmetric Distribution with no outliers - mean and standard deviation (SD)
  • For all other cases (skewed distributions or distributions with outliers) - median and Inter-Quartile Range (IQR)

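- The five-number summary (Min, Q1, Median, Q3, Max) is what a Box-Plot is built from. Together with the 1.5(IQR) criterion, it is also used to detect outliers: an observation is a suspected outlier if it falls below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).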
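As an illustration, here is a rough sketch of the five-number summary and the 1.5(IQR) check in Python. It assumes the "inclusive" quartile method of `statistics.quantiles`; other quartile conventions (for example the hand method taught in many courses) can give slightly different Q1 and Q3. The helper names and the `scores` data are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


def iqr_outliers(data):
    """Return the observations flagged by the 1.5(IQR) criterion."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


# Hypothetical exam scores with one unusually high value.
scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 95]

print("five-number summary:", five_number_summary(scores))
print("suspected outliers:", iqr_outliers(scores))
```

With this sample, only the score of 95 lies beyond the upper fence, so it is the only suspected outlier.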

- A special case is the Normal distribution, to which the Standard Deviation Rule applies. This rule tells us what percentage of observations fall within 1, 2 or 3 standard deviations of the mean: approximately 68%, 95% and 99.7% respectively. In other words, for a Normal distribution nearly all observations (99.7%) fall within 3 standard deviations of the mean.
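The percentages in the Standard Deviation Rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an assumption made for this example.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, sd=1.0):
    """Proportion of a Normal distribution lying within k SDs of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The result does not depend on the particular mean and SD chosen, which is why the rule holds for every Normal distribution.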

- When examining the relationship between two variables, we first classify each variable by its role and its type, and only then choose the right tool for summarizing the data. Let's label a Categorical variable as "C" and a Quantitative variable as "Q". The Response variable is the dependent (predicted) variable, and the Explanatory variable is the independent (predictor) variable.

  • Case 1: Explanatory Variable: C & Response Variable Q:

Display: side-by-side boxplots.

Numerical summaries: descriptive statistics of the response variable, for each value (category) of the explanatory variable separately.
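A minimal sketch of Case 1 in Python: the response values are grouped by category and each group is summarized separately. The function name `summarize_by_group` and the sample `observations` are hypothetical.

```python
import statistics
from collections import defaultdict


def summarize_by_group(pairs):
    """Summarize a quantitative response separately for each category.

    `pairs` is an iterable of (category, value) tuples.
    """
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)

    summary = {}
    for category, values in groups.items():
        summary[category] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # The sample SD needs at least two observations.
            "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Hypothetical data: (group, measurement).
observations = [("A", 10), ("A", 12), ("A", 14), ("B", 20), ("B", 22), ("B", 30)]
group_summary = summarize_by_group(observations)

for category, stats in group_summary.items():
    print(category, stats)
```

Each row of this output corresponds to one box in the side-by-side boxplots.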

  • Case 2: Explanatory Variable: C & Response Variable C:

Display: two-way table.

Numerical summaries: conditional percentages (of the response variable for each value (category) of the explanatory variable separately).
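For Case 2, the conditional percentages are row percentages of the two-way table. The sketch below builds the table with the standard library; the function name `conditional_percentages` and the `records` sample are made up for illustration.

```python
from collections import Counter, defaultdict


def conditional_percentages(pairs):
    """Percentages of the response within each category of the explanatory variable.

    `pairs` is an iterable of (explanatory_category, response_category) tuples.
    """
    table = defaultdict(Counter)  # two-way table of counts
    for explanatory, response in pairs:
        table[explanatory][response] += 1

    result = {}
    for explanatory, counts in table.items():
        row_total = sum(counts.values())
        result[explanatory] = {
            response: 100 * count / row_total for response, count in counts.items()
        }
    return result


# Hypothetical data: (group, outcome).
records = (
    [("treatment", "improved")] * 3
    + [("treatment", "not improved")] * 1
    + [("control", "improved")] * 1
    + [("control", "not improved")] * 3
)
row_percentages = conditional_percentages(records)

for group, pcts in row_percentages.items():
    print(group, pcts)
```

Comparing the rows (here 75% improved against 25% improved) is what reveals the relationship between the two variables.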

  • Case 3: Explanatory Variable: Q & Response Variable Q:

Display: scatterplot. When reading a scatterplot, the following considerations are important:

Overall pattern → direction, form, strength; Deviations from the pattern → outliers

Numerical summaries: the correlation coefficient (r) measures the direction and strength of the linear relationship. r always lies between -1 and 1; the closer r is to 1 (or -1), the stronger the positive (or negative) linear relationship. r is unitless and is strongly influenced by outliers.
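The properties of r mentioned above can be seen in a small example. This is a sketch that computes the Pearson correlation directly from its definition; the function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data lying exactly on a line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r_clean = pearson_r(x, y)                        # perfect positive linear relationship
r_rescaled = pearson_r([v * 100 for v in x], y)  # changing the units of x
r_outlier = pearson_r(x + [6], y + [0])          # one outlier added

print(f"clean data:       r = {r_clean:.3f}")
print(f"x in other units: r = {r_rescaled:.3f}")
print(f"with one outlier: r = {r_outlier:.3f}")
```

Rescaling x leaves r unchanged (r is unitless), while a single outlier pulls r far away from 1.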
