Exploratory Data Analysis
Data analysis is an important part of any model building exercise. It is the process of inspecting, cleaning, summarizing, transforming and modelling data. Cleaning the data and preparing it for further analysis can take almost 70% of an analyst's time.
There are three popular approaches for data analysis:
1. Classical -- Problem => Data => Model => Analysis => Conclusions
2. Exploratory (EDA) -- Problem => Data => Analysis => Model => Conclusions
3. Bayesian -- Problem => Data => Model => Prior Distribution => Analysis => Conclusions
Today we will discuss Exploratory Data Analysis (EDA). EDA was promoted by John Tukey, who in 1961 defined data analysis as "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data". EDA is the second step of predictive modelling, after hypothesis testing. The goals of EDA include identifying relationships between variables that are particularly interesting or unexpected, checking for evidence for or against a stated hypothesis, checking for problems with the collected data (such as missing data or measurement error), and identifying areas where more data need to be collected.
The various steps involved in Exploratory Data Analysis are:
1. Variable Identification – In variable identification we try to identify (a sketch follows this list item):
- Type of the variable: predictor/explanatory or response/target variable
- Data type: whether the variable is numeric (continuous) or character (categorical)
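A minimal pandas sketch of this step is shown below; the file name survey.csv and all column names used here and in the later sketches are hypothetical.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Inspect the data type of each variable.
print(df.dtypes)

# Separate numeric (continuous) from character (categorical) variables.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```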
2. Univariate Analysis – Here we analyze one variable at a time, computing statistics such as the mean, median, mode, standard deviation, variance, maximum, minimum, quartiles, range, count, and count %. There are various ways to visualize data in univariate analysis (a sketch follows this list):
- Frequency Distribution Tables.
- Bar Charts.
- Histograms.
- Frequency Polygons.
- Pie Charts.
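A minimal univariate sketch, again assuming the hypothetical survey.csv with income and occupation columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
print(df.describe())

# Frequency distribution table for a categorical variable ("occupation" is assumed).
print(df["occupation"].value_counts())

# Histogram of a continuous variable ("income" is assumed).
df["income"].plot(kind="hist", bins=30, title="Income distribution")
plt.show()
```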
3. Bi-variate Analysis – Here we analyze two variables simultaneously, looking for correlations, associations, or other relationships.
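A minimal bi-variate sketch, assuming hypothetical age and income columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Pairwise Pearson correlations between the numeric variables.
print(df.corr(numeric_only=True))

# Scatter plot of two continuous variables ("age" and "income" are assumed).
df.plot(kind="scatter", x="age", y="income", title="Age vs. income")
plt.show()
```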
- Missing values treatment – My data goes missing. Yes, it happens, and this is the harsh truth of a researcher's life. If missing values are not handled properly, the researcher may end up drawing inaccurate inferences from the data. But with elaborate analysis comes complexity, one part of which is handling and treating missing data (blank/NA). Before discussing missing values treatment, one needs to know the possible reasons for missing values:
a) Data Collection: These errors occur at the time of data collection and are harder to correct. First, determine the pattern of your missing data (a sketch for checking this follows the list below). Missing data can be further divided into 3 types:
- Missing Completely at Random (MCAR): This happens when the missingness depends on neither observed nor unobserved variables; there is no pattern in the missing data on any variable. This is the best you can hope for. For example, people have not filled in the survey forms completely because the form is too long.
- Missing at Random (MAR): This happens when there is a pattern in the missing data, but it depends only on observed variables, not on your primary dependent variables. For example, respondents in service occupations are less likely to report their income.
- Missing Not at Random (MNAR): There is a pattern in the missing data that affects your primary dependent variables. For example, respondents with lower incomes are less likely to respond, which biases your conclusions about income and likelihood to recommend. Missing not at random is your worst-case scenario.
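A rough first check of the missingness pattern is to see whether the missingness in one variable varies with the observed values of another. A minimal sketch, with all column names assumed:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Count and share of missing values per variable.
print(df.isna().sum())
print(df.isna().mean().round(3))

# Rough MCAR-vs-MAR check: does the missingness of "income" vary with an
# observed variable such as "occupation"? (Both column names are assumptions.)
print(df.groupby("occupation")["income"].apply(lambda s: s.isna().mean()))
```

Note that MNAR can never be confirmed from the observed data alone, since the pattern depends on the very values that are missing.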
b) Data Extraction: It is possible that the data extraction process is not sound. While loading or reporting data into the system manually, the person in charge can make errors. In such cases we check with the data owners for the correct data. Errors and missing values are comparatively easier to deal with at this stage.
Once you know the cause of your missing data, it becomes a little easier to impute or treat the missing values. There are various ways to deal with missing data; some of them are covered below.
1) Deletion Technique: This is an approach where we “remove” variables or observations with missing data. There are broadly two types of deletion (see the sketch after this list):
· Listwise Deletion: Here we delete every observation that has a missing value in any variable of interest. If your sample is large enough, you can drop data without a substantial loss of statistical power. This method is used when the data are MCAR.
· Pairwise Deletion: Here each analysis uses all cases in which its particular variables of interest are present. The advantage of this method is that it uses all available information and keeps as many cases as possible for each analysis. The downside is that analyses cannot be compared directly, because the sample differs from one analysis to the next.
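A minimal sketch of both deletion styles; note that pandas' corr() already performs pairwise-complete computation by default:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Listwise deletion: drop every row with a missing value in any column.
listwise = df.dropna()
print(len(df) - len(listwise), "rows dropped")

# Pairwise deletion: corr() computes each pairwise correlation from all rows
# where that particular pair is observed, so different cells of the matrix
# may rest on different samples.
pairwise_corr = df.corr(numeric_only=True)
print(pairwise_corr)
```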
2) Imputation: This is the process where we “fill in” or “impute” missing values with substituted values. There are various methods used for imputation; some of them are listed below.
- Mean, Median & Mode Imputation: This is obviously not the first choice for imputation, but it is a simple strategy. Missing values in continuous data are replaced with the mean or median; for categorical variables, we use the mode (the most frequent value) to fill in the otherwise missing values. If the data exhibit some skewness (e.g., there are a small number of very large values), then the median is the better choice.
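A minimal sketch with pandas fillna(), using assumed column names:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Continuous variable: use the median if the distribution is skewed,
# the mean otherwise ("income" is an assumed column name).
df["income"] = df["income"].fillna(df["income"].median())

# Categorical variable: fill with the mode, i.e. the most frequent value.
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])
```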
- Regression Imputation: As the name suggests, here the missing values are replaced with predicted values from a regression equation. The researcher fits a regression model with the variable of interest as the response variable and other relevant variables as covariates. The coefficients are estimated, and the missing values are then predicted by the fitted model. This method uses information from the observed data, but it can overstate model fit, because the imputed values fall exactly on the regression line and so understate the true variability.
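A simple sketch with scikit-learn's LinearRegression, assuming the covariate columns (here age and years_experience, both hypothetical) have no missing values of their own:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("survey.csv")  # hypothetical dataset
covariates = ["age", "years_experience"]  # assumed fully observed covariates

# Fit the regression on the rows where the response is observed.
observed = df[df["income"].notna()]
model = LinearRegression().fit(observed[covariates], observed["income"])

# Predict the missing values from the fitted model and fill them in.
missing = df["income"].isna()
df.loc[missing, "income"] = model.predict(df.loc[missing, covariates])
```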
- Multiple Imputation: This is an iterative process of probabilistically estimating missing values based on observed information from across your dataset. The great thing about multiple imputation (MI) is that you get not only (A) decent estimates of the missing values but also (B) estimates of the increased uncertainty in your analysis due to the missing data. It is done in 3 steps (a sketch follows this list):
- Impute: The data are “filled in” with imputed values using a specified regression model. This step is repeated m times, producing a separate dataset each time.
- Analyze: The analysis is performed within each of the m datasets, yielding m sets of results at the end of this step.
- Pool: The m results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern.
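A simplified sketch using scikit-learn's experimental IterativeImputer; a production analysis would pool with Rubin's rules rather than the plain mean and standard deviation shown here, and the income column is an assumption:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("survey.csv")  # hypothetical dataset
numeric = df.select_dtypes(include="number")

m = 5           # number of imputed datasets
estimates = []  # one analysis result per completed dataset
for seed in range(m):
    # Impute: sample_posterior=True draws each imputation from a predictive
    # distribution, so every run yields a different plausible completed dataset.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)
    # Analyze: here the "analysis" is simply the mean income.
    estimates.append(completed["income"].mean())

# Pool: combine the m results into one estimate plus a spread that reflects
# the extra uncertainty introduced by the missing data.
print("Pooled estimate:", np.mean(estimates), "+/-", np.std(estimates, ddof=1))
```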
- Outlier treatment – In simpler words, outliers are data values that differ greatly from the majority of a dataset. We always need to be on the lookout for outliers: sometimes they are caused by error, and sometimes their presence indicates a previously unknown phenomenon. For example, if you had Pinocchio in a class of children, the length of his nose compared to the other children would be an outlier.
The various sources of outliers include:
- Human error (e.g. errors in data entry or data collection)
- Participants intentionally reporting incorrect data (most common in self-reported measures and measures that involve sensitive data, e.g. teens under-reporting the amount of alcohol they use on a survey)
- Measurement error: this is caused when the measurement instrument used turns out to be faulty.
- Sampling error (e.g. including high school basketball players in the sample even though the research study was only supposed to be about high school track runners)
Now, why is it important to detect these outliers? Because outliers can distort summary statistics such as the mean, inflate the variance, and bias model estimates. The easiest way to detect an outlier is by creating a graph: we can spot outliers using histograms, scatter plots, and box plots, or numerically using the inter-quartile range (IQR), as in the sketch below.
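A minimal detection sketch combining a box plot with the common 1.5 × IQR rule (the income column is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Visual check: a box plot makes extreme values stand out immediately.
df.boxplot(column="income")
plt.show()

# Numeric check: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```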
The next step after identifying outliers is to treat them. There are various techniques used by analysts to treat outliers. Some of them are:
- Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the outlier observations are very few in number. We can also trim at both ends of the distribution to remove outliers.
- Transforming and binning values: Transforming variables can also reduce the impact of outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; decision tree algorithms deal with outliers well because they bin variables. A sketch of both follows.
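A minimal sketch of a log transform and quartile binning, with assumed column names:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Log transform: compresses the right tail and so reduces the variation
# caused by extreme values (log1p handles zeros safely).
df["log_income"] = np.log1p(df["income"])

# Binning: quartile-based categories; extreme values simply fall into
# the top (or bottom) bin instead of dominating the scale.
df["income_bin"] = pd.qcut(df["income"], q=4, labels=["low", "mid", "high", "top"])
```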
- Imputing: This method is similar to the one we use to treat missing values: we can use mean, median, or mode imputation. Before imputing values, we should analyze whether the outlier is natural or artificial; if it is artificial, we can go ahead and impute. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values. A sketch follows.
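A minimal sketch that flags outliers with the same 1.5 × IQR rule used above and imputes them with the median:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# Treat artificial outliers like missing values: replace them with the median.
df.loc[is_outlier, "income"] = df["income"].median()
```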