Predictive Modeling Forensics - Finding Data Problems

Ravi Saxena

Published Oct 29, 2015

The Data Understanding stage of a predictive analytic project is intended to uncover the characteristics of the data available for predictive modeling. One key part of Data Understanding is what we might call a Data Audit, where every field is summarized. One purpose of a data audit is to uncover potential problems with the data that should be corrected during data preparation.

Why should a modeler take such care in summarizing and investigating the data? And why does the process of Data Understanding and Data Preparation take so much time? One reason is that the predictive modeler is often the first person who has ever examined the data in such detail. Even if the data has been scrubbed and cleaned from a database storage perspective, it may never have been examined from a predictive modeling perspective (and these are different). Some values may be “clean” in a technical sense because the values are populated correctly and accurately according to the data collection specification. However, they may not communicate the right story informationally.

Questions about Your Data

Answering the questions below during the data audit will surface most of the problems that need to be corrected during data preparation:

How many missing values are there? Do any variables have mostly or all missing values?
Are there strange minimum or maximum values?
Are there strange mean values or large differences between mean and median (indicating skew in the distribution)?
Is there large skew or excess kurtosis? (This matters for algorithms that assume normal distributions in the data.)
Are there gaps or holes in the distributions, such as bi-modal or multi-modal distributions?
Are there any values in the categorical variables that don’t match the dictionary of valid values?
Are there any high-cardinality categorical variables?
Are there any categorical variables with large percentages of records having a single value?
Are there any input variables with unusually strong relationships with the target variable, possibly indicating leakage of the target into a candidate input variable?
Are any variables highly correlated with each other, possibly indicating redundant variables?
Are there any crosstabs that show strong relationships between categorical variables, possibly indicating redundant variables?
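
Many of these questions can be answered with a few lines of code. What follows is a minimal sketch in Python using pandas, not a definitive audit tool. The function names (audit_fields, high_correlations, target_correlations), the valid_values dictionary, and the 0.9 threshold are assumptions made for illustration. The sketch also assumes the data fits in a single DataFrame and that the target is numeric (for example, coded 0/1).

```python
from typing import Optional

import numpy as np
import pandas as pd


def audit_fields(df: pd.DataFrame, valid_values: Optional[dict] = None) -> pd.DataFrame:
    """Return one summary row per field: missing values, range, shape, cardinality."""
    valid_values = valid_values or {}  # hypothetical dictionary: field -> allowed values
    rows = []
    for name in df.columns:
        col = df[name]
        row = {
            "field": name,
            "pct_missing": col.isna().mean(),  # fraction of records with no value
            "n_unique": col.nunique(),         # cardinality, ignoring missing values
        }
        if pd.api.types.is_numeric_dtype(col):
            row.update(
                min=col.min(), max=col.max(),
                mean=col.mean(), median=col.median(),
                skew=col.skew(),
                excess_kurtosis=col.kurt(),    # pandas reports excess kurtosis (normal = 0)
            )
        else:
            shares = col.value_counts(normalize=True)  # shares among non-missing records
            row["top_value_share"] = shares.iloc[0] if len(shares) else np.nan
            if name in valid_values:
                present = set(col.dropna().unique())
                row["invalid_values"] = sorted(present - set(valid_values[name]))
        rows.append(row)
    return pd.DataFrame(rows).set_index("field")


def high_correlations(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List pairs of numeric fields whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # keep each pair only once
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric inputs by absolute correlation with a numeric target."""
    numeric = df.select_dtypes("number")
    inputs = numeric.drop(columns=target)
    return inputs.corrwith(numeric[target]).abs().sort_values(ascending=False)
```

Calling audit_fields(df) gives a table to scan for mostly missing fields, strange minimum or maximum values, and large gaps between mean and median. An input at the top of target_correlations(df, "target") with a correlation near 1 is worth checking for leakage, and pd.crosstab(df["a"], df["b"]) gives the crosstab for any pair of categorical fields. Gaps and multi-modal distributions are easier to spot in a histogram than in summary statistics.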
