Predictive Modeling Forensics - Finding Data Problems
The Data Understanding stage of a predictive analytics project is intended to uncover the characteristics of the data available for predictive modeling. One key part of Data Understanding is what we might call a Data Audit, in which every field is summarized. One purpose of a data audit is to uncover potential problems with the data that should be corrected during data preparation.
Why should a modeler take such care in summarizing and investigating the data? And why does the process of Data Understanding and Data Preparation take so much time? One reason is that the predictive modeler is often the first person who has ever examined the data in such detail. Even if the data has been scrubbed and cleaned from a database storage perspective, it may never have been examined from a predictive modeling perspective (and these are different). Some values may be “clean” in a technical sense because the values are populated correctly and accurately according to the data collection specification. However, they may not communicate the right story informationally.
Questions about Your Data
A data audit should answer the following questions:
- How many missing values are there? Do any variables have mostly or all missing values?
- Are there strange minimum or maximum values?
- Are there strange mean values or large differences between mean and median (indicating skew in the distribution)?
- Is there large skew or excess kurtosis? (This matters for algorithms that assume normal distributions in the data.)
- Are there gaps or holes in the distributions, such as bi-modal or multi-modal distributions?
- Are there any values in the categorical variables that don’t match the dictionary of valid values?
- Are there any high-cardinality categorical variables?
- Are there any categorical variables with large percentages of records having a single value?
- Are there any input variables with unusually strong relationships with the target variable, possibly indicating leakage of the target into a candidate input variable?
- Are any variables highly correlated with each other, possibly indicating redundant variables?
- Are there any crosstabs that show strong relationships between categorical variables, possibly indicating redundant variables?
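Several of the questions above can be answered with a few lines of code per field. The following is a minimal sketch in Python using only the standard library, not a definitive audit tool. The function names (audit_numeric, audit_categorical, pearson), the use of None to mark a missing value, and the sample values are all assumptions made for illustration. It also assumes each field has at least one non-missing value.

```python
import math
import statistics
from collections import Counter


def audit_numeric(values):
    """Summarize one numeric field. None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    # Population skewness and excess kurtosis; both are 0 for a normal distribution.
    if sd > 0:
        skew = sum((v - mean) ** 3 for v in present) / (n * sd ** 3)
        kurt = sum((v - mean) ** 4 for v in present) / (n * sd ** 4) - 3.0
    else:
        skew = kurt = 0.0
    return {
        "missing_pct": 100.0 * (len(values) - n) / len(values),
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "median": statistics.median(present),
        "skew": skew,
        "excess_kurtosis": kurt,
    }


def audit_categorical(values, valid_values=None):
    """Summarize one categorical field against an optional dictionary of valid values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0]
    invalid = []
    if valid_values is not None:
        invalid = sorted(set(present) - set(valid_values))
    return {
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(counts),                       # number of distinct values
        "top_value": top_value,
        "top_pct": 100.0 * top_count / len(present),      # share of the most common value
        "invalid_values": invalid,
    }


def pearson(xs, ys):
    """Pearson correlation over records where both fields are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = statistics.fmean(x for x, _ in pairs)
    my = statistics.fmean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant field has no defined correlation; report 0
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample fields, invented for illustration only.
income = [42, 38, None, 45, 41, 900, 39, 44]
state = ["CA", "CA", "ca", "NY", None, "CA", "TX", "CA"]

print(audit_numeric(income))
print(audit_categorical(state, valid_values={"CA", "NY", "TX"}))
```

In this sketch, a mean far from the median or a large skew points to a field worth a closer look, and invalid_values lists entries missing from the dictionary of valid values. Applying pearson to pairs of inputs gives a rough check for redundant variables, and applying it to an input and the target gives a rough check for leakage. What counts as "too high" depends on the data and is left to the modeler.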