Data != Fact (and what to do next)

Data is tricky stuff. It may look like reliable fact, but subtle differences in the lineage of that data can make an enormous difference to what you can infer from it. This note examines the main steps in the data’s journey and then suggests some things you can do to be confident that your analysis makes sense. In each case you need to think about what you are trying to get the data to say, and whether that is valid given the way you know it has evolved.

Creation

  • Why was the data created?
  • What did the creator want to use it for?
  • If the data was created by a number of different people, did they all have the same understanding?
  • Could they have understood that the data would be used in the way you are intending? Is your intended use case compatible with the original intention?
  • Was the data created once, or was it created periodically?

Recording

  • How was the data recorded or digitised?
  • If it was recorded manually, did all those doing it share the same incentive to record it correctly and conscientiously? Think about exactly what they did to record it. Were there distractions?
  • Was any data discarded when the data was recorded? Was this censoring random or was there a systematic bias?

Transformation and storage

  • Did the data need to be transformed to fit into the schema in the data warehouse?
  • Has the data been updated, appended or overwritten since it was recorded?
  • Are you using the most up-to-date data?
  • Are you looking at the best information as it is known now, or the best information as it was known at the time it happened?
  • Has the data been aggregated, smoothed or binned to convert data at a point in time to data over a period?

Extraction

  • When you get the data you make assumptions about how it should be fitted back together.
  • Are you clear on which keys the data should be joined?
  • Are you clear that the date column you are using is the right one?
  • When there is a Null cell or no match, does this mean that there is no data, that the data wasn’t recorded, or that data doesn’t make sense in this instance?
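The three possibilities in that last question are easy to conflate in practice. Here is a minimal sketch in plain Python (the table and field names are hypothetical) showing how a left join can produce three distinct outcomes that all look "empty" unless you separate them deliberately:

```python
# Hypothetical orders and customers tables, joined on customer_id.
orders = [
    {"order_id": 1, "customer_id": "A"},
    {"order_id": 2, "customer_id": "B"},
    {"order_id": 3, "customer_id": "C"},
]
# Customer B exists but their region was never captured (a recorded null);
# customer C is missing from the customers table entirely (no match).
customers = {
    "A": {"region": "North"},
    "B": {"region": None},
}

results = {}
for order in orders:
    customer = customers.get(order["customer_id"])
    if customer is None:
        results[order["order_id"]] = "no match"          # key absent from the lookup table
    elif customer["region"] is None:
        results[order["order_id"]] = "recorded as null"  # row exists, value never captured
    else:
        results[order["order_id"]] = customer["region"]  # a real value
```

If your tooling collapses "no match" and "recorded as null" into a single NULL, you lose exactly the distinction the question above is asking about.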

Analytics

Obviously any analytics you do on the data is steeped in assumptions and bias; that is just the nature of analytics. Even innocuous things like adding up a column can determine what inference you can draw from a number.

Given all this, what can you do?

The best first step in data validation is to speak to the original creator of the data and to whoever creates it now. If you don’t have access to them, speak to a current user of the data: someone who uses it to provide information to management decision makers. Generally less useful is the person administering the data, although they may have some insight into how the dataset is put together.

From this you will be able to understand the lineage of your data back to the source system, and also what automated testing is already in place to verify the quality of the data.

You should always do some exploratory analysis on the data and sense-check it with business experts. Often it is helpful to visualise it, or to calculate metrics and ratios, so that they can tell you if the information looks sensible to them. It can be useful to present them with the max-min ranges of the numerical fields, and to ask them if there are any features in the data that they would expect to observe (eg field x is seldom more than twice field y) so you can check and see if they are there.
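A check like that can be sketched in a few lines of plain Python. The field names and the rule of thumb itself are hypothetical; in practice the rule would come from your business experts:

```python
# Hypothetical extract: each record has an order count and a unit count.
records = [
    {"orders": 10, "units": 18},
    {"orders": 7,  "units": 12},
    {"orders": 4,  "units": 30},   # breaks the expected relationship
]

# Min-max ranges per numeric field, to show the experts.
fields = ["orders", "units"]
ranges = {
    f: (min(r[f] for r in records), max(r[f] for r in records))
    for f in fields
}

# Expert rule of thumb: "units is seldom more than twice orders".
# Flag the rows that violate it for review rather than silently dropping them.
violations = [r for r in records if r["units"] > 2 * r["orders"]]
```

The point is not the code but the workflow: summarise, show the summary to someone who knows the business, encode their expectations as checks, and investigate the exceptions.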

One of the biggest sources of error in data analysis is choosing the wrong date column. Often a data table has multiple date fields for every record, and even if they are labelled it can take quite a bit of thinking to work out which to use as your time stamp. You might also want to use a different time stamp depending upon what question you are trying to answer.
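A small sketch of the idea, with hypothetical field names: the same sale record carries several dates, and which one is the right time stamp depends on the question being asked.

```python
from datetime import date

# One sale, three candidate time stamps.
sale = {
    "order_date":   date(2024, 3, 28),  # when the customer ordered (demand)
    "ship_date":    date(2024, 4, 2),   # when it left the warehouse (logistics)
    "invoice_date": date(2024, 4, 5),   # when it was billed (revenue)
}

def timestamp_for(question: str) -> date:
    # Map each analytical question to the date field that answers it.
    column = {
        "demand":    "order_date",
        "logistics": "ship_date",
        "revenue":   "invoice_date",
    }[question]
    return sale[column]
```

Note that in this example a March demand analysis and an April revenue analysis would both be correct, from the same record, for different questions.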

It is worth looking at your data over time, across regions and across products to ensure that it is consistent. Perhaps different regions record the point at which a sale has happened differently, or different subsidiaries make different assumptions about inventory.

Make sure you can reconcile your data or analytics back to a trusted data source. Do the sales volumes from your data reconcile to the volumes in the accounts? Is the number of records in your analysis equal to the number of records in the source data?
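Reconciliation checks like these are cheap to automate. A minimal sketch in plain Python, assuming hypothetical trusted totals taken from the source system and the published accounts:

```python
# Hypothetical extract of sales records.
extract = [
    {"id": 1, "sales_volume": 120.0},
    {"id": 2, "sales_volume": 80.0},
    {"id": 3, "sales_volume": 50.0},
]

trusted_record_count = 3      # row count reported by the source system
trusted_sales_volume = 250.0  # total volume in the published accounts

# Fail loudly before any analysis if the extract does not reconcile.
assert len(extract) == trusted_record_count, "records lost or duplicated in extraction"
assert abs(sum(r["sales_volume"] for r in extract) - trusted_sales_volume) < 1e-6, \
    "extract does not reconcile to the accounts"
```

Running checks like this at the start of every analysis means a broken extract stops the pipeline rather than silently biasing the conclusion.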

Is the analysis you are doing statistically valid for the inference you want to make? Have you got enough data points, not just overall but also in each bin? As you increase the specificity of the bins (eg by region, by product, by month) you dramatically reduce the amount of data from which you have to reach a conclusion. What is the variability in the data? Sometimes you want variability, sometimes you don’t. How sensitive is your conclusion to the time window that you have chosen for the analysis? Where you have calculated an average, be careful to choose the one that genuinely supports your argument: median, mean, mode, max and min each tell a different story about your data.
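Two of those points can be made concrete in a few lines. This sketch (with made-up data) shows how slicing by more dimensions shrinks each bin, and how a single outlier makes the mean and the median tell very different stories:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sales records.
sales = [
    {"region": "N", "product": "a", "value": 10},
    {"region": "N", "product": "a", "value": 12},
    {"region": "N", "product": "b", "value": 11},
    {"region": "S", "product": "a", "value": 200},  # one large outlier
]

# Bin sizes collapse as specificity increases: 3-4 points per region,
# but as few as 1 point per (region, product) pair.
by_region = Counter(r["region"] for r in sales)
by_region_product = Counter((r["region"], r["product"]) for r in sales)

# The outlier drags the mean far above the median.
values = [r["value"] for r in sales]
avg_mean = mean(values)      # 58.25
avg_median = median(values)  # 11.5
```

A conclusion drawn from the one-record (S, a) bin, or quoted using the mean alone, would be technically computed from the data yet tell a misleading story.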

Finally, do the analytics you have used truly answer the question you asked, even assuming that question was the right one in the first place?
