The Integrity of Data

What's the biggest financial impact you could have on your business with data? You might be surprised at how simple it is.

I started to write another article about data, but the error-checking, reconciliation, definition and documentation side of it took over, so I decided to write about "data integrity" instead.

This is important; IBM have estimated the cost of bad data to the US economy at around $3.1 trillion a year.

Let's be clear: the opportunity in Big Data is not in building clever models or analysis (all worthless if the data is wrong); it's in getting the data right. It's in what I like to call data integrity.

Everyone is involved in big data projects these days, but how much of this is just shoving any old data into a central warehouse and calling it a Lake, with minimal tick-box governance? Governance around data shouldn't be a tick-box exercise: it isn't just essential, it is often the actual hard part, the part that really needs thinking about.

And the thing is, it doesn't matter whether your data is technically "structured" or "unstructured": it still needs to be organised. Whether it sits in SQL or NoSQL, it still needs to be well defined. All of it has to be checked and well maintained.

Error checking

How do you know if data is correct? I'm not sure you can ever be certain, but there are ways of increasing confidence:

  • Frequency analysis: do you have a lot of dates of birth on one particular date (e.g. 1st January 2001)? That could be a system default for missing values, or some other issue. A frequency chart is one way to spot this; there are plenty of others (a short sketch of this kind of check follows the list).
  • Using data from the master system, or host system, as close to the original source as possible ("Golden Source").
  • Checking the data doesn't have impermissible values - like a date of birth of 1800 or even 2020!
  • Formats and units need to be checked: America uses a different date format from the rest of the world, most of Europe swaps the comma and the dot relative to the UK when writing numbers, and so on.
  • Reconciling from different master sources (just make sure you're not reconciling against copies of the same master source).
  • Checking a sample of actual accounts, looking at the data from manual sources (e.g. applications) and checking it matches.
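
To make the frequency and permissible-value checks concrete, here is a minimal sketch in Python with pandas; the file name and column names are assumptions for illustration, not references to any real system.

    import pandas as pd

    # Hypothetical extract of an accounts table ("accounts.csv" and
    # "date_of_birth" are illustrative names).
    accounts = pd.read_csv("accounts.csv", parse_dates=["date_of_birth"])

    # Frequency analysis: a big spike on one date (e.g. 1st January 2001)
    # often indicates a system default rather than real data.
    dob_counts = accounts["date_of_birth"].value_counts()
    print(dob_counts[dob_counts > 10 * dob_counts.median()])

    # Impermissible values: dates of birth outside a plausible range.
    too_old = accounts[accounts["date_of_birth"] < "1900-01-01"]
    in_future = accounts[accounts["date_of_birth"] > pd.Timestamp.today()]
    print(f"{len(too_old)} records before 1900, {len(in_future)} in the future")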

If there is any data manipulation or ETL process involved, the normal checks around code apply, especially: reviewing logs, reconciling input with output (both record counts and account-level values), good documentation, and second-analyst checks.
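
As a hedged illustration of that input/output reconciliation (the file names and the account_id and balance columns are assumptions):

    import pandas as pd

    source = pd.read_csv("source_extract.csv")
    loaded = pd.read_csv("warehouse_load.csv")

    # Batch level: the same number of accounts should come out as went in.
    assert len(source) == len(loaded), \
        f"Row count mismatch: {len(source)} in, {len(loaded)} out"

    # Account level: every account present on both sides, no more, no less.
    one_side_only = set(source["account_id"]) ^ set(loaded["account_id"])
    assert not one_side_only, \
        f"Accounts on one side only: {sorted(one_side_only)[:10]}"

    # Control totals: key values should reconcile to the penny.
    assert abs(source["balance"].sum() - loaded["balance"].sum()) < 0.01, \
        "Balance control total does not reconcile"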

Definition

What is the actual definition of the data you are using? This matters: an internal system interest rate is not necessarily the same as the customer's APR. Neither is "incorrect"; it depends on what you need it for. When doing analysis you need to make sure you are using the right data. When building a data warehouse, it needs to be clearly labelled and defined.

Documentation

All this leads to documentation and metadata. The data needs to be well structured to begin with, but it also needs really, really good documentation. The documentation should explain what the permissible values are and what format the data is in, flag any known inaccuracies or errors, give a clear definition, and explain how the data was sourced. If there are periods or dates for which the data is known to be inaccurate, it should say so clearly.
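
As a sketch of what a single data-dictionary entry might capture (the field names and the example known issue are purely illustrative, not a standard):

    # One illustrative data-dictionary entry, expressed as a Python dict.
    date_of_birth_entry = {
        "name": "date_of_birth",
        "definition": "Customer's date of birth as captured at application",
        "source": "Application system, CUSTOMER table, DOB field",
        "format": "DATE (YYYY-MM-DD)",
        "permissible_values": "1900-01-01 to the current date",
        "known_issues": "Missing values defaulted to 2001-01-01 in old loads",
    }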

Maintenance

The second law of thermodynamics says that entropy [i.e. disorder] always increases. This rule applies beyond physics: if you don't maintain your garden, weeds appear, the lawn grows, and it becomes a mess.

This rule also applies to data warehouses (or lakes, if you prefer the term). Left alone, the data coming in will increasingly contain errors, carry changed definitions, or go missing altogether. Systems change (and their data with them), people redefine fields to suit their new project, some systems go offline and take their data with them, and manually sourced data is affected by changes to operational processes.

Just like a garden, the answer is to maintain it: repeating the original checks regularly, putting good change management in place across the business, ensuring the data documentation is reviewed as part of every change to see whether it is affected, and adding simple batch-level controls to detect any big increase or decrease in volumes (a minimal sketch of such a control follows). All of this helps ensure the hard-won data integrity is preserved.
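
Assuming a load-history file with one row per batch, such a control might look like this (the 20% threshold and all file and column names are assumptions to tune per feed):

    import pandas as pd

    history = pd.read_csv("load_history.csv", parse_dates=["load_date"])

    latest = history["record_count"].iloc[-1]
    # Baseline: median volume of the previous 30 batches.
    baseline = history["record_count"].iloc[:-1].tail(30).median()

    change = (latest - baseline) / baseline
    if abs(change) > 0.20:
        print(f"ALERT: batch volume moved {change:+.0%} vs the 30-batch median")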

This should be a top priority for any company. Data is important, and companies with a good, deep history of the relevant data will be able to leverage that knowledge. I would suggest this shouldn't just be a "must do" within the data strategy; until it is completely embedded, it should be the data strategy.

(Attached: a nice practical summary of the core elements and principles of BCBS 239.)
