With Data Small is Beautiful
A big lake in the distance with a puddle close by. Photo by John Silliman (@john.silliman.)

Big data has arrived and it's here to stay. With ever-cheaper storage and processing power comes the ability to save more and more data. Companies rightly try to centralise this data in a "Lake", creating a verified source for all analytics projects.

Of course, you never know what data you might need, so the temptation is to store all of it. The trouble is that a little of the right data is more useful than a lot of the wrong data, or even a lot of data of which only a small amount is what you need. The expression "needle in a haystack" is as relevant today as ever.

Google, one of the most successful companies this century, built its business on taking a lot of data and turning it into a small amount of useful data. By delivering a page with ten search results out of a web of over a billion sites, they literally find ten tiny needles in the biggest haystack imaginable. Well, OK, not all the time, but often enough to be useful.

I've heard companies boast of using hundreds of variables in a decision model as if that were a good thing. Well, guess again - it isn't. After the first few most predictive variables, the additional ones likely aren't adding value, either because they are worthless or because they correlate with variables already in the model. What they do add is complexity and processing time, and they reduce transparency.
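To make the point about correlated variables concrete, here is a minimal sketch of one way to prune them. It assumes the candidate variables are already listed from most to least predictive, and it uses a cut-off of 0.9 chosen purely for illustration. The names `pearson`, `prune_correlated` and the sample columns are hypothetical, not taken from any real model.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; report 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def prune_correlated(columns, threshold=0.9):
    """Keep a variable only if it is not strongly correlated with one already kept.

    `columns` maps variable name -> list of values, assumed to be ordered
    from most to least predictive (an assumption of this sketch).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up illustrative data: monthly income is almost annual income / 12.
columns = {
    "income": [20, 35, 50, 65, 80],
    "income_monthly": [1.7, 2.9, 4.2, 5.4, 6.7],
    "age": [45, 23, 61, 30, 52],
}
kept = prune_correlated(columns)
print(kept)
```

In this toy data the monthly income column is dropped because it says almost nothing that the annual figure doesn't already say. A real project would also want to test whether each surviving variable actually improves the model.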

Of course, the 10 useful variables for one model might not be the same as the ones you need for another model. Fair enough: lots of departments each needing a little bit of different data can soon add up (though I would expect a lot of commonality in their data needs). You also might need data outside of modelling - try telling the regulator you deleted data they wanted!

Finally, requirements and outcomes change, and it is useful to be able to model historically (imagine being able to model all of the recessions in the last hundred years, not just the last one).

So, assuming you've taken care to verify that the data you are putting into the lake is correct; have put in batch and record level controls to assure it will continue to be correct; have clearly defined in a data dictionary every field, how it is calculated and the range of possible values; have included additional metadata allowing users to find the relevant information; have reconciled information from different sources; have established processes to ensure changes to the source data don't break the lake downstream; and have done all the other many things you need to do to have a usable data lake, then it will be an absolute dream to use. If you're just shoving data willy-nilly into some database with minimal check-box governance, then it will be a nightmare.
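As a rough illustration of what record level and batch level controls driven by a data dictionary might look like, here is a minimal sketch. The field names, ranges, control totals and the functions `check_record` and `check_batch` are all hypothetical, invented for this example rather than drawn from any particular system.

```python
# Hypothetical data dictionary: each field with its type and allowed range.
DATA_DICTIONARY = {
    "account_id": {"type": str},
    "balance": {"type": float, "min": 0.0, "max": 1_000_000.0},
    "days_past_due": {"type": int, "min": 0, "max": 3650},
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Record level control: return a list of problems found in one record."""
    errors = []
    for field, rule in dictionary.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above {rule['max']}")
    return errors


def check_batch(records, expected_count, expected_balance_total, tolerance=0.01):
    """Batch level control: reconcile row count and a control total
    against figures supplied by the source system."""
    errors = []
    if len(records) != expected_count:
        errors.append(f"count: got {len(records)}, expected {expected_count}")
    total = sum(r.get("balance", 0.0) for r in records)
    if abs(total - expected_balance_total) > tolerance:
        errors.append(f"balance total: got {total}, expected {expected_balance_total}")
    return errors


# Made-up batch: the second record has a balance outside the defined range.
records = [
    {"account_id": "A1", "balance": 120.50, "days_past_due": 0},
    {"account_id": "A2", "balance": -5.0, "days_past_due": 30},
]
record_errors = {r["account_id"]: check_record(r) for r in records}
batch_errors = check_batch(records, expected_count=2, expected_balance_total=115.50)
print(record_errors)
print(batch_errors)
```

Note that the batch reconciles perfectly here while one record is still wrong, which is why you want both kinds of control rather than one or the other.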

So here's my suggestion: when you build your Big Data Lake (which is going to happen come what may), don't forget to keep a small data puddle with the data you actually, really need. Take real care to keep it organised, well defined, clear and well maintained. Good documentation for the small amount of data you normally use, with decent error checking and reconciliation, can save a lot of pain down the road. If you start now, you'll probably finish the puddle before you complete the lake. Of course, if you take real care with the data puddle, you might find it so useful that you decide to replicate some of its governance for the Lake and end up with a more useful lake as well. More importantly, though, you will have the data that matters most in one place.

By Jason McKee