Data Warehouse Evolution: Data Vault - Principles for Full Data Traceability
It's shocking how much is spent on data warehouses simply to manage change.
If data is meant to be 'The New Oil', then analytical data warehouses are meant to be the oil drilling rigs, tapping the far reaches of the known enterprise, mining for insights to improve, optimise and innovate. So why, on average, are we not seeing an explosion of hyper-connectivity of data from forward-thinking businesses, eager to tease out known (and unknown) unknowns?
In my opinion, one word: Agility.
Data architecture fashion hasn't moved on much, because the data industry is still considered a bit of a dark art, existing somewhere in the basement where sun-starved acolytes are led by small teams of high priests, directing the life-blood of the organisation in mysterious ceremonies to update business systems.
It really is no big mystery; we've just allowed it to become that way.
Agile Data Warehouses have been around for a few decades now, and are business led. The new fashion in the data architecture world is to model data around business concepts first, as this is more intuitive to everyone and allows for some very simple magic to occur: separation of concerns.
A data pioneer named Dan Linstedt found himself in a position to rethink the way data is managed, out of necessity. He had very large amounts of very quickly changing data to manage on behalf of large American government organisations, and had to redesign the entire approach from the ground up.
The result? Data Vault.
The concept is simple. Business defines the things that it is interested in, which are generally well-known and much-discussed concepts. These concepts are usually central to the operation of a business, like 'product', 'customer', 'incident', 'request'... you get the picture. All of these business concepts are so important that they are usually given reference numbers so they can be referred to easily. Instead of being referred to as customer 'David', I get a business reference. This becomes the starting point for any data collected about me.
In the same way, 'Hubs' are collections of business references. My customer ID is stored with your customer ID, and together, this 'list' of IDs can reference all the different information that is held about you and me. Have a new system? If it contains a business reference, just point to that reference in the Hub. Presto, Agility.
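To make the Hub idea a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a reference implementation: the `Hub` class, its `register` method, the customer IDs and the source names are all made up for this example, and keeping a first-seen timestamp and source alongside each key is an assumption on my part.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Hub:
    """Hypothetical Hub: nothing more than a list of unique business keys."""
    name: str
    # business key -> (first-seen timestamp, source that first supplied it)
    keys: dict = field(default_factory=dict)

    def register(self, business_key: str, source: str) -> bool:
        """Record a key if it is new; return True only when it was added."""
        if business_key in self.keys:
            return False
        self.keys[business_key] = (datetime.now(timezone.utc), source)
        return True


customer_hub = Hub("customer")
customer_hub.register("CUST-001", source="crm")  # made-up IDs and sources
customer_hub.register("CUST-002", source="crm")

# A new system arrives. It already knows CUST-001, so nothing is duplicated:
# the new system simply points at the key the Hub already holds.
added = customer_hub.register("CUST-001", source="billing")
```

The point of the sketch is that the Hub never changes shape when a new system turns up. It only ever gains keys it has not seen before.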
Now back to why this change of view is important.
As in the image I attached to this article, above, the old method of managing data for analysis fits the data around something called a 'fact', with 'dimensions' surrounding it. This isn't a bad method, but it was designed for stability and simplicity more than agility. Facts and dimensions are very useful for business intelligence bods, like me - and also for tool standardisation - when querying information, but not for managing data. That was an afterthought. In the diagram on the left, you can get a picture (literally) of the issues of adding several different sources that have totally different data structures, quality, breadth and depth of data. And then you have to manage change! Crazy... it's a long and laborious job to create stable versions that are consistent across reporting periods. Well, you probably know that, because it is the general experience today.
Not with Data Vault. The separation of concerns, led by business, means that the structure for organising data is separate from the structure used to collect it. As in the diagram on the right, you can see that the exact same sources are simply separated and managed by their business references (we like to call them Business Keys) within Hubs. The data from each business source is attached to the Hub in something we like to call a Satellite (they sort of revolve around the Business Key... it's a good analogy anyway).
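Here is a rough sketch of that Hub-and-Satellite layout, using Python's built-in sqlite3 module so it runs anywhere. The table names, column names and sample rows are all invented for illustration. I am also assuming that a Satellite records a change by adding a new row with a load date rather than overwriting the old one, which is what gives you the traceability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The Hub: only the Business Keys.
CREATE TABLE hub_customer (
    customer_key TEXT PRIMARY KEY
);
-- One Satellite per source, hanging off the Hub by Business Key.
CREATE TABLE sat_customer_crm (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    load_date    TEXT,
    name         TEXT,
    PRIMARY KEY (customer_key, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES ('CUST-001')")
conn.execute("INSERT INTO sat_customer_crm VALUES ('CUST-001', '2018-01-01', 'David')")
# A change in the source becomes a new row; the old one stays (assumption).
conn.execute("INSERT INTO sat_customer_crm VALUES ('CUST-001', '2018-02-01', 'David B.')")

# A new source system arrives: add a new Satellite. No existing table changes.
conn.execute("""
CREATE TABLE sat_customer_billing (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    load_date    TEXT,
    balance      REAL,
    PRIMARY KEY (customer_key, load_date)
)
""")
conn.execute("INSERT INTO sat_customer_billing VALUES ('CUST-001', '2018-02-15', 120.0)")

# The full history of what the CRM said about this customer is still there.
history = conn.execute(
    "SELECT name FROM sat_customer_crm "
    "WHERE customer_key = 'CUST-001' ORDER BY load_date"
).fetchall()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
```

Notice that onboarding the billing system was a single new table. The CRM Satellite and the Hub were left exactly as they were.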
The diagram above is of course not a complete picture. There are all sorts of ways to take this basic structure and manipulate it with simple queries until you shape the data how you want it... even making it into the image on the left again, but this time agile and dynamic! A view shaped this way picks up any new data loaded into the satellites it is interested in, but it never sees the satellites it is NOT interested in. The structure guarantees it.
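As a hedged example of one of those 'simple queries', the sketch below builds a dimension-style view over a Hub and one Satellite, again with Python's sqlite3 module. Everything named here (`dim_customer`, the tables, the columns, the sample rows) is hypothetical, and picking the latest row by load date is just one possible way to define 'current'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY);
CREATE TABLE sat_customer_crm (
    customer_key TEXT, load_date TEXT, name TEXT, city TEXT);
CREATE TABLE sat_customer_billing (
    customer_key TEXT, load_date TEXT, balance REAL);

INSERT INTO hub_customer VALUES ('CUST-001');
INSERT INTO sat_customer_crm VALUES ('CUST-001', '2018-01-01', 'David', 'Brussels');
INSERT INTO sat_customer_billing VALUES ('CUST-001', '2018-01-01', 120.0);

-- Dimension-style view: the latest CRM row for each Business Key.
-- It only joins the CRM Satellite, so billing data can never leak in.
CREATE VIEW dim_customer AS
SELECT h.customer_key, s.name, s.city
FROM hub_customer AS h
JOIN sat_customer_crm AS s ON s.customer_key = h.customer_key
WHERE s.load_date = (
    SELECT MAX(s2.load_date)
    FROM sat_customer_crm AS s2
    WHERE s2.customer_key = h.customer_key
);
""")

before = conn.execute("SELECT * FROM dim_customer").fetchall()

# New data lands in the CRM Satellite; the view follows without being rebuilt.
conn.execute(
    "INSERT INTO sat_customer_crm VALUES ('CUST-001', '2018-03-01', 'David', 'Ghent')"
)
after = conn.execute("SELECT * FROM dim_customer").fetchall()
```

The view is just a query over the vault, so it stays as stable as an old-style dimension for the people reporting from it, while the data underneath keeps moving.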
Anyway, once you take the step to organise your data in a more flexible and agile way, you can focus your energy on making it useful. The way that Data Vault data is structured makes it a bit of a 'data pump', providing access to anything you hook into it (hint: even IoT amounts of data!). This is possible in any old relational database you have, or even in the newer data storage technologies... the Data Vault is simply a disciplined method, not some whizzy new installation. Once you 'get it' you will want to use it everywhere!
If you are interested in learning more and happen to be in Brussels, why don't you come along to one of our Meetups? If you want to have a chat about how Data Vault can enable you to skyrocket your data into more useful activities like #GDPR, #MasterData, #LocationIntelligence, #AgileDashboards in the #AgileDataWarehouse and even organising #DigitalTransformation, then I hope you will visit us @dFakto (www.dfakto.com) to discuss.
For us, data is beautiful and useful.