Agile Data Modelling with Data Vaults

One of the most common challenges faced when implementing Agile methodologies in a BI environment was data modelling.

      Q) Are your BI projects affected because data modelling cannot adapt to Agile methodologies and must be completed first?

      Q) Does your project team wait for the data modeler to build or extend the logical model before starting work on the ETL and the Data Warehouse?

      Q) Do you agree that the EDW needs to be continuously adapted and extended as your organization accumulates more entities, attributes, and sources (read 'Big Data')?

So the big question is: how do we keep your EDW flexible and adaptable to change, yet Agile and fast to market?

One smart way to cope with this is to build an Agile Data Vault. An Agile Data Vault builds up the Data Warehouse incrementally while retaining the characteristics of an EDW.

A Data Vault consists of three main types of entities/objects –

  1. Hub – The core object that identifies a business entity. It contains only the key columns for the entity and the date the record was entered.
  2. Satellite – Linked to a Hub, a Satellite provides the descriptive attributes of the entity.
  3. Link – A Link connects two Hubs and allows tracking of historical changes in the relationship between entities.
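The three object types above can be sketched as plain relational tables. The following is a minimal illustration in SQLite via Python; all table and column names (hub_customer, customer_hk, load_dts, and so on) are illustrative assumptions, not a prescribed Data Vault standard.

```python
import sqlite3

# In-memory database for the sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hub: only the business key plus the date the key was first recorded.
cur.execute("""
CREATE TABLE hub_customer (
    customer_hk  INTEGER PRIMARY KEY,   -- surrogate hub key
    customer_id  TEXT NOT NULL UNIQUE,  -- business key
    load_dts     TEXT NOT NULL          -- date the record was entered
)""")

# Satellite: descriptive attributes hanging off the hub,
# historized by load date.
cur.execute("""
CREATE TABLE sat_customer_details (
    customer_hk  INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    load_dts     TEXT NOT NULL,
    name         TEXT,
    city         TEXT,
    PRIMARY KEY (customer_hk, load_dts)
)""")

# A second hub, and a link relating the two hubs. The link carries its
# own load date so changes in the relationship can be tracked over time.
cur.execute("""
CREATE TABLE hub_product (
    product_hk  INTEGER PRIMARY KEY,
    product_id  TEXT NOT NULL UNIQUE,
    load_dts    TEXT NOT NULL
)""")
cur.execute("""
CREATE TABLE lnk_customer_product (
    link_hk      INTEGER PRIMARY KEY,
    customer_hk  INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    product_hk   INTEGER NOT NULL REFERENCES hub_product(product_hk),
    load_dts     TEXT NOT NULL
)""")

# Load one customer end to end: key into the hub, attributes into the satellite.
cur.execute("INSERT INTO hub_customer (customer_id, load_dts) "
            "VALUES ('C001', '2015-01-01')")
hk = cur.lastrowid
cur.execute("INSERT INTO sat_customer_details VALUES (?, '2015-01-01', 'Alice', 'Pune')",
            (hk,))
conn.commit()

print(cur.execute("SELECT customer_id, name FROM hub_customer "
                  "JOIN sat_customer_details USING (customer_hk)").fetchone())
# → ('C001', 'Alice')
```

Note the separation of concerns: the hub holds nothing but the key, so descriptive attributes and relationships can be added later without touching it.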

The Data Vault methodology fits smoothly with the Agile/Scrum way of working, allowing user stories to be delivered incrementally (one Hub-Link-Satellite set at a time).

This lets the other members of the Agile team build the ETL and semantic layers end to end, working collaboratively within a sprint to deliver business value.

                   (Example method of building a Data Vault in incremental sprints)
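The incremental, sprint-by-sprint growth described above can be sketched as follows: extending the vault is purely additive, so a new sprint creates new tables rather than altering existing ones. This is a hypothetical two-sprint scenario in SQLite; the sources (CRM, billing) and all names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Sprint 1: model just the customer hub plus one satellite from a CRM source.
cur.execute("""CREATE TABLE hub_customer (
    customer_hk INTEGER PRIMARY KEY,
    customer_id TEXT NOT NULL UNIQUE,
    load_dts    TEXT NOT NULL)""")
cur.execute("""CREATE TABLE sat_customer_crm (
    customer_hk INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    load_dts    TEXT NOT NULL,
    name        TEXT,
    PRIMARY KEY (customer_hk, load_dts))""")
cur.execute("INSERT INTO hub_customer (customer_id, load_dts) "
            "VALUES ('C001', '2015-01-01')")
cur.execute("INSERT INTO sat_customer_crm VALUES (1, '2015-01-01', 'Alice')")

# Sprint 2: a billing source arrives. The extension is a brand-new satellite
# table hung off the same hub; nothing built in sprint 1 is modified.
cur.execute("""CREATE TABLE sat_customer_billing (
    customer_hk  INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    load_dts     TEXT NOT NULL,
    credit_limit REAL,
    PRIMARY KEY (customer_hk, load_dts))""")
cur.execute("INSERT INTO sat_customer_billing VALUES (1, '2015-02-01', 5000.0)")
conn.commit()

# The hub now joins attributes from both sources.
row = cur.execute("""
    SELECT h.customer_id, c.name, b.credit_limit
    FROM hub_customer h
    JOIN sat_customer_crm     c ON c.customer_hk = h.customer_hk
    JOIN sat_customer_billing b ON b.customer_hk = h.customer_hk
""").fetchone()
print(row)  # → ('C001', 'Alice', 5000.0)
```

Because each sprint only adds objects, earlier sprints' ETL and tests keep working untouched, which is what makes the model Agile-friendly.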

Summary – Data Vaults are an effective way to build a Data Warehouse that can:

  • Be built in an Agile, incremental fashion
  • Adapt to change
  • Integrate data from multiple sources
  • Store change-tracking data

About the Author – Siddharth Rajagopal is a BI and Scrum enthusiast working at Cognizant EIM, with more than 8 years of BI experience across different layers of the BI chain. You can reach him with your comments and questions at Siddharth.Rajagopal@Cognizant.com

Data Vault (1.0)* is "agile friendly". Data Vault 2.0 mandates Agile practices, with DAD (Scott Ambler's Disciplined Agile Delivery) embedded in the architecture. You can use many different agile techniques, or, as rightly pointed out, even non-agile techniques with DV 1.0. The important thing, though, is to go all the way to the user-facing constructs in a sprint cycle where you have an objective to meet. I've written about this in detail here: https://www.garudax.id/pulse/agile-data-warehousing-using-vault-architecture-sanjay-pande
* Data Vault is now deemed Data Vault 1.0 by its author and inventor.

Holy! Agile in the context of a development methodology (OLAP or OLTP) should always be used as a noun and never as an adjective... when you use "agile" as an adjective it is a pure BS sales pitch... "Agile EDW" means as little as "beautiful EDW" or "colorful EDW"... BS! As well... the idea that only EDWs constructed via Agile projects grow incrementally is absolute nonsense. Anyone can build an EDW incrementally (and nowadays everybody does!) regardless of which methodology is adopted for the project... and if the program manager and PMO have any intellectual capacity... multiple methodologies will be used depending on the objectives and constraints... Another thing... no matter how the Data Vault is presented... there is absolutely no constraint on developing your EDW following Data Vault modeling principles with methodologies other than Agile... As a matter of fact... in my opinion... I believe that backend EDW projects are far more suitable for iterative waterfall than for Agile... I find Agile excellent for the frontend... and guess what?!... yes... you can have an iterative project where a few phases are waterfall and others are Agile sprints... including parallel iterations... I HOPE I DIDN'T BLOW ANYONE'S MIND!!!


Hi Sid, I am glad to read this interesting article from you. It's been a while since we caught up. Here is my line of thought... You have correctly picked one business challenge (faster time to market) and addressed it with a recently invented Information Management technique called "Data Vaults", justifying why Data Vaults can help us deliver agile projects. I think it is a very good attempt and line of thought, and it definitely helps some use cases and customers. A couple of thoughts when you start comparing enterprise information modeling and Data Vaults in the context of agile delivery:

1. An enterprise information model can also be delivered Agile, just as a Data Vault would.

2. A customer dimension (from your example) is also built incrementally in an industry-model-based EDW (e.g. Teradata cLDM, IBM's TDW, Oracle's OCDM) via the History Tables (Links), where Master Tables (Hubs/Sats) are also built very quickly, like Hubs. You keep connecting more and more Master Tables (Hubs) to other Master Tables as you deploy them (adding new sources of Master Data) via History Tables (Links).

3. Your example of Customer assumes a 1-1 relationship between a source-provided data set containing a customer-id and a "real" customer defined by the business rules of that particular enterprise (which you refer to as the customer dimension). In such an example, we can say that the "Customer Dimension" is indeed being built incrementally. But this is hardly the case at any medium to large enterprise. Usually, the business will define a "Customer" based on a business rule encoded either in the semantic layer or via logical-model relationships in an industry model. In Data Vault this concept is not encoded anywhere, so it is created via a Data Mart or a Semantic Layer on top of the Data Vault. In that case, of course, the customer dimension (in the SL/DM) built in Sprint 1 is not the same as after Sprint 10, meaning several sprints will add, modify, or delete components in the denormalized "Customer" dimension in the data mart/semantic layer. This invalidates the assumption that a business-usable dimension can seamlessly be created incrementally within a Data Vault via sprints independent of each other's outputs, resulting in exactly the same challenge as faced while creating a business dimension via an industry model (i.e. creating a significant enough view of a customer in the first go to avoid rework/reinvestment later in a given dimension).

4. I think the most important point is that we should not try to find an Information Management technique (Data Vaults, EDW, dimensional star schemas) that can be delivered via a given project delivery method (e.g. Agile). We should look at it the other way round: which IM approach (dimensional, enterprise industry models in 3NF, Data Vaults) and EAI technique (federation, mediation, virtualization) supports the most needs of the enterprise business and operating models, and finally, for that IM practice and EAI technique, which project delivery methods support the best results (time to market, quality, strategic direction, etc.). In my opinion this is the only way we can support business functions and enable them to grow.

5. The key difference between an industry-model-based EDW and a Data Vault driven DWH is the responsibility for creating a business-consumable information layer (semantic/presentation layer). With industry models, you already embed a lot of the business glossary, and while presenting it to the business users you create enterprise-wide common semantic/presentation layers (views/tables) using the standardized entities in the industry model you already deployed in the 3NF layer. Thus you take the time to align and agree on those business definitions to be able to do this job. In the Data Vault approach, you only build a technical layer (hopefully via metadata-driven generation techniques and not manually via developers) and let the business teams create the semantic layer themselves (hoping they have enough SQL/data-savvy people). So it is about who is doing what, and where the integration complexity sits, that defines the differences between Data Vaults and EDWs. Both approaches have different uses in my opinion and cannot replace each other in their entirety. But this is just my opinion and I am learning new stuff every day from the great people I work with (like yourself).


Hi Sidd, I agree with you that choosing the modeling method(s) depends on the scenarios and the business functions the Data Warehouse has in scope. If fast, flexible data loading and intra-business-function integration are sufficient, Data Vault beats all alternatives, even source virtualization. As you stated, a Data Vault model can be incrementally extended, because the impact of connecting a new source spreads minimally and the model covers the actual scope and nothing more. At the other end, the reporting and analytical models can also be incrementally designed in an agile way; these reporting views can be added without much impact on each other. However, the reporting views may have complex data requirements, like inter-business-function integration, complex data transformations, and calculations. The physical Data Vault model has the data sources in mind and may not be a good candidate to facilitate these reporting requirements. I always struggle with how to solve this: not modeling leaves the complexity to code, but explicit modeling may be hard to break down into increments, because the impact spread is large if that model needs to change when the second increment comes along.


Hi Sidd, A Data Warehouse is a set of functions that close the gap between registration systems and analytical systems. In practice, Data Vault addresses only two of these functions, i.e. source-driven light data integration and historizing the loaded satellite data. The source integration function needs a model, the Data Vault model, that is data-source driven and covers the main business keys and associations. Even this source modeling exercise is difficult to do incrementally, because all possible sources should be taken into account to determine which concepts are important enough to be a Hub or a Link, the skeleton for the incremental Satellites. The remaining gap between the Data Vault model and the reporting/analytical models can still be large. Closing this gap with ETL or views leads to complex code, and direct usage of the Data Vault performs badly and is confusing. Modeling an intermediate model (a Business Vault) to close this gap brings us back to very time-consuming enterprise modeling. So I agree with Srinivasan that an agile Data Vault will not deliver all Data Warehouse functions, and the remaining functions need a business model that is hard to design incrementally.

