A Big Data Reference Architecture
So we have arrived in the era of Big Data. Everyone is talking about Spark, Data Lakes, and whether we still need a Data Warehouse at all.
And there's no shortage of technology. New techniques arrive daily, with fancy names like Splunk, Hunk, Hive and Pig. But how do you design an effective Big Data Architecture?
A so-called Reference Architecture has brought order to many domains, ranging from applications to Business Intelligence. This article suggests a Reference Architecture for Big Data.
Let's start by looking at a well-known reference architecture for regular data: the Business Intelligence and Data Warehousing Architecture. It has been argued that Data Warehousing has become obsolete. I'd rather follow the strategy of embracing and enhancing what has proven useful.
In summary, a modern BI Architecture can be recognized by three main characteristics:
This BI Architecture has matured over the last 25 years and has now reached the phase of data warehouse automation. So why not take this Architecture and see how it can be extended to cater for Big Data as well?
A lot of “Big Data” experiments turn out to be “Regular Data” projects using predictive modelling. These could be built on top of any modern BI Architecture. The big difference is the creation of a (huge) Analytical Base Table, which may require more advanced data preparation than usual. Data Mining tools typically work from such a big, flat table to discover the patterns in it:
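As an illustration, an Analytical Base Table is built by denormalizing facts and dimensions into one wide row per subject. The sketch below uses Python's built-in sqlite3 as a stand-in warehouse; all table and column names are hypothetical, chosen only to show the flattening step:

```python
import sqlite3

# In-memory toy star schema: one fact table, one dimension (names are made up).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE fact_sales   (customer_id INTEGER, product_id INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'retail'), (2, 'wholesale');
    INSERT INTO fact_sales   VALUES (1, 10, 9.5), (1, 11, 4.0), (2, 10, 100.0);
""")

# Denormalize into one flat row per subject (customer): the Analytical Base Table.
abt = con.execute("""
    SELECT c.customer_id,
           c.segment,
           COUNT(*)      AS n_purchases,
           SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY c.customer_id, c.segment
    ORDER BY c.customer_id
""").fetchall()

print(abt)  # [(1, 'retail', 2, 13.5), (2, 'wholesale', 1, 100.0)]
```

A data mining tool would consume exactly this kind of one-row-per-subject table, with each column a candidate feature.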
This Architecture holds until the real “Big Data” comes in. Volume, Variety and Velocity are the reasons HDFS, Hadoop and Spark originated. So let's use these technologies in the “Staging” and “Register” areas of this Reference Architecture. The Data Lake becomes the “schema-on-read” equivalent of the “schema-on-write” Data Vault. The Big Data Reference Architecture then looks like this:
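The schema-on-read idea can be shown with plain Python: raw records land in the lake exactly as they arrive, and a schema is only imposed at read time by whichever consumer queries them. This is a minimal sketch; the field names are invented for the example:

```python
import json

# Raw events land in the "lake" untouched. Heterogeneous shapes are fine here --
# a schema-on-write store would have forced both records into one fixed structure.
lake = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "mail"}',
]

def read_clicks(raw_lines):
    """Impose a schema only while reading: pick just the fields this consumer needs."""
    for line in raw_lines:
        record = json.loads(line)
        yield record["user"], record.get("clicks", 0)

print(list(read_clicks(lake)))  # [('ann', 3), ('bob', 7)]
```

Another consumer could read the same raw lines with a different projection (say, `user` and `referrer`), which is precisely the flexibility the Data Lake side of the architecture buys.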
Not only does it show the BI and Big Data Architectures as complementary; they also share value, in the form of the BI Dimensions being presented to the Analytical environment. Conversely, JSON objects from the Data Lake can be parsed and their information added to the Data Vault.
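One way to picture that reverse flow, from JSON in the Data Lake into the Data Vault, is to split each parsed object into a hub record carrying the business key and a satellite record carrying the descriptive context. A minimal sketch, with hypothetical field names:

```python
import json

# A raw JSON object as it might sit in the Data Lake (contents are invented).
raw = '{"customer_id": "C42", "name": "Acme", "city": "Utrecht"}'
obj = json.loads(raw)

# Data Vault split: the business key goes to the hub,
# the remaining descriptive attributes go to a satellite.
hub_customer = {"customer_id": obj["customer_id"]}
sat_customer = {k: v for k, v in obj.items() if k != "customer_id"}

print(hub_customer)  # {'customer_id': 'C42'}
print(sat_customer)  # {'name': 'Acme', 'city': 'Utrecht'}
```

A real loader would also add load timestamps and record sources to the satellite, but the key/context split is the essential move.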
Technologies can now be positioned in the areas of this Reference Architecture where they are most effective. Data modelling techniques can be positioned in the appropriate areas as well. Zooming in on the core of this Architecture with respect to data modelling reveals interesting complementarity and similarity. The four areas seem complementary and complete enough to hold any data in whatever form or shape is needed for both BI and Big Data:
But there's also a similarity: the Data Vault and the Analytical Base Table are both subject-oriented structures. Who knows whether there's a direct way to feed the Analytical Base Table straight from the equally subject-oriented Data Vault?
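That direct feed could look like a join of a hub with its satellites, flattened to one row per business key, which is exactly the shape an Analytical Base Table wants. A sketch using sqlite3, with hypothetical Data Vault table names:

```python
import sqlite3

# Toy Data Vault: one hub and two satellites (names and columns are invented).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hub_customer          (customer_id TEXT PRIMARY KEY);
    CREATE TABLE sat_customer_profile  (customer_id TEXT, segment TEXT);
    CREATE TABLE sat_customer_activity (customer_id TEXT, n_orders INTEGER);
    INSERT INTO hub_customer          VALUES ('C1'), ('C2');
    INSERT INTO sat_customer_profile  VALUES ('C1', 'retail'), ('C2', 'wholesale');
    INSERT INTO sat_customer_activity VALUES ('C1', 5), ('C2', 12);
""")

# The hub anchors the join; each satellite contributes feature columns.
# The result is one row per business key: an Analytical Base Table.
abt = con.execute("""
    SELECT h.customer_id, p.segment, a.n_orders
    FROM hub_customer h
    LEFT JOIN sat_customer_profile  p ON p.customer_id = h.customer_id
    LEFT JOIN sat_customer_activity a ON a.customer_id = h.customer_id
    ORDER BY h.customer_id
""").fetchall()

print(abt)  # [('C1', 'retail', 5), ('C2', 'wholesale', 12)]
```

In a real Vault the satellites are historized, so the query would additionally pick the current (or point-in-time) version of each satellite row, but the subject-oriented join itself carries over unchanged.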
There's much more to tell about this Big Data Reference Architecture, but for the purpose of this post I've kept it as brief as possible. Let me know how it works for you in designing an effective Big Data Architecture!
Update in 2020: in terms of solutioning Big Data Architectures, I was impressed by what Snowflake has to offer. Please find my follow-up article here.
"All the world is my Staging" and that's just the way it is Martijn Imrich 😍
Marlot Schuuring
I think this blog post is a gratifying initiative to put the pieces of the Big Data puzzle into place. The current evolution in the Big Data & Analytics space is going fast and it is difficult to keep up and retain an oversight on the new technology players in the field. Our ba4all.nl conference evaluation feedback indicates more and more need for demystification.

It is my experience and belief that many companies struggle with putting together the business case for Big Data projects. How to start (small)? How to integrate with our existing BI infrastructure? I am convinced that a reference architecture is an act of good governance, for sure. But not many companies are there yet. Far from it. It will take time and it will take investments. Meanwhile they’d like to be able to jumpstart analytics initiatives to gain trust and thus more investment appetite. So their ability to quickly support those particular analytics workloads without having a reference architecture yet, will be key.

Having said that, in your model I seem to be missing cloud components, especially when you talk about Agile, and the case for the Big Data Refinery. I am convinced that the latter will fuel the next generation data architectures. I would love to debate further on this one Martijn! Perhaps on November 15th in Utrecht? http://tinyurl.com/hyrjnoc
More than 30 years ago I was responsible for the development of dealing-room systems. These systems are currently the most advanced big data implementations on earth.
Martijn Imrich Are we going back to the traditional BI ecosystem with modern hardware and separations? We are moving towards self-service, real-time data analytics, but I'm not sure the given architecture will support them, or perhaps it is not intended that way, as it will not be dynamic in nature. This always requires a team to build the models, define the dimensions, and write code to integrate new data sources.