How do you go about building a Data Warehouse?
Recently a few people have asked me (from a tech perspective) how I would go about designing a Data Warehouse (DWH). The quick answer, obviously, is 'it depends'. I tend to widen the answer, though, to say that I wouldn't start from the assumption that I needed a DWH (or any other particular technology, for that matter). First, I'd understand the capabilities the organisation has and wants (including its strategies), along with customer expectations and trends in its markets. I'd build up a logical service catalogue to support those, then figure out the technology on a per-service basis to enable those services.
Tackling all those points well, I find, requires both broad and deep up-to-date technical expertise and experience. I always keep in mind the phrase "if you only have a hammer in your toolbox, everything looks like a nail". Don't make broad assumptions or dismiss something you might lack an understanding of, and be aware there may be solutions out there you don't yet even know exist. If you are an engineer, there is no magic here: nobody is going to come knocking on your door to tell you where to look for the best new technology. You need to proactively and relentlessly invest in staying abreast of the big things happening in your profession.
So, for example, take the challenge of "personalising thousands of products to millions of customers". If all your experience was with DWHs, then you might design the solution as a monthly batch process running over an entire weekend on a DWH, and buy a single server big enough to handle your immediate needs, all the while content that it was a "safe, tried and tested solution". If you had kept right up to date with technology, then you might say "hey, we don't need to do this monthly; we can personalise offers to customers in realtime based on their immediate context (date, time, weather, news, etc), and we can scale the solution without limits". It might be a bit riskier, and the team might not yet have the right knowledge, but I like to believe that "who dares wins".
In the above contrived example, hopefully you can see that technical knowledge and experience can have a profound effect on the art-of-the-possible. Technology directly affects capabilities and functionality. Obviously those impacts need to make business sense, and dealing with that is usually more difficult than the technology itself. For instance, in the above example you might face people-change issues, because doing business in the new way impacts how people do their jobs (or whether those jobs still exist).
Anyway, back to designing a DWH. If you are looking for a short answer, then as a working assumption I wouldn't plan to build a new DWH if one didn't already exist; I'd invest in a Data Lake. And if there was a DWH in place, then I'd probably still build a Data Lake. So in this article I try to highlight the difference between the two approaches. I purposely avoid talking about any particular implementation, but I'll give examples to make it a bit more concrete.
Another important consideration is an organisation's business strategy leading into its technical strategy: how important is data? If an organisation still hasn't decided that there are commercial gains to be had by exploiting it, then any debate on technology in this area will likely be premature (sometimes you might be better off investing time in agreeing a data exploitation strategy first). It's a difficult one, because many organisation leaders will have heard of "big data" or "unlocking the value in your data", but what would that mean for their organisation? Is a data-led approach a revenue/profit opportunity? Has anybody else "tried it" in the industry? Is operational reporting "good enough for us (at this time)"? Exploiting data isn't cheap (in people, tech and change), but there might be a pot of gold there; as always, it's a balance of investment and appetite for risk. If you don't tackle these questions, then that technology in your organisation will probably become just another cost centre, when instead people should see it as a revenue generator. A smart engineer should propose a solution that provides maximum flexibility (balanced against cost) for the future, by being aware of and exploiting the latest technology.
Data warehouse (DWH)
So how do you build a DWH? Well, you need to put a lot of effort into domain modelling: analysing different data sources and producing standardised models (called schema on write) based on how you plan to extract value from that data. As data arrives in the DWH you clean it up and transform it as required. Dimensional modelling approaches (like Kimball's) play a significant role in DWH efficiency and speed. It's worth noting that DWHs typically don't contain all the data; decisions are usually made about what to include/exclude. This is largely due to limited/expensive storage and the need to keep the DWH performant, but also due to the human effort of data modelling and maintenance versus the likelihood of value in return.
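To make "schema on write" concrete, here is a minimal sketch of cleaning and conforming raw records into a star-schema shape (one fact table, two dimensions) at load time. All field and table names are hypothetical, and the stores are plain dicts standing in for warehouse tables:

```python
# Minimal schema-on-write sketch: raw records are cleaned and conformed
# into a star schema (one fact table, two dimensions) ON THE WAY IN.
# All names (fields, "tables") are illustrative assumptions.

dim_customer = {}   # customer_id -> dimension row
dim_product = {}    # product_id  -> dimension row
fact_sales = []     # conformed fact rows

def load_sale(raw):
    """Clean and transform one raw record before it enters the warehouse."""
    cust_id = raw["customer_id"].strip().upper()   # standardise keys
    prod_id = raw["sku"].strip().upper()
    dim_customer.setdefault(cust_id, {"customer_id": cust_id,
                                      "region": raw.get("region", "UNKNOWN")})
    dim_product.setdefault(prod_id, {"product_id": prod_id,
                                     "category": raw.get("category", "UNKNOWN")})
    fact_sales.append({
        "customer_id": cust_id,
        "product_id": prod_id,
        "amount_gbp": round(float(raw["amount"]), 2),  # enforce type/precision
    })

load_sale({"customer_id": " c42 ", "sku": "ab1",
           "amount": "19.991", "region": "North"})
```

The point is that the modelling decisions (keys, types, which attributes survive) are all baked in before a single query runs, which is exactly where the upfront effort, and the later pain of change, comes from.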
DWH technology is usually hosted on-premise, using expensive hardware with expensive proprietary software. This is usually a historical thing (before cloud tech arrived) and may also be due to the way in which applications interact with the DWH or the amount of data that gets moved around.
Because of the strict schema nature of a DWH, changes usually incur some pain; a good design upfront will minimise it, but avoidance is unlikely. The value is in pre-calculated Data Marts that everybody should use and not duplicate. A DWH is also usually responsible for both storage and processing. This is the primary reason why DWHs are both expensive and constrained: the two requirements must scale together.
It’s also important to consider who the user base is:
- Operations, e.g. reporting - Because the data is well structured it’s pretty easy for anybody to understand and report on. A mature query tool base exists, so if you have old applications/reports that you don’t want to touch, then a DWH might look good;
- Analysts, e.g. insight, trend analysis - With a DWH, they often resort to pulling different data sources together in other data tools and even spreadsheets, because the DWH hasn’t got all the data, processing ability or tooling they need;
- Data Scientists - Users doing deep analysis are usually the fewest in number but the heaviest resource users, due to the nature of their work. They can often be constrained (or completely held back) by DWH performance as they analyse and join together large volumes of data to produce statistical models and look for hidden correlations.
Data science, though, is a relatively new profession, so many organisations won't have a mature enough data strategy to really get the benefit from it. DWHs still suit businesses well that have not yet adopted (or do not plan to adopt) a strategy of data exploitation. Also, DWHs are well understood and it's easier (and cheaper) to hire people to work with them. This is why some organisations are reluctant to change.
If the primary data strategy is something like operational reporting, then that’s what DWHs are good at and have been/will be for some time.
So, again, how do you design a DWH? Consider the business/tech strategies, consider how to enable business capabilities with logical services, and enable those services with tech. For those services that suit DWH technology, consider how you plan to extract value from service data, consider its users, model that data, and transform it as required into the DWH. That's a gross simplification though.
Data Lake
So what’s the alternative? Prepare to be alarmed if you are a traditional DWH engineer. Don’t transform incoming data: store it in its raw format. And don’t exclude any data: store all of it forever (well, subject to law/policies). Only transform data when you need to use it (called schema on read). At first this might seem an odd strategy, since you are not creating consistent views for everybody. But that is kind of the point: everybody has different uses for that data, so if you try to create a one-size-fits-all view you get all the problems associated with making such a compromise. It’s also a bet against the future: you don’t know what analysis you might want to do, and understand less still how you might want to exploit it, so keep it all (in its original form) and you’ll have the best chance to exploit it later on.
Sounds easy, right? Just create a big hole and dump all that data in there. Care is required, though, to prevent what could be a beautiful lake where everyone wants to get involved from becoming a horrible messy swamp nobody wants to be near (comically, this is actually called a data swamp). That happens when you just dump data into storage without any regard for how it might be accessed and joined together. To protect data manageability and discoverability, although you can’t (shouldn’t) change the data from its raw format, you can add metadata to it (usually as a sidecar reference to the raw data). Also, where data takes the form of something like log files, it is important for users to be able to trace (correlate) records back to a single user request. A lot of this is true for DWHs as well, but without a schema on write, Data Lakes are much more prone to getting it wrong.
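The sidecar idea can be sketched very simply: the raw payload is stored byte-for-byte unchanged, while descriptive metadata (including a correlation id for tracing records back to a single request) lives in a separate record. The stores here are plain dicts standing in for an object store plus a metadata index, and all field names are illustrative assumptions:

```python
# Sidecar metadata sketch: raw data is never modified; metadata that
# supports discoverability and tracing lives alongside it.
import hashlib
from datetime import datetime, timezone

raw_store = {}       # object_key -> raw bytes, stored as-is
metadata_store = {}  # object_key -> sidecar metadata

def ingest(key, payload: bytes, source: str, correlation_id: str):
    raw_store[key] = payload  # no transformation of the raw data
    metadata_store[key] = {
        "source": source,
        "correlation_id": correlation_id,            # trace to one request
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

ingest("logs/2024/app-001.log", b'{"user": 7, "action": "login"}',
       source="auth-service", correlation_id="req-8f3a")
```

In a real system the metadata record is what gets indexed for search, so its quality directly determines whether you have a lake or a swamp.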
Data Lake technology is usually (public) cloud-based, built on commodity hardware and open source software. It can be as simple as an object store (e.g. AWS S3) with associated metadata stored with/next to the object or in a separate store like a NoSQL database (e.g. AWS DynamoDB); usually there is also some form of search function (e.g. Elasticsearch). You’ll also need an ingest process to ensure data gets stored correctly with the right metadata (write your own service or use a cloud provider’s reference implementation) and gets indexed for discovery. Finally, you’ll need some sort of UI to allow people to discover that data (again, provide your own or use a reference implementation). All of this technology, though, is for data storage and information management.
Data Lakes don’t do much more than act as unlimited data storage of any type (although that’s a pretty good achievement). Thus, you can’t compare a Data Lake on its own to a DWH. To do anything with that data you’ll probably need the technologies associated with Big Data, providing distributed, scalable compute/memory resources (e.g. Hadoop and Spark). This separation of storage and compute offers significant advantages, because you can scale the two separately. For example, you can use smaller resources for less time-critical processing and spin up several large clusters on-demand for individual Data Scientists. It’s also a lot cheaper to do it this way, because you only keep the compute resources around for as long as you need them; they persist their data back to the lake.
Early big data solutions (before data lakes) kept the data on a single cluster, so you ended up paying for lots of idle compute just to keep the data ready for use. Then, when people started using it, you got resource contention. So, other than unstructured data analysis and speed, these early solutions had many of the same problems found in a DWH. Modern solutions don’t have those issues.
Here are some of the technology benefits you get compared to a DWH:
- Low effort to change - Data Lakes are schema-less, so the effort to change them is usually very low;
- Power to explore - No longer bounded by rigid structure, storage or processing limits. The only limit to innovation is the data and its users;
- Cheaper storage - Per-GB storage cost is usually substantially cheaper (especially with archival options), and in public cloud the amount of storage you can have is virtually unlimited. In public cloud, costs also accrue for traffic in/out of the lake, but overall it’s an efficient way of making data storage scalability and reliability somebody else’s problem;
- Faster innovation - An even bigger benefit: because all the data is kept and easily available to all, it is much quicker/cheaper to build new applications. Applications can succeed or fail faster, so the appetite to try new things should improve.
Considering the effects on the same user base moving from a DWH to a Data Lake (assuming no co-existence):
- Operations - More effort/skill is required to explore and produce repeatable output (like reports), because there is no schema to begin with. Is this more effort than the initial setup and future maintenance of a DWH? Probably not, but consider that operations staff might manage basic SQL queries yet lack the skills to discover and join data together (at least for a time). However, once a structured view is added over that data it’s no harder for them to use than a DWH - you can even use the same tools, including SQL queries. It’s something to be aware of, but with the right approach it shouldn’t be a major factor;
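The "structured view over raw data" idea can be sketched with SQLite standing in for the lake's query layer. The raw JSON events stay untouched; a SQL view applies the schema on read, so operations users keep their familiar SQL. The event shape and field names are illustrative assumptions:

```python
# Sketch: a SQL view over raw, schemaless lake records (schema on read).
# SQLite stands in for whatever SQL-on-lake engine you actually use.
import sqlite3

raw_events = [  # raw records, stored exactly as they arrived
    '{"type": "order", "customer": "c1", "total": 30.0}',
    '{"type": "refund", "customer": "c1", "total": -5.0}',
    '{"type": "order", "customer": "c2", "total": 12.5}',
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw (payload TEXT)")  # raw storage stays as-is
db.executemany("INSERT INTO raw VALUES (?)", [(e,) for e in raw_events])

# The schema is applied at READ time, via a view over JSON extraction.
db.execute("""
    CREATE VIEW orders AS
    SELECT json_extract(payload, '$.customer') AS customer,
           json_extract(payload, '$.total')    AS total
    FROM raw
    WHERE json_extract(payload, '$.type') = 'order'
""")

rows = db.execute("SELECT customer, SUM(total) FROM orders "
                  "GROUP BY customer ORDER BY customer").fetchall()
```

Nothing about the raw data changed; only the view did, which is why adding (or reworking) such views is so much cheaper than reshaping a warehouse schema.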
- Analysts - Perhaps a bigger impact here: many of their applications (e.g. those building Data Marts) might be based on 3rd-party tools like SAS, or on stored procedures in a DWH, which may not be compatible with typical Data Lake tech like Hadoop and Spark. So lots of re-tooling and re-training may be required. But once retooled, both Operations and Analyst users should notice dramatic performance increases for longer-running tasks;
- Data Scientists - Least negatively impacted, because they will likely come from a background which encourages a Data Lake approach. They also have the most to immediately gain, and the fruits of their efforts will later give big gains back to Analysts and Operations users.
If the primary aim is to keep and analyse data in order to exploit, monetise, future-proof and innovate, then a Data Lake will be the far superior solution.
So, how do you design a Data Lake? As with a DWH, you start by considering the business/tech strategies, how to enable business capabilities with logical services, and enabling those services with tech. Next, ensure services push all interesting events (entity changes, log records, etc) into the Data Lake. Then ensure that as data arrives, good-quality metadata can be applied to it; if it can’t, reject the data back to the service or risk a swamp. On a per-user-group basis, consider what insights they are looking for, then build up views just for them (if that makes things easier), and hopefully standardise on tooling as well. For Data Scientists, give them an easy way to access/provision isolated, scalable, distributed compute/memory resources for whatever tools they want, then have Data Engineers support “productionising” their work.
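The "reject it back to the service" step can be sketched as a gatekeeping ingest function: records arriving without the metadata needed for later discovery are bounced back rather than silently dumped (which is how swamps form). The required fields here are illustrative assumptions:

```python
# Gatekeeping ingest sketch: no metadata, no entry to the lake.
# The required fields are illustrative assumptions.
REQUIRED_METADATA = {"source", "schema_version", "correlation_id"}

lake, rejected = [], []

def ingest(record):
    """Accept a record only if its metadata supports later discovery."""
    missing = REQUIRED_METADATA - set(record.get("metadata", {}))
    if missing:
        rejected.append((record, sorted(missing)))  # bounce back to the service
        return False
    lake.append(record)
    return True

ingest({"payload": "...", "metadata": {"source": "orders",
                                       "schema_version": "2",
                                       "correlation_id": "req-1"}})
ingest({"payload": "...", "metadata": {"source": "orders"}})  # rejected
```

Keeping the check at the boundary means quality problems surface immediately with the owning service, rather than years later when somebody tries to join the data.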
Some personal experiences
It’s worthwhile to also consider where you want your organisation’s IP to sit. If you end up with a cloud-based data lake/big data solution there is no IP for the organisation in that technology; rather, it’s in the algorithms that perform the work. Many companies opt to own their algorithm IP because how it works directly affects revenue, so you want to fully understand and tune the algorithms for specific use cases. For example, when I was heading up technology at a customer loyalty company, we developed our own algorithms for assigning offers to customers; the service was sold to large retailers on the effectiveness of us putting the right offers in front of the right customers. Because most offers earned points for the company (and we charged the retailer for issuing those points), the efficiency of the algorithms directly translated into increased revenue. So it was an absolute no-brainer to invest in people in that area and build those algorithms ourselves, because we needed that control and the ability to quickly innovate and monetise.
Similarly, how important is data-related IP to your organisation? For example, do you plan to sell based on your effectiveness at putting the right products in front of the right audiences?
Conclusion
If you are starting fresh (or maybe you have decided to focus hard on data exploitation, throwing off the shackles of old), then I’d start with a Data Lake. If you have no investment in a DWH, I wouldn’t start one.
But what if you already have a well-established DWH and are considering a Data Lake? Firstly, don’t do a big-bang replacement; it will likely fail miserably. Either allow the two to co-exist or follow an incremental approach. Take one use case at a time (ordered by data requirements, simplest first), and follow a broad approach like this:
- Either: bulk copy the source data over from the DWH into the Data Lake as read-only, then keep it regularly sync’ed;
- Or: duplicate the source data so it sinks into both the DWH and the Data Lake (the preferred approach, since you get the data raw, but likely more work to set up);
- Re-factor the application, carefully considering choices based on a whole bunch of factors (tech strategy, team skill set, etc);
- Once all applications that require that data have been moved over, stop feeding the data into the DWH (if it still is) and ensure the Data Lake now receives the raw source.
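The "bulk copy then keep it sync'ed" option above can be sketched as an initial full copy followed by incremental pulls of rows changed since a watermark. The row shape and field names are illustrative assumptions:

```python
# Watermark-based sync sketch: initial bulk copy from the DWH into the
# lake, then incremental pulls of rows changed since the last run.
def sync(dwh_rows, lake, watermark):
    """Copy rows with updated_at > watermark into the (read-only) lake copy."""
    new_rows = [r for r in dwh_rows if r["updated_at"] > watermark]
    lake.extend(new_rows)
    # Advance the watermark to the newest change we have seen.
    return max((r["updated_at"] for r in new_rows), default=watermark)

dwh = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
lake = []
wm = sync(dwh, lake, watermark=0)        # initial bulk copy
dwh.append({"id": 3, "updated_at": 30})  # a new row arrives in the DWH
wm = sync(dwh, lake, wm)                 # incremental sync picks it up
```

In practice you would use change-data-capture or the DWH's export tooling rather than a hand-rolled scan, but the watermark idea is the same.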
It might also be worth considering your whole data/reporting strategy at this point, to see if applications are even worth moving over or should be created anew. It’s just fine for DWHs and Data Lakes to co-exist, but you’ll probably want the Data Lake to be the sink for all the data, with the DWH pulling only the views of data it needs from the Data Lake.
If you still want a DWH longer term, then have a good look at cloud products which support scaling in a way that addresses some traditional DWH constraints; these solutions also tend to be designed to work with Data Lakes. AWS Redshift is a popular example.
Finally, to ensure I keep an open mind on these sorts of technology choices, I always consider a (usually imaginary) position:
“If we were in a start-up, without the (sometimes assumed) constraints of our historical choices/organisational structure/legacy systems how would we engineer a solution?”
Then find a really good reason for not building that solution now; if you can’t find one, just do it. Without that kind of relentless approach to pushing the boundaries you’ll be stuck in old paradigms and crusty old arguments, your business will wither, and competitors who have the courage to follow that sort of thinking will succeed in your place.