Denormalisation - Just as much

Prerequisites

The following concepts should be known:

  • Star schema
  • Facts and dimensions (SCD type 1 and 2)
  • Denormalisation - what it means

Problem statement

Denormalisation is an important technique for improving the performance of reporting and analytical queries, and it is often used with data marts and warehouses. As data volumes have grown exponentially, it has become critical to get optimal performance (with acceptable cost trade-offs) out of a data warehouse. Denormalisation has pros and cons, and most of the time it is the right balance between them that gives the best results. This post discusses some of the factors to consider.

Overview

Let's start with the pros and cons of denormalisation.

Pros:

  • Faster response to queries
  • Less compute cost per query

Cons:

  • Needs more space
  • Needs more planning - you need a good idea of the query profile and data relationships in advance
  • More complex to implement and maintain, since data pipelines need more coordination and code; this also adds delay to data loads

With modern technologies I think we can ignore the space constraint in most cases. The cost of storage has fallen significantly, and with compression and columnar databases it is likely to be the last of your concerns.

With these pros and cons in mind, let's look at how much to denormalise.

Denormalising Dimensions

I think following a star schema is usually the right level of denormalisation for dimensions. It strikes the right balance between the flexibility of analysing data across various dimensions and the complexity of queries and data loading. Dimensions don't take up much space anyway. There is plenty of material available on star schemas if you want to know more.

Sometimes it is natural to end up with a snowflake or hybrid schema instead of a strict star schema, due to the hierarchical nature of the dimensions themselves. The point remains that it is better to denormalise to keep queries simple, and since the space consumed by dimensions is not much of an issue, flattened hierarchies are the better option for large-scale analysis. You can also maintain the individual tables of the hierarchy, as in a snowflake schema, while additionally exposing a flattened, denormalised table or view whose surrogate key is stamped on the facts: a single key representing the entire hierarchy.
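
Here is a minimal sketch of that pattern using Python's built-in sqlite3 module; the product hierarchy and all table and column names are hypothetical:

    # Keep the normalised snowflake tables, but expose one flattened
    # dimension whose surrogate key is stamped on the facts.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Normalised hierarchy: category -> subcategory -> product
        CREATE TABLE dim_category (
            category_key INTEGER PRIMARY KEY,
            category_name TEXT);
        CREATE TABLE dim_subcategory (
            subcategory_key INTEGER PRIMARY KEY,
            subcategory_name TEXT,
            category_key INTEGER REFERENCES dim_category (category_key));
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,  -- the key stamped on facts
            product_name TEXT,
            subcategory_key INTEGER REFERENCES dim_subcategory (subcategory_key));

        -- Flattened view: a single surrogate key (product_key)
        -- represents the entire hierarchy.
        CREATE VIEW dim_product_flat AS
        SELECT p.product_key, p.product_name,
               s.subcategory_name, c.category_name
        FROM dim_product p
        JOIN dim_subcategory s ON s.subcategory_key = p.subcategory_key
        JOIN dim_category c ON c.category_key = s.category_key;
    """)

Facts then carry only product_key, and a single join to dim_product_flat reaches every level of the hierarchy.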

Even after following a star schema, the number of joins needed for analysis can become huge, because many dimensions are stamped on the facts. This not only makes queries complex but also hurts query response times. And that brings us to the next topic.

Denormalising Facts

Let me start off by saying that I have seen projects where large-scale facts were completely denormalised, and the strategy failed. No doubt it gave the best query performance with the simplest queries, but there are other negative factors too (apart from the cons mentioned earlier):

  • It is very costly to update large-scale facts when a dimensional value changes. If you have an SCD type 1 dimension (only the latest value is kept) and frequent updates are expected, don't denormalise that dimension into the facts (see the sketch after this list)
  • Late-arriving attributes - not all attributes of a dimension arrive from the same source. Often the surrogate key is available but the attribute we want to denormalise is not available in time. Since we usually want to load facts as soon as possible, it is better to avoid denormalising such attributes
  • I said space is not much of a concern, but denormalising large-scale facts with hundreds of attributes will have a significant impact on storage cost
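
To illustrate the first point, here is a hedged sketch (again using sqlite3, with hypothetical table and column names) of why SCD type 1 changes hurt denormalised facts: the same change touches one row in a star schema but every matching fact row once the attribute is stamped on the facts.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, segment TEXT);
        CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
        -- Denormalised variant: segment stamped on every fact row
        CREATE TABLE fact_sales_denorm (customer_key INTEGER, segment TEXT, amount REAL);
    """)

    # Star schema: an SCD type 1 change is a single-row update.
    con.execute(
        "UPDATE dim_customer SET segment = 'Enterprise' WHERE customer_key = 42")

    # Denormalised facts: the same change must rewrite every fact row for
    # that customer, potentially millions of rows on a large fact table.
    con.execute(
        "UPDATE fact_sales_denorm SET segment = 'Enterprise' WHERE customer_key = 42")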

In short, I do not recommend denormalising everything (all dimensions), nor denormalising large-scale facts. It makes more sense to denormalise aggregates: apply denormalisation at the last data layer, the one exposed for reporting, and aggregates are most often that last layer.
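
As a rough sketch (all table and column names are hypothetical), the last data layer might be an aggregate like the one below, with the chosen dimension attributes stamped on it while the underlying facts stay in a star schema:

    # Illustrative DDL only; the facts and dimensions keep their star
    # schema, and denormalisation happens in the reporting aggregate.
    AGG_SALES_DAILY = """
    CREATE TABLE agg_sales_daily AS
    SELECT f.date_key,
           c.segment,         -- stamped from a high-usage dimension
           p.category_name,   -- stamped from a high-usage dimension
           SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY f.date_key, c.segment, p.category_name;
    """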

But if not all dimensions, then which dimensions should we denormalise, or stamp on aggregates? Answering that question takes some analysis. A good approach is to list all the dimensions and separate them into three buckets, viz. high, medium and low, based on their importance and usage. Analyse existing reports and dashboards, and talk to business users. Dimensions used in more than 70% of reports, dashboards or business analyses are "high"; between 40% and 70% is "medium"; less than 40% is "low". Always target denormalising the "high" ones; for the "medium" ones it depends (perhaps pick the important ones); and don't bother with the "low" ones.
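
A minimal sketch of this bucketing in Python (the usage shares are hypothetical; in practice they would come from auditing your reports and dashboards):

    # Share of reports/dashboards in which each dimension appears.
    usage = {
        "customer": 0.85,
        "product": 0.75,
        "region": 0.55,
        "promotion": 0.30,
    }

    def bucket(share: float) -> str:
        if share > 0.70:
            return "high"    # always target for denormalisation
        if share >= 0.40:
            return "medium"  # denormalise selectively
        return "low"         # leave in the dimension tables

    for dim, share in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(f"{dim}: {bucket(share)}")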

Conclusion

In most solutions that expose a data layer for reporting, cost is a function of storage and compute, so we can define optimal usage as better performance at lower cost. Denormalisation is an important tool for achieving optimal usage, but overdoing it can have a negative impact.
