Data Modeling for Analytics

Once I was working with a customer who asked me to migrate an existing database from MS Access to SQL Server. He explained that the database had been produced some time ago by a consultant following the "best practice" approach: fit everything into one table.

Perhaps it was considered a best practice once... but I'm not old enough to remember such times. My memories stretch back only to the times when the relational data model in the Third Normal Form (3NF) was the best modeling technique. And it still makes perfect sense: this model is very close to the real-life entity-relationship model and in many cases provides a good balance between query performance and model agility. But let's start from the beginning.

Entity-relationship Model

The entity-relationship model is an abstract data model formulated in the business language. It operates on the entities defined and used by the business. In the case of a webshop, it will include entities such as a user, web session, basket, order, line item, etc.

All this works quite well within a specific business domain, so most projects stay within a single model and don't worry about anything more complex. But for analytics, the most interesting insights appear when one blends data from different domains and applications. In this case, creating a single model spanning the whole organization is questionable: the same entity can mean slightly different things to different users in the organization. For example, for a finance officer the funds available could be the total amount left on the account, while for a project manager that number should be reduced by expired earmarked funds, which are still on the account but must be returned to the donor next year.

The main driver for the enterprise data warehouse used to be the dream of a single version of the truth. The realization of this dream was hampered by the fact that quite often there are several different legitimate versions of the truth within the enterprise, which leads to the concept of a single version of the facts and the co-existence of several different models.

In any case, the abstract entity-relationship model has to be materialized in tables, files or other physical structures, and there are many ways this can be done. The data modeling approaches described below deal with this physical representation, i.e. table structures, primary/foreign keys, etc., which can hold almost any entity-relationship model, or several of them.

Relational Model

The most common approach to implementing a data model is to design tables and foreign keys according to the so-called third normal form (3NF). In most cases it directly reflects the business entities (a single business entity maps to a single table) and establishes one-to-many relationships between them. One-to-one relationships are seldom used, and many-to-many relationships are resolved with cross tables. The main advantage of this approach is that, thanks to normalization, it allows rather fast updates of individual data elements. Data elements are not duplicated; each one is stored in a single place, and changing it requires writing to a single place only. Therefore this kind of data modeling is the de-facto standard in online transaction processing (OLTP) systems, where most operations are creations of or updates to individual transactions.
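
As an illustration, here is a minimal sketch of such a 3NF layout for the webshop entities mentioned earlier, using SQLite via Python. All table and column names are hypothetical, chosen only for the example:

    import sqlite3

    # A minimal 3NF sketch (hypothetical names): one table per business entity,
    # relationships expressed through foreign keys, no attribute stored twice.
    ddl = """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    );

    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        unit_price NUMERIC NOT NULL
    );

    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        ordered_at  TEXT NOT NULL
    );

    -- Line items act as the cross table resolving the many-to-many
    -- relation between orders and products.
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL REFERENCES customer_order(order_id),
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)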

The performance of this model suffers significantly when it is used for reporting and analytics, as these tasks require full table scans and many joins between tables. These operations are expensive for traditional databases because they mean extensive reads from slow disks. The situation has been changing in recent years, as massive amounts of RAM are available even on client machines. Some end-user business intelligence tools use simple relational models with adequate performance as long as all the data fits into memory. Still, these changes apply mostly to small and medium-size databases and are not yet relevant at the enterprise scale.

The relational model in analytics forms a single-layered architecture.


Dimensional Model

An alternative data modeling approach, suitable for analytics, was popularized by Ralph Kimball in the 1990s. It is based on “facts” and “dimensions”. Here a business entity is represented by a dimension with many attributes, and relations between entities are expressed in facts, which are collections of id fields of the related dimensions plus some measurable facts about the relationship. In the case of a single business domain, it is sometimes possible to use just one fact table and several dimensions. Because the fact table links all the dimension tables, the diagram looks like a star, and the whole approach is often called a star schema.

When the business domain is large and/or there are multiple business domains, which is typical for the enterprise data warehouse, the dimensional approach suggests having several fact tables and conformed dimensions, shared between these fact tables.
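
To make this concrete, here is a minimal star-schema sketch for the same webshop (again with hypothetical names): one fact table holding foreign keys to the dimensions plus additive measures, and denormalized dimension tables that could be conformed, i.e. shared by other fact tables:

    import sqlite3

    # A minimal star-schema sketch (hypothetical names).
    ddl = """
    CREATE TABLE dim_date (
        date_id   INTEGER PRIMARY KEY,  -- e.g. 20240131
        full_date TEXT,
        month     INTEGER,
        year      INTEGER
    );

    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        country     TEXT,
        segment     TEXT
    );

    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );

    -- The fact table: foreign keys to the dimensions plus additive measures.
    CREATE TABLE fact_sales (
        date_id     INTEGER REFERENCES dim_date(date_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        quantity    INTEGER,
        amount      NUMERIC
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)

    # A typical analytical query: aggregate the fact, sliced by dimension attributes.
    query = """
    SELECT d.year, c.country, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d     ON f.date_id     = d.date_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY d.year, c.country;
    """
    print(conn.execute(query).fetchall())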

Dimensional modeling is the de-facto standard in online analytical processing (OLAP) systems and data warehousing. Most business intelligence tools expect such a schema in the data source. The invention of columnar storage engines has made this modeling even more relevant. The biggest problem with dimensional modeling is its rigidity. Making changes to the star schema, for example adding a new dimension, often requires a significant rebuild of many related facts and ETL jobs. This amount of effort makes developers reluctant to change the schema after a certain level of complexity has been reached. It corresponds to the waterfall style of software development typical of that era: the complete data model has to be done upfront and in most implementations stays almost unchanged. Thus, there are many cases when some data does not fit the model and has to be left out. The issue is even worse when the data warehouse must keep the history of facts and dimensions, i.e. reproduce some reports “as of” some date in the past.

The dimensional model is a two-layered architecture.


Data Marts Approach

Bill Inmon proposed an alternative approach to enterprise data warehousing. It suggests using the third normal form for the data warehouse itself and creating an additional layer of several data marts that represent specific business domains. The main advantage of this approach is that it allows several different models (data marts) to be built on the same facts (the data warehouse).
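
A rough sketch of the layering, reusing the hypothetical 3NF tables from the relational example above: the warehouse keeps the normalized tables, while a data mart exposes a domain-specific, denormalized view on top of them (in practice the mart is usually materialized by an ETL job rather than computed on the fly):

    # A sketch only: assumes the 3NF tables (customer, product, customer_order,
    # order_line) from the earlier relational example exist in the warehouse.
    sales_mart_ddl = """
    CREATE VIEW sales_mart AS
    SELECT o.ordered_at,
           c.name AS customer_name,
           p.name AS product_name,
           l.quantity,
           l.quantity * p.unit_price AS amount
    FROM customer_order o
    JOIN customer   c ON o.customer_id = c.customer_id
    JOIN order_line l ON l.order_id    = o.order_id
    JOIN product    p ON l.product_id  = p.product_id;
    """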

The data marts model is a three-layered architecture.


Data Vault Model

The data marts approach still does not address the problem of rigidity: modifying a database built according to the third normal form still requires a lot of effort. To address this issue, Dan Linstedt proposed the Data Vault approach, where the middle layer, called the vault, is hyper-normalized, i.e. facts and dimensions are each split into several tables, and one-to-one relations become the norm. This hyper-normalized schema is very flexible and allows new entities to be added quite easily.
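
In Data Vault terminology this split is usually expressed as hubs (the business keys), links (relations between hubs) and satellites (the descriptive, history-tracked attributes). A minimal sketch with hypothetical names:

    import sqlite3

    ddl = """
    -- Hub: only the business key plus load metadata.
    CREATE TABLE hub_customer (
        customer_hk TEXT PRIMARY KEY,  -- hash of the business key
        customer_bk TEXT NOT NULL,     -- the business key itself
        load_ts     TEXT NOT NULL,
        record_src  TEXT NOT NULL
    );

    CREATE TABLE hub_order (
        order_hk   TEXT PRIMARY KEY,
        order_bk   TEXT NOT NULL,
        load_ts    TEXT NOT NULL,
        record_src TEXT NOT NULL
    );

    -- Link: the relation between hubs, nothing else.
    CREATE TABLE link_customer_order (
        link_hk     TEXT PRIMARY KEY,
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        order_hk    TEXT REFERENCES hub_order(order_hk),
        load_ts     TEXT NOT NULL,
        record_src  TEXT NOT NULL
    );

    -- Satellite: descriptive attributes, versioned by load timestamp.
    -- New attributes (or a whole new satellite) can be added without
    -- touching the hubs or links already in place.
    CREATE TABLE sat_customer_details (
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        load_ts     TEXT NOT NULL,
        name        TEXT,
        email       TEXT,
        PRIMARY KEY (customer_hk, load_ts)
    );
    """

    sqlite3.connect(":memory:").executescript(ddl)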

The approach is not completely new; it is widely used in content management system design. The internal database of Drupal or SharePoint is highly normalized, and this allows content authors to create new types of content (articles, posts, comments with various attributes) from the user interface. All these different content types are stored in the same database, whose schema allows this flexibility. The downside is degraded performance: each query usually requires many joins, and CMS developers spend significant effort building layers of materialization to speed up the response time.

The same trade-off appears in the Data Vault model, where the data marts layer ensures query performance while the vault itself is flexible enough to accommodate almost any change request. The Data Vault model was designed to fit an agile development process: incremental data modeling with constant changes in the requirements.

The Data Vault model is also a three-layered architecture, where the middle layer is more flexible.


Data Vault is not the only hyper-normalization technique. Lars Rönnbäck created the Anchor Modeling approach, which goes up to the sixth normal form (6NF). The main driver for this technique is that "the environment surrounding a data warehouse is in constant change. A large change on the outside of the model will result in a small change within."
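
To illustrate how far the normalization goes, here is a tiny sketch in the 6NF spirit (hypothetical names, not the exact Anchor Modeling notation): the anchor holds only the identity of the entity, and every attribute lives in its own table, so adding an attribute means adding a table rather than altering an existing one:

    import sqlite3

    ddl = """
    -- Anchor: just the identity of the entity.
    CREATE TABLE anchor_customer (customer_id INTEGER PRIMARY KEY);

    -- One attribute per table (6NF), each historized independently.
    CREATE TABLE customer_name (
        customer_id INTEGER REFERENCES anchor_customer(customer_id),
        name        TEXT,
        changed_at  TEXT,
        PRIMARY KEY (customer_id, changed_at)
    );

    CREATE TABLE customer_email (
        customer_id INTEGER REFERENCES anchor_customer(customer_id),
        email       TEXT,
        changed_at  TEXT,
        PRIMARY KEY (customer_id, changed_at)
    );
    """

    sqlite3.connect(":memory:").executescript(ddl)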

Closing The Circle

We have gone through a set of increasingly complex modeling techniques. As the complexity of the model increases, so do the resources needed for the implementation. What is the best model? Obviously, there is no one-size-fits-all approach. Each project has unique requirements and constraints, and the best model has to be selected based on them.

The answer also depends on the time when the selection is made. Although data modeling is a rather conservative discipline and absorbs new technologies with some delay compared to the rest of IT, there has still been a significant shift in thinking in recent years. Progress in big data and cloud technologies, which drastically reduced costs and increased computational capacities, has changed the perception of data warehousing approaches as well. The more processing power and distributed capabilities we have, the fewer projects require complex data modeling techniques. The niche for simple architectures, like the initial data lake idea of storing raw data as files on cheap storage and running analytical queries directly against it while virtualizing away all the modeling hassle, is constantly growing. The end user of an analytical system still needs the result of a query in one table, which will be visualized by a BI tool or a data science notebook. So we may actually come back to the "single table" model as the best practice at some point in time... Until then, one has to judge the best approach for the problem at hand.
