Data Modeling for Analytics

Once I was working with a customer who asked me to migrate an existing database from MS Access to SQL Server. He explained that the database had been produced some time ago by a consultant following the "best practice" approach: fit everything into one table.

Perhaps it was considered a best practice once... but I'm not old enough to remember such times. My memories stretch back only to the times when the relational data model in the Third Normal Form (3NF) was the best modeling technique. And it still makes perfect sense: this model is very close to the real-life entity-relationship model and in many cases provides a good balance between query performance and model agility. But let's start from the beginning.

Entity-relationship Model

The entity-relationship model is an abstract data model formulated in the business language. It operates on the entities defined and used by the business. In the case of a webshop, it will include entities such as a user, web session, basket, order, line item, etc.

All this works quite well within a specific business domain, so most projects stay within a single model and don't worry about anything more complex. But for analytics, the most interesting insights appear when one blends data from different domains and applications. In this case, creating a single model spanning the whole organization is questionable: the same entity can mean slightly different things to different users in the organization. For example, for a finance officer the funds available could be the total amount left on the account, while for a project manager that number should be reduced by expired earmarked funds, which are still on the account but must be returned to the donor next year.

The main driver for the enterprise data warehouse used to be the dream of a single version of the truth. The realization of this dream was hampered by the fact that quite often there are several different legitimate versions of the truth within the enterprise, which leads to the concept of a single version of the facts and the co-existence of several different models.

In any case, the abstract entity-relationship model has to be materialized in tables, files or other physical structures, and there are many ways this can be done. The data modeling approaches described below deal with this physical representation, i.e. table structures, primary/foreign keys, etc., which can hold almost any entity-relationship model, or several of them.

Relational Model

The most common approach to implementing a data model is to design tables and foreign keys according to the so-called third normal form (3NF). In most cases it directly reflects the business entities (a single business entity maps to a single table) and establishes one-to-many relationships between them. One-to-one relationships are seldom used, and many-to-many relationships are resolved with cross tables. The main advantage of this approach is that, thanks to normalization, it allows rather fast updates of individual data elements. Data elements are not duplicated; each one is stored in a single place, and changing it requires writing to a single place only. Therefore this kind of data modeling is the de-facto standard in online transaction processing (OLTP) systems, where most operations are creations of or updates to individual transactions.
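
As an illustration, here is a minimal sketch of such a 3NF layout for the webshop entities mentioned earlier, using SQLite via Python. All table and column names are hypothetical, chosen only for the example:

    import sqlite3

    # A minimal 3NF sketch (hypothetical names): one table per business entity,
    # relationships expressed through foreign keys, no attribute stored twice.
    ddl = """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT UNIQUE
    );

    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        unit_price NUMERIC NOT NULL
    );

    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        ordered_at  TEXT NOT NULL
    );

    -- Line items act as the cross table resolving the many-to-many
    -- relation between orders and products.
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL REFERENCES customer_order(order_id),
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)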

The performance of this model suffers significantly when it is used for reporting and analytics, as these tasks require full table scans and many joins between tables. These operations are expensive for traditional databases because they mean extensive reads from slow disks. The situation has been changing in recent years, as massive amounts of RAM are available even on client machines. Some end-user business intelligence tools use simple relational models with adequate performance as long as all the data fits into memory. Still, these changes apply mostly to small and medium-size databases and are not yet relevant at the enterprise scale.

The relational model in analytics forms a single-layered architecture.


Dimensional Model

An alternative data modeling approach, suitable for analytics, was popularized by Ralph Kimball in the 1990s. It is based on “facts” and “dimensions”. Here a business entity is represented by a dimension with many attributes, and relations between entities are expressed in facts, which are collections of id fields of the related dimensions plus some measurable facts about the relationship. In the case of a single business domain, it is sometimes possible to use just one fact table and several dimensions. Because the fact table links all the dimension tables, the diagram looks like a star, and the whole approach is often called a star schema.

When the business domain is large and/or there are multiple business domains, which is typical for the enterprise data warehouse, the dimensional approach suggests having several fact tables and conformed dimensions, shared between these fact tables.
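
To make this concrete, here is a minimal star-schema sketch for the same webshop (again with hypothetical names): one fact table holding foreign keys to the dimensions plus additive measures, and denormalized dimension tables that could be conformed, i.e. shared by other fact tables:

    import sqlite3

    # A minimal star-schema sketch (hypothetical names).
    ddl = """
    CREATE TABLE dim_date (
        date_id   INTEGER PRIMARY KEY,  -- e.g. 20240131
        full_date TEXT,
        month     INTEGER,
        year      INTEGER
    );

    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        country     TEXT,
        segment     TEXT
    );

    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );

    -- The fact table: foreign keys to the dimensions plus additive measures.
    CREATE TABLE fact_sales (
        date_id     INTEGER REFERENCES dim_date(date_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        quantity    INTEGER,
        amount      NUMERIC
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)

    # A typical analytical query: aggregate the fact, sliced by dimension attributes.
    query = """
    SELECT d.year, c.country, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d     ON f.date_id     = d.date_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY d.year, c.country;
    """
    print(conn.execute(query).fetchall())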

Dimensional modeling is the de-facto standard in online analytical processing (OLAP) systems and data warehousing. Most business intelligence tools expect such a schema in the data source. The invention of columnar storage engines has made this modeling even more relevant. The biggest problem with dimensional modeling is its rigidity. Making changes to the star schema, for example adding a new dimension, often requires a significant rebuild of many related facts and ETL jobs. This amount of effort makes developers reluctant to change the schema after a certain level of complexity has been reached. It corresponds to the waterfall style of software development typical of that era: the complete data model has to be done upfront and in most implementations stays almost unchanged. Thus, there are many cases when some data does not fit the model and has to be left out. The issue is even worse when the data warehouse must keep the history of facts and dimensions, i.e. reproduce some reports “as of” some date in the past.

The dimensional model is a two-layered architecture.


Data Marts Approach

Bill Inmon proposed an alternative approach to enterprise data warehousing. It suggests using the third normal form for the data warehouse itself and creating an additional layer of several data marts that represent specific business domains. The main advantage of this approach is that it allows several different models (data marts) to be built on the same facts (the data warehouse).
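
A rough sketch of the layering, reusing the hypothetical 3NF tables from the relational example above: the warehouse keeps the normalized tables, while a data mart exposes a domain-specific, denormalized view on top of them (in practice the mart is usually materialized by an ETL job rather than computed on the fly):

    # A sketch only: assumes the 3NF tables (customer, product, customer_order,
    # order_line) from the earlier relational example exist in the warehouse.
    sales_mart_ddl = """
    CREATE VIEW sales_mart AS
    SELECT o.ordered_at,
           c.name AS customer_name,
           p.name AS product_name,
           l.quantity,
           l.quantity * p.unit_price AS amount
    FROM customer_order o
    JOIN customer   c ON o.customer_id = c.customer_id
    JOIN order_line l ON l.order_id    = o.order_id
    JOIN product    p ON l.product_id  = p.product_id;
    """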

The data marts model is a three-layered architecture.


Data Vault Model

The data marts approach still does not address the problem of rigidity: modifying a database built according to the third normal form still requires a lot of effort. To address this issue, Dan Linstedt proposed the Data Vault approach, where the middle layer, called the vault, is hyper-normalized, i.e. facts and dimensions are each split into several tables, and one-to-one relations become the norm. This hyper-normalized schema is very flexible and allows new entities to be added quite easily.
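
In Data Vault terminology this split is usually expressed as hubs (the business keys), links (relations between hubs) and satellites (the descriptive, history-tracked attributes). A minimal sketch with hypothetical names:

    import sqlite3

    ddl = """
    -- Hub: only the business key plus load metadata.
    CREATE TABLE hub_customer (
        customer_hk TEXT PRIMARY KEY,  -- hash of the business key
        customer_bk TEXT NOT NULL,     -- the business key itself
        load_ts     TEXT NOT NULL,
        record_src  TEXT NOT NULL
    );

    CREATE TABLE hub_order (
        order_hk   TEXT PRIMARY KEY,
        order_bk   TEXT NOT NULL,
        load_ts    TEXT NOT NULL,
        record_src TEXT NOT NULL
    );

    -- Link: the relation between hubs, nothing else.
    CREATE TABLE link_customer_order (
        link_hk     TEXT PRIMARY KEY,
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        order_hk    TEXT REFERENCES hub_order(order_hk),
        load_ts     TEXT NOT NULL,
        record_src  TEXT NOT NULL
    );

    -- Satellite: descriptive attributes, versioned by load timestamp.
    -- New attributes (or a whole new satellite) can be added without
    -- touching the hubs or links already in place.
    CREATE TABLE sat_customer_details (
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        load_ts     TEXT NOT NULL,
        name        TEXT,
        email       TEXT,
        PRIMARY KEY (customer_hk, load_ts)
    );
    """

    sqlite3.connect(":memory:").executescript(ddl)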

The approach is not completely new; it is widely used in content management system design. The internal database of Drupal or SharePoint is highly normalized, and this allows content authors to create new types of content (articles, posts, comments with various attributes) from the user interface. All these different content types are stored in the same database, whose schema allows this flexibility. The downside is degraded performance: each query usually requires many joins, and CMS developers spend significant effort building layers of materialization to speed up the response time.

The same trade-off appears in the Data Vault model, where the data marts layer ensures query performance while the vault itself is flexible enough to accommodate almost any change request. The Data Vault model was designed to fit an agile development process: incremental data modeling with constant changes in the requirements.

The Data Vault model is also a three-layered architecture, where the middle layer is more flexible.


Data Vault is not the only hyper-normalization technique. Lars Rönnbäck created the Anchor Modeling approach, which goes up to the sixth normal form (6NF). The main driver for this technique is that "the environment surrounding a data warehouse is in constant change. A large change on the outside of the model will result in a small change within."
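
To illustrate how far the normalization goes, here is a tiny sketch in the 6NF spirit (hypothetical names, not the exact Anchor Modeling notation): the anchor holds only the identity of the entity, and every attribute lives in its own table, so adding an attribute means adding a table rather than altering an existing one:

    import sqlite3

    ddl = """
    -- Anchor: just the identity of the entity.
    CREATE TABLE anchor_customer (customer_id INTEGER PRIMARY KEY);

    -- One attribute per table (6NF), each historized independently.
    CREATE TABLE customer_name (
        customer_id INTEGER REFERENCES anchor_customer(customer_id),
        name        TEXT,
        changed_at  TEXT,
        PRIMARY KEY (customer_id, changed_at)
    );

    CREATE TABLE customer_email (
        customer_id INTEGER REFERENCES anchor_customer(customer_id),
        email       TEXT,
        changed_at  TEXT,
        PRIMARY KEY (customer_id, changed_at)
    );
    """

    sqlite3.connect(":memory:").executescript(ddl)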

Closing The Circle

We have gone through a set of increasingly complex modeling techniques. As the complexity of the model increases, so do the resources needed for the implementation. What is the best model? Obviously, there is no one-size-fits-all approach. Each project has unique requirements and constraints, and the best model has to be selected based on them.

The answer also depends on the time when the selection is made. Although data modeling is a rather conservative discipline and absorbs new technologies with some delay compared to the rest of IT, there has still been a significant shift in thinking in recent years. Progress in big data and cloud technologies, which drastically reduced costs and increased computational capacities, has changed the perception of data warehousing approaches as well. The more processing power and distributed capabilities we have, the fewer projects require complex data modeling techniques. The niche for simple architectures, like the initial data lake idea of storing raw data as files on cheap storage and running analytical queries directly against it while virtualizing away all the modeling hassle, is constantly growing. The end user of an analytical system still needs the result of a query in one table, which will be visualized by a BI tool or a data science notebook. So we may actually come back to the "single table" model as the best practice at some point in time... Until then, one has to judge the best approach for the problem at hand.
