Data models will never become obsolete

Well, for some of you the title might sound like a given, but over the years I have found that this is far from a universal truth. About ten or so years ago, when data lakes were first hyped big time thanks to good cloud availability, I wrote a blog about the need for data warehouses. At the time, data lakes were marketed as yet another silver bullet to solve all the problems that came with DWs. The marketing promise was roughly: “Just put all your data in it and then use it in any way you like.” I did not buy into this claim, so I wrote a blog about it. My intent was not to defend traditional ETL-based data warehouse systems on relational databases, but rather to defend the need for data models and data modeling. The blog received a mixed response.

Even though I didn’t mention any technologies to accomplish the required architecture, I was greeted with arguments about how my opinion was misled by the technology stack I was supposedly “promoting”. And fair enough, I was and am dedicated to Microsoft technologies. Still, I felt greatly misunderstood, and probably because I lacked the required experience, I wasn’t really able to process the feedback correctly. It left me discouraged from sharing my ideas publicly for a long time, which was unfortunate. Now I feel it might finally be time to “lift the cat onto the table” again (a Finnish idiom about bringing up a difficult subject 😉).

 

Data modeling is here again, or is it?

Previously, most of the criticism was about the technology stack being insufficient to provide a system that could handle the entire spectrum of structured, semi-structured, and unstructured data. I feel we are now well past the point where this might have been true. And still, my opinion remains the same: we need a centralized, governed data model for the data we gather to be useful in many cases. Now, centralized might be a bit strict a way of putting it; what I mean is that different parts of a business need to agree on the terminology of core business entities and how they are represented in their data. This data modeling is needed so that data produced by these different units can be joined together for organization-wide analytics.

The data model doesn’t need to be a physical one. If your data model and transformations don’t require storing copies of the data for technical reasons like security or performance, then copies shouldn’t be made. On the other hand, there is a claim that no materialization of data is ever needed, and this is simply not true. Sometimes storing data is purely a performance and cost optimization: for example, very badly formatted source data structures can make it impossible to run well-optimized queries, so you need to reformat (model) the data to make it useful. But even when you don’t need to create an actual physical data model, the virtual model is still a model.
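To make the distinction concrete, here is a minimal sketch (in Python with SQLite, purely as an illustration; the table and column names are made up) of the same model living first as a virtual view and then as a materialized table:

```python
import sqlite3

# In-memory database standing in for a source system; all names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, cust TEXT, amt_cents INTEGER)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                [(1, "acme", 1250), (2, "acme", 800), (3, "umbrella", 4300)])

# Virtual model: a view applies the agreed naming and units
# without storing a second copy of the data.
con.execute("""
    CREATE VIEW orders AS
    SELECT id AS order_id,
           cust AS customer_name,
           amt_cents / 100.0 AS amount_eur
    FROM raw_orders
""")

# Materialization only if performance, cost, or security requires it:
# the very same model, now stored as a table.
con.execute("CREATE TABLE orders_materialized AS SELECT * FROM orders")

print(con.execute(
    "SELECT customer_name, SUM(amount_eur) FROM orders "
    "GROUP BY customer_name ORDER BY customer_name").fetchall())
```

The view and the materialized table expose exactly the same model to their consumers; the only question left is whether storing the copy pays for itself.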

The problematic thing with data models, even if they are not physical ones, is that they are hard to build. Building them requires both business experts and technical data specialists, which means very different kinds of people need to come together and talk. Business experts could view this as an opportunity to mentor developers to understand their business, making them more efficient at their job. I don’t think I know many data specialists who wouldn’t jump at the opportunity to learn about core business concepts that link to the work they are doing. When done correctly, the different levels of data modeling (conceptual, logical, and physical) can become incredibly useful communication tools. So in my opinion, even though data models take time and effort to build, they are very much worth it.

Now there seem to be two camps on data modeling. One camp sees data models as critical, overarching communication tools required for large enterprise-wide data solutions. The other camp proposes that data models should only be built as small, use-case-specific implementations. I think both are needed, and I am equally worried about going all-in on either approach. Putting everything into one wide and carefully modeled structure is, in my opinion, where the old data warehouses failed. Instead, we should identify the cross-business entities needed for cross-business analytics and create a base data model for them. On top of these general models we should then build the business- and case-specific implementations. This way the results from these specialized data models and calculations become more easily shareable and widely adoptable.
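As a rough sketch of what I mean, assume a hypothetical shared Customer entity that two use-case-specific models build on (all names here are invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical shared entity agreed on across business units.
@dataclass
class Customer:
    customer_id: str
    name: str

# Use-case-specific models reference the shared entity instead of
# redefining "customer" for themselves, so their results can be
# joined organization-wide on the agreed key.
@dataclass
class ChurnScore:          # an analytics team's model
    customer_id: str       # references the shared Customer entity
    probability: float

@dataclass
class CreditLimit:         # a finance team's model
    customer_id: str
    limit_eur: float

shared = Customer("C-42", "Acme Oy")
churn = ChurnScore(shared.customer_id, 0.12)
credit = CreditLimit(shared.customer_id, 50_000.0)
# Both specialized results join back on the same key.
print(churn.customer_id == credit.customer_id)
```

The base model carries only what the units must agree on; everything specific to one use case stays in that use case's own model.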

 

Never forget the swamps

A few years after data lakes were widely adopted, we saw the failures in them. Data was loaded into the data lakes as proposed, but often with little to no consideration of data models or, quite frankly, any data governance thinking whatsoever. Data was stored, but it was hard to access or in some cases even to find. Sometimes no one even knew anymore what was in these data lakes, as governance through data modeling had been forgotten. The data swamps were born, and even though every kind of data was available, it couldn’t be utilized. Today this type of thinking might even be scary, as GDPR and other privacy laws need to be acknowledged. Is storing something without knowing what it specifically contains even an option anymore?

No matter how you might answer the question of storing data without defining its content, I hope you agree that data needs to be discoverable and understandable. This, in my perspective, is the first level of data modeling: breaking raw data down into a format where it can be understood and is somewhat useful. The next level of modeling is to add the business context to it. I can only take a deep, hard look into my fortune-teller’s glass orb and try to predict what is to come. One claim that wouldn’t surprise me is that AI will make data models obsolete: just let it read all the data and ask the questions you like. In the very far future this might happen, but not anytime soon. If we don’t provide the business context, also known as the data model, to the AI algorithms, no technical tool can make sense of the raw data. So the next time someone claims that data models are obsolete, I hope we all remember the reasons we built data warehouses in the first place and why data lakes became data swamps.

 

Light at the end of the tunnel

The Data Lakehouse, in my view, is very much what I roughly expected ten years ago to happen: a solution that can handle and support data in all its forms and use cases. Some data might actually be useful, and even preferred, in its raw format, like large documents for training AI models. Other data, like streaming data, needs just a bit of support from a metadata model to define what each tag means and how it represents the real world. And then there is data that simply doesn’t make sense, or is unusable for analytical tools, without an actual well-defined data model. When building our data platform solutions, we need to support all these use cases. It is important not to forget that this is as much a data governance and management challenge as it is a technical one. I would even argue that it’s mostly a governance and management challenge. After all, any technology is just a tool, not a solution.
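For the streaming case, such a metadata model can be as simple as a lookup that attaches business meaning to raw tags. A hypothetical sketch in Python (the tag names and fields are made up for illustration):

```python
# Hypothetical metadata model for streaming sensor data: the raw tags
# carry no business meaning on their own, the model supplies it.
TAG_METADATA = {
    "T-001": {"entity": "Boiler 1", "measure": "temperature", "unit": "°C"},
    "P-014": {"entity": "Boiler 1", "measure": "pressure", "unit": "bar"},
}

def enrich(reading: dict) -> dict:
    """Attach business context to a raw stream reading."""
    meta = TAG_METADATA.get(
        reading["tag"],
        {"entity": "unknown", "measure": "unknown", "unit": ""},
    )
    return {**reading, **meta}

raw = {"tag": "T-001", "value": 81.4, "ts": "2024-01-01T12:00:00Z"}
print(enrich(raw))
```

The stream itself stays raw; the thin metadata model is what turns "tag T-001 reads 81.4" into something an analyst can actually use.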

I am happy that the madness of data lakes as they were marketed is mostly over and we are heading toward more well-governed data solutions. But this was not the first, or the last, time data models were deemed unnecessary. The same thing has happened a couple of times already during my career; the first time I encountered it was with the emergence of self-service BI tools, and it may have happened a few times before my time as well. So even though this is not the direction the industry is currently heading, I think we will hear the claim a few more times before I am finally allowed to retire. Because of this, I feel this is one of those opinions of mine that will stand the test of time very well. I will let you know if I ever change my mind on the subject.
