Data models will never become obsolete

Well, for some of you the title might sound like a given, but over the years I have found that this is far from a universal truth. About ten or so years ago, when data lakes were first hyped big time thanks to good cloud availability, I wrote a blog about the need for data warehouses. At the time, data lakes were marketed as yet another silver bullet to solve all the problems that came with DWs. The marketing promise was roughly: “Just put all your data in it and then use it in any way you like.” I did not buy into this claim, so I wrote a blog about it. My intent was not to defend traditional ETL-based data warehouse systems on relational databases, but rather to defend the need for data models and data modeling. The blog received a mixed response.

Even though I didn’t mention any technologies to accomplish the required architecture, I was greeted with arguments about how my opinion was misled by the technology stack I was supposedly “promoting”. And fair enough, I was and am dedicated to Microsoft technologies. Still, I felt greatly misunderstood, and probably because I lacked the required experience, I wasn’t really able to process the feedback correctly. It left me discouraged from sharing my ideas publicly for a long time, which was unfortunate. Now I feel it might finally be time to “lift the cat onto the table” again (a Finnish idiom about bringing up a difficult subject 😉).

 

Data modeling is here again, or is it?

Previously, most of the criticism was about the technology stack being insufficient to provide a system that could handle the entire spectrum of structured, semi-structured, and unstructured data. I feel we are now well past the point where this might have been true. And still, my opinion remains the same: we need a centralized, governed data model for the data we gather to be useful in many cases. Now, centralized might be a bit strict a way of putting it; what I mean is that different parts of a business need to agree on the terminology of core business entities and how they are represented in their data. This data modeling is needed so that data produced by these different units can be joined together for organization-wide analytics.

The data model doesn’t need to be a physical one. If your data model and transformations don’t require storing copies of the data for technical reasons like security or performance, then copies shouldn’t be made. On the other hand, there is a claim that no materialization of data is ever needed, and this is simply not true. Sometimes storing data is purely a performance and cost optimization: for example, very badly formatted source data structures can make it impossible to run well-optimized queries, so you need to reformat (model) the data to make it useful. But even when you don’t need to create an actual physical data model, the virtual model is still a model.
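To make the distinction concrete, here is a minimal sketch (in Python with SQLite, purely as an illustration; the table and column names are made up) of the same model living first as a virtual view and then as a materialized table:

```python
import sqlite3

# In-memory database standing in for a source system; all names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, cust TEXT, amt_cents INTEGER)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                [(1, "acme", 1250), (2, "acme", 800), (3, "umbrella", 4300)])

# Virtual model: a view applies the agreed naming and units
# without storing a second copy of the data.
con.execute("""
    CREATE VIEW orders AS
    SELECT id AS order_id,
           cust AS customer_name,
           amt_cents / 100.0 AS amount_eur
    FROM raw_orders
""")

# Materialization only if performance, cost, or security requires it:
# the very same model, now stored as a table.
con.execute("CREATE TABLE orders_materialized AS SELECT * FROM orders")

print(con.execute(
    "SELECT customer_name, SUM(amount_eur) FROM orders "
    "GROUP BY customer_name ORDER BY customer_name").fetchall())
```

The view and the materialized table expose exactly the same model to their consumers; the only question left is whether storing the copy pays for itself.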

The problematic thing with data models, even if they are not physical ones, is that they are hard to build. Building them requires both business experts and technical data specialists, which means very different kinds of people need to come together and talk. Business experts could view this as an opportunity to mentor developers to understand their business, making them more efficient at their job. I don’t think I know many data specialists who wouldn’t jump at the opportunity to learn about core business concepts that link to the work they are doing. When done correctly, the different levels of data modeling (conceptual, logical, and physical) can become incredibly useful communication tools. So in my opinion, even though data models take time and effort to build, they are very much worth it.

Now there seem to be two camps on data modeling. One camp sees data models as critical, overarching communication tools required for large enterprise-wide data solutions. The other camp proposes that data models should only be built as small, use-case-specific implementations. I think both are needed, and I am equally worried about going all-in on either approach. Putting everything into one wide and carefully modeled structure is, in my opinion, where the old data warehouses failed. Instead, we should identify the cross-business entities needed for cross-business analytics and create a base data model for them. On top of these general models we should then build the business- and case-specific implementations. This way the results from these specialized data models and calculations become more easily shareable and widely adoptable.
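As a rough sketch of what I mean, assume a hypothetical shared Customer entity that two use-case-specific models build on (all names here are invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical shared entity agreed on across business units.
@dataclass
class Customer:
    customer_id: str
    name: str

# Use-case-specific models reference the shared entity instead of
# redefining "customer" for themselves, so their results can be
# joined organization-wide on the agreed key.
@dataclass
class ChurnScore:          # an analytics team's model
    customer_id: str       # references the shared Customer entity
    probability: float

@dataclass
class CreditLimit:         # a finance team's model
    customer_id: str
    limit_eur: float

shared = Customer("C-42", "Acme Oy")
churn = ChurnScore(shared.customer_id, 0.12)
credit = CreditLimit(shared.customer_id, 50_000.0)
# Both specialized results join back on the same key.
print(churn.customer_id == credit.customer_id)
```

The base model carries only what the units must agree on; everything specific to one use case stays in that use case's own model.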

 

Never forget the swamps

A few years after data lakes were widely adopted, we saw the failures in them. Data was loaded into the data lakes as proposed, but often with little to no consideration of data models or, quite frankly, any data governance thinking whatsoever. Data was stored, but it was hard to access or in some cases even to find. Sometimes no one even knew anymore what was in these data lakes, as governance through data modeling had been forgotten. The data swamps were born, and even though every kind of data was available, it couldn’t be utilized. Today this type of thinking might even be scary, as GDPR and other privacy laws need to be acknowledged. Is storing something without knowing what it specifically contains even an option anymore?

No matter how you might answer the question of storing data without defining its content, I hope you agree that data needs to be discoverable and understandable. This, in my perspective, is the first level of data modeling: breaking raw data down into a format where it can be understood and is somewhat useful. The next level of modeling is to add the business context to it. I can only take a deep, hard look into my fortune-teller’s glass orb and try to predict what is to come. One claim that wouldn’t surprise me is that AI will make data models obsolete: just let it read all the data and ask the questions you like. In the very far future this might happen, but not anytime soon. If we don’t provide the business context, also known as the data model, to the AI algorithms, no technical tool can make sense of the raw data. So the next time someone claims that data models are obsolete, I hope we all remember the reasons we built data warehouses in the first place and why data lakes became data swamps.

 

Light at the end of the tunnel

The Data Lakehouse, in my view, is very much what I roughly expected ten years ago to happen: a solution that can handle and support data in all its forms and use cases. Some data might actually be useful, and even preferred, in its raw format, like large documents for training AI models. Other data, like streaming data, needs just a bit of support from a metadata model to define what each tag means and how it represents the real world. And then there is data that simply doesn’t make sense, or is unusable for analytical tools, without an actual well-defined data model. When building our data platform solutions, we need to support all these use cases. It is important not to forget that this is as much a data governance and management challenge as it is a technical one. I would even argue that it’s mostly a governance and management challenge. After all, any technology is just a tool, not a solution.
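For the streaming case, such a metadata model can be as simple as a lookup that attaches business meaning to raw tags. A hypothetical sketch in Python (the tag names and fields are made up for illustration):

```python
# Hypothetical metadata model for streaming sensor data: the raw tags
# carry no business meaning on their own, the model supplies it.
TAG_METADATA = {
    "T-001": {"entity": "Boiler 1", "measure": "temperature", "unit": "°C"},
    "P-014": {"entity": "Boiler 1", "measure": "pressure", "unit": "bar"},
}

def enrich(reading: dict) -> dict:
    """Attach business context to a raw stream reading."""
    meta = TAG_METADATA.get(
        reading["tag"],
        {"entity": "unknown", "measure": "unknown", "unit": ""},
    )
    return {**reading, **meta}

raw = {"tag": "T-001", "value": 81.4, "ts": "2024-01-01T12:00:00Z"}
print(enrich(raw))
```

The stream itself stays raw; the thin metadata model is what turns "tag T-001 reads 81.4" into something an analyst can actually use.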

I am happy that the madness of data lakes as they were marketed is mostly over and we are heading toward more well-governed data solutions. But this was not the first, or the last, time data models were deemed unnecessary. The same thing has happened a couple of times already during my career; the first time I encountered it was with the emergence of self-service BI tools, and it may have happened a few times before my time as well. So even though this is not the direction the industry is currently heading, I think we will hear the claim a few more times before I am finally allowed to retire. Because of this, I feel this is one of those opinions of mine that will stand the test of time very well. I will let you know if I ever change my mind on the subject.
