Why your master datastore should be immutable

Thinking about how your master dataset is stored and managed is arguably the most important part of architecting a data-driven system. Your master dataset is your source of truth. An issue here, be it hardware failure, human error, or something else, could easily result in irreparable damage to some or all of your data.

I would suggest there is only one hard requirement for your master datastore: be correct. Always.

Making your master datastore immutable, i.e. append-only, really helps with this.

Human-fault tolerance

I’m sure anyone who has worked with databases for a sufficient amount of time has accidentally deleted data, or performed an incorrect update, losing some data forever. I certainly have - more than a few times.

With an immutable datastore, you can’t lose data. A mistake might cause bad data to be written, but the good data is still there, and you can always handle the bad data at load time.

No matter how careful you are, people will make mistakes, and you must do what you can to limit the impact of them.
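The idea above can be sketched in a few lines. This is a hypothetical, minimal append-only log (the class and field names are mine, not from any particular library) - mistakes can add bad records, but they can never destroy good ones, and the bad data is filtered out at load time:

```python
# Minimal append-only log: every change is a new record; nothing is
# ever updated or deleted in place. In production the records would
# live on something like HDFS or S3 rather than in memory.
class AppendOnlyLog:
    def __init__(self):
        self._records = []

    def append(self, record: dict) -> None:
        self._records.append(record)

    def read_all(self) -> list:
        return list(self._records)


log = AppendOnlyLog()
log.append({"user_id": 1, "email": "old@example.com", "ts": 1})

# A human error or bug writes bad data - but the good record survives:
log.append({"user_id": 1, "email": None, "ts": 2})

# Handle the bad data at load time, instead of mutating the store:
clean = [r for r in log.read_all() if r["email"] is not None]
```

Nothing in the store was lost: the bad record is still there if you ever want to inspect it, and the load-time filter simply ignores it.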

Bug tolerance

Even if you are the most careful person in the world, and you are the only person allowed to interact with the data, you probably still write code, and occasionally that code will have bugs.

Again, if your code does a bad update or delete, you’ve lost your data. For good.

What about backups?

Good, regular backups of your master datastore are essential. But backups are always a little out of date, and any problem with your data will soon make its way to the backup. So if you’re relying on backups, they had better be really good, and you had better notice any mistakes really quickly - and even that is not a complete guarantee.

Disadvantages of immutable datastores

Of course, there are disadvantages to immutable datastores. You might be storing data over and over again that you no longer care about (although I would argue the history of a record will have some value - you’re just not using it yet). Encoding and/or compressing your data can help with this. Using relatively cheap datastores may also make this less of a concern (HDFS, S3, etc), though make sure they are reliable.

A bigger concern might be having to apply all the updates and deletes at view time, which is probably going to be slow. In that case, maybe you don’t use this datastore for time critical views. Maybe you have another dataset, in a mutable datastore, that is used for these views. A “hot store”, if you like, optimised for these views.

It’s OK for this hot store to lose data, through bugs, human error or hardware failure, because you know you can always regenerate it from your immutable, always correct, master datastore.
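One way to regenerate that hot store is to fold the immutable log into a current-state view: the latest record per key wins, updates are just newer records, and deletes are tombstone records applied at view time. This is a hedged sketch - the function and field names (`key`, `value`, `ts`, `deleted`) are illustrative assumptions, not part of any specific system:

```python
# Rebuild a mutable "hot store" view from the immutable master log.
# Each record carries a key, a timestamp, and optionally a "deleted"
# tombstone flag (hypothetical field names).
def materialise_view(records: list) -> dict:
    view = {}
    # Replay the log in timestamp order, so later records win.
    for r in sorted(records, key=lambda r: r["ts"]):
        if r.get("deleted"):
            view.pop(r["key"], None)  # apply the delete at view time
        else:
            view[r["key"]] = r["value"]
    return view


events = [
    {"key": "user:1", "value": "alice",  "ts": 1},
    {"key": "user:2", "value": "bob",    "ts": 2},
    {"key": "user:1", "value": "alice2", "ts": 3},  # update = new record
    {"key": "user:2", "deleted": True,   "ts": 4},  # delete = tombstone
]
hot_store = materialise_view(events)
# hot_store is now {"user:1": "alice2"}
```

If the hot store is ever corrupted or lost, you simply run this replay again over the master log - which is exactly why its loss is tolerable.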

Originally published at https://andrew-jones.com/blog/why-your-master-datastore-should-be-immutable/
