It Wasn’t Me: Coding Database Errors and Responsibility
I like to take total responsibility for a project. If I am running in the cloud, any task I can take away from the cloud platform and hand to my database, I will. I want as much control as possible in managing data and handling errors. Taking on as much operational command as possible enforces one of the prime directives of coding software:
It is always my fault.
Whether operating a database on a cloud platform or a local operating system, if something doesn’t go right, the first culprit is my own code. To fix things, I need to read the documentation, step through my code, and look for where the bug is and fix it.
The assumption is that it's always my mess ....
Until it isn't.
In those cases, I get to do something developers love: I can focus on other people's bugs.
When they say “I didn’t do it!”
The saying “select isn’t broken” comes from the book "The Pragmatic Programmer." It maintains that it’s best to assume that any bug in the system is your bug. It means never believing anyone who says, “I didn’t do it,” even if it is the voice in your own head.
The chances of something going wrong that isn’t your fault are one in a million. Almost all of the time a bug will be yours to solve. But that contradicts another quote that I am quite fond of:
One in a million is next Tuesday.
- Gordon Letwin
For example, take a machine with less than a 1 GHz CPU. There are very few of them out there; even a Raspberry Pi, among the smallest machines, runs faster than that. A 1 GHz CPU performs a billion operations per second. At that rate, a one-in-a-million event can happen a thousand times every second.
The one-in-a-million chance that "select" is broken can happen immediately. Every second, someone can say, “I didn’t do it!” and be right.
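The back-of-envelope arithmetic is worth spelling out:

```python
ops_per_second = 1_000_000_000  # a 1 GHz core performs on the order of 1e9 operations/s
probability = 1 / 1_000_000     # a "one in a million" event

# Expected occurrences of the rare event, per second of execution
expected_per_second = ops_per_second * probability
print(expected_per_second)  # 1000.0
```

At modern clock speeds, "one in a million" is not rare at all; it is a steady drumbeat.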
Case Study in Linux Memory Management: Avoiding the Ninja
RavenDB ran on Windows for a long time. A few years ago, we started to run on Linux as well. As is the case with new code, we got some exceptions, and we began to worry when things started to crash. After scouring RavenDB for problems, we found the culprit.
It was next Tuesday. That one in a million event where “I didn’t do it!”
The problem was how Linux allocates memory. It’s kind of... quirky. Linux overcommits: it promises processes more memory than it can actually back. It operates like a (shifty) bank, one that loans out far more money than it should on the assumption that most of the depositors won’t ask for their money all at once.
The problem occurs when there is a run on the bank and everybody demands their money ... most of which is not immediately available. A bank calls the FDIC to bail it out. Linux will dispatch a ninja.
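The bank's lending policy is a kernel setting, and it is tunable. A quick way to see which mode it is operating in (a sketch, assuming a stock Linux kernel; the values are the documented ones):

```shell
# 0 = heuristic overcommit (the default "shifty bank" behavior),
# 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory

# Strict accounting: refuse allocations beyond swap plus
# overcommit_ratio percent of RAM, instead of lending out memory
# the system does not have
sudo sysctl vm.overcommit_memory=2
```

Strict accounting trades the bank run for Windows-style up-front refusals, which most distributions do not enable by default.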
The Windows operating system has a very regimented way of allocating memory, especially to processes that use lots of memory, like a database. If you write good code for Windows, you will get an error code telling you that the system ran out of memory. You will know to flush your caches, manage memory better, or find other ways to handle the error. The ball is always in your court.
Linux doesn’t work that way. When its memory limits are breached, it will send out a ninja, the out-of-memory (OOM) killer, to pick a victim process and kill it outright.
RavenDB runs everywhere. We have cloud instances where RavenDB Cloud is processing terabytes of data on hundreds of cores, and we have on-premises licenses where one Raven instance sits on a Raspberry Pi counting the number of times the garage door opens and closes.
We can never know when we are about to get slashed by the ninja. The problem can happen because of a cron job to update the system clock from the network. The NTP service needs to allocate, and there isn’t enough memory to do so. The OOM Killer will select a victim process and kill it.
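There are knobs to tilt the ninja's choice. One mitigation (an illustration, not what RavenDB does; the process name here is hypothetical) is to mark a critical process as off-limits via its OOM score:

```shell
# -1000 tells the OOM killer never to select this process;
# positive values up to 1000 make it a more attractive victim
echo -1000 | sudo tee /proc/$(pidof myapp)/oom_score_adj
```

This only redirects the ninja at someone else's process, though; it does nothing about the underlying shortage.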
In this case, it isn’t even my process that is the culprit; it is something else that I have no control over but that has a huge impact on me. You can write proper code, but when Linux kills the database with an "out of memory" error, in the client’s eyes it is the fault of the database.
Fixing the Database Problem
From a service standpoint, "select" is never broken. It may not be our fault, but it certainly is our responsibility. If we don’t handle this failure, you will have to – and we don’t want that.
Option one is to request more memory. As a software provider, nobody makes friends by asking clients to buy additional hardware to run the software they are already paying for. On the cloud, memory is a minute-by-minute expense that we all want to minimize.
RavenDB is configured not to allocate memory from the operating system directly, but to create a file on disk and memory-map it. That way, the memory RavenDB uses is file-backed and isn't counted against the commitments that trigger the out-of-memory killer. It goes into a different bucket.
If memory runs short, Linux can push those pages back to the file we created, which we can grow with ease. The ninja is still out there looking for its next victim, but RavenDB is safely beyond its reach.
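The idea can be sketched in a few lines of Python. This illustrates file-backed memory mapping in general, not RavenDB's actual implementation:

```python
import mmap
import os
import tempfile

# Back working "memory" with a file on disk instead of an anonymous
# allocation: the pages are file-backed, so under pressure the kernel
# can write them out to the file rather than hunting for a victim.
size = 1024 * 1024  # 1 MiB of file-backed space

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, size)       # reserve the file on disk
    buf = mmap.mmap(fd, size)    # map it into the address space
    buf[0:5] = b"hello"          # use it like ordinary memory
    buf.flush()                  # the kernel may also write back lazily
    assert buf[0:5] == b"hello"
    buf.close()
finally:
    os.close(fd)
    os.unlink(path)
```

Growing the backing file (another `ftruncate` plus a remap) gives you more room without asking the operating system's memory accountant for anything.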
Oren Eini is the CEO of RavenDB, a NoSQL Distributed Database that's Fully Transactional (ACID) both across your database and throughout your database cluster. RavenDB Cloud is the Managed Cloud Service (DBaaS) for easy use. He has been blogging for over 15 years about coding using his alias Ayende Rahien.