NoSQL? No, SQL!

New, spangly database systems such as MongoDb have made it onto the scene over the last few years. They call themselves "NoSQL" and boast of being open source (free), able to handle high volume, schema-free (no design overhead) and in direct competition with classic Relational Database Management Systems (RDBMS) such as Oracle, MySQL and MS SQL, by being exhaustively redundant (we'll cover this later) and implementing the aforementioned "NoSQL" structure. Much to my surprise, businesses are lapping it up! I have tried to warn people of the pitfalls as I have encountered them, but it seems the warnings are falling on deaf ears. Time will tell; let's see who is ultimately right, me or the rest of the world. This is my testimony.

Having worked with MongoDb from the very early days, and having tried to incorporate "Usability" for the end User and "Back-Office BAU Activity" and "Timely MI" for the Business (front, mid and back), I have to admit that I think MongoDb is left wanting on quite a few important levels (which I will cover below), and I sincerely believe that people will eventually wake up to this problem. I just hope it's not too late by then. I would urge anyone thinking of migrating their existing systems to MongoDb or NoSQL to heed my warnings here and, at the very least, make sure you have done some very robust due diligence before committing. I suggest you insert many Documents of varying shapes and sizes into a HUGE Db for testing (a couple of thousand records will simply not suffice) and then perform some real-world MI situations such as complex conditional Sums and long Lists - not just basic Counts!

From this point on, rightly or wrongly, I will use the phrases RDBMS and SQL interchangeably, but try to keep a general context. RDBMS usually refers to the technology, while SQL is the activity performed on that technology (it is, after all, a "Structured Query Language"), but most people simply refer to an RDBMS as a SQL system, so it just becomes a bit confusing and a moot point. Also, I think reading or saying "RDBMS" is a bit of a mouthful compared to just saying "Relational Db"... but hey... I'm lazy... fire me!

To set the scene, I come from a Relational Db background (MS SQL and MySQL mainly) and I have a firm belief that SQL was never broken and didn't need fixing in the first place. So in order for me to be sold, NoSQL needs to give me something more and better than SQL currently offers. It has failed me. I am open to criticism, there are indeed many ways to skin a dead and already flogged horse, but at least read my arguments before passing judgement.

My main discussion (moaning) points are as follows:

  1. Database Design Considerations (Relativity vs Redundancy)
  2. Schema vs Documents and Key-Value Pairs
  3. Big Data vs BIG Data


Database Design Considerations (Relativity vs Redundancy)

Relativity (the SQL way)

With a classic Relational database design, we spend a little time up-front ascertaining what the main players are in the system and allocating predefined storage areas for those main players. They are invariably related in many different ways, not just along a single direction or process, which is why I believe SQL designs are very capable of handling inter-relationships and the many needs of a Business. I will use the example of a basic "School" that has "Teachers", "Students" and "Courses" for the purposes of this article. Their relationships are as follows (note: n is "any arbitrary number greater than one"; in reality there would be upper limits too, but for brevity I haven't included them here):

  1. There are 1..n Courses.
  2. Each Course has 1 Teacher.
  3. A Teacher can Teach 1..n Courses.
  4. A Course can have 1..n Students.
  5. A Student can take 1..n Courses.

Point 2 above means we can simply store our Teacher's ID directly on the Course record. Points 4 and 5 cause us an issue because we can't directly map Courses to Students and vice versa, so we need a mid-point table that stores which Students are taking which Courses (these are commonly called "mapping tables" and are very common in Relational Dbs). Shown diagrammatically, the Entity Relationship links Teachers directly to Courses, with the Course_Students mapping table sitting between Courses and Students.

Nice and straightforward. The advantage of using a mapping table for Course_Students is that we can acquire a list of all "Students per Course" and also a list of all "Courses per Student" from the one data set. This is the simple characteristic and benefit of good SQL Relational Db design: the data has been reduced to its optimised minimum "Single Truth", with no redundancy or repetition. If we change a value anywhere (such as a Student's Name), that change is reflected everywhere instantaneously, since everywhere refers to that single point in the data. Similarly, with a little more careful joining of tables, we could also acquire a distinct list of "Teachers per Student", for example, and conversely "Students per Teacher". And obviously, we can also acquire stand-alone lists of "Teachers", "Courses" and "Students". A good thing to do when dealing with data is not to lock the records while you are listing them, which can stop other people updating records elsewhere in your Business; so if it is a SELECT, use (NOLOCK) - it also speeds up the acquisition of the records significantly (though be aware that it permits dirty reads of uncommitted data).
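To make the mapping-table idea concrete, here is a rough in-memory sketch in plain JavaScript (the names and data are invented for illustration, not from any real system). Filtering the mapping rows one way gives "Students per Course", and the other way gives "Courses per Student", just as the two directions of a SQL JOIN would:

```javascript
// In-memory stand-ins for the Teachers, Courses and Students tables.
const teachers = [{ id: 1, name: 'Mr. Smith' }];
const courses = [
  { id: 10, name: 'Maths', teacherId: 1 },
  { id: 11, name: 'French', teacherId: 1 },
];
const students = [
  { id: 100, name: 'Alice' },
  { id: 101, name: 'Bob' },
];

// The Course_Students mapping table: one row per (course, student) pairing.
const courseStudents = [
  { courseId: 10, studentId: 100 },
  { courseId: 10, studentId: 101 },
  { courseId: 11, studentId: 100 },
];

// "Students per Course" -- equivalent to Courses JOIN Course_Students JOIN Students.
function studentsPerCourse(courseId) {
  return courseStudents
    .filter(cs => cs.courseId === courseId)
    .map(cs => students.find(s => s.id === cs.studentId).name);
}

// "Courses per Student" -- the same mapping table read in the other direction.
function coursesPerStudent(studentId) {
  return courseStudents
    .filter(cs => cs.studentId === studentId)
    .map(cs => courses.find(c => c.id === cs.courseId).name);
}
```

The point is that the one data set serves both questions, with each name stored exactly once.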

Just as a side note, I have personally developed systems using MS SQL containing over 7.5M Client records, spread across 8 tables and even with very complex selection criteria queries, we were successfully acquiring results in under 15 seconds. This was a combination of managing highly optimised data and a very good high end hardware infrastructure (future proofed). So managing high volume is simply not an issue for a good enterprise level RDBMS.

Redundancy (the NoSQL way)

So let's now look at how we would design that same School system using MongoDb. We could do exactly the same as with the Relational SQL Db and heavily utilise the .populate functionality (strictly a feature of the Mongoose ODM rather than MongoDb itself), but that is not the NoSQL way, and I was informed by a MongoDb expert that it is actually bad practice and was why we were experiencing slow response times. So for a fair comparison we will work to the following rule: we should be able to pull the Document(s) and not need to perform any further actions on them, or use .populate to fill out referenced data from elsewhere, in order to acquire more data; it should all be there already and ready to go. Hence the data is Redundant and does not require SQL-style JOINs, ergo "NoSQL".
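For background, here is a hand-rolled sketch of what .populate effectively does (the populateTeacher function and the data are hypothetical, and this is plain JavaScript rather than real Mongoose): a stored reference is swapped for the full referenced document at read time, which is exactly the extra round of work the rule above forbids:

```javascript
// Courses stored the "relational" way: only a reference (teacherId) is saved.
const teachersById = {
  t1: { name: 'Mr. Smith', dob: '1970-01-01' },
};
const course = { name: 'Maths', teacherId: 't1' };

// A hand-rolled stand-in for what populate does: replace the stored
// reference with the full referenced document when the course is read.
function populateTeacher(courseDoc, teachers) {
  const { teacherId, ...rest } = courseDoc;
  return { ...rest, Teacher: teachers[teacherId] };
}

const populated = populateTeacher(course, teachersById);
// populated now carries the whole Teacher document instead of a reference.
```

Under the NoSQL rule used in this article, the document must already look like `populated` when it comes off disk, with no second fetch.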

Just for background understanding, MongoDb uses JSON (JavaScript Object Notation) for structuring data and it is converted to BSON (Binary JSON) when saving to the Db itself. To further muddy the waters, in true NoSQL, we would not even define this structure, we would allow all sorts of different Documents into the same Collection and then worry about how to get the right Documents back out later (or leave that problem to someone else further down the line as is usually the case these days - no forward thinking going on whatsoever).

So for Course interrogation and usage:

var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var CourseSchema = new Schema({

    Name: String,
    // Note: Mongoose has no "Integer" schema type, so Number is used
    Academic_Year: Number,

    // Note: Teacher is a JSON Document INSIDE the Course Document
    Teacher: {
        Name: String,
        DOB: Date,
        NI_Number: String
    },

    // Note: Students is an Array of JSON Documents INSIDE the Course Document
    Students: [{
        Name: String,
        Gender: Boolean,
        DOB: Date
    }]
});
        

Note here how the same Teacher will be duplicated on numerous Course Documents. The same goes for the Students, as they will more than likely be taking more than one Course (English, Maths, French, Tech Drawing, Home Economics, Knitting...). Obviously, if we can't perform real-time SQL JOINs, that isn't the only way the data needs to be structured; the above structure is only really any good for people needing information about Courses. So now we need the definition for people wanting information who are coming at the data from the Students' perspective:

var StudentSchema = new Schema({

    Name: String,
    Gender: Boolean,
    DOB: Date,

    // Note: Courses is an Array of JSON Documents INSIDE the Student Document
    Courses: [{
        Name: String,
        Academic_Year: Number, // again, Number rather than a non-existent Integer type

        // Note: Teacher is an embedded Document INSIDE each Course Document
        Teacher: {
            Name: String,
            DOB: Date,
            NI_Number: String
        }
    }]
});
        

So here, the Courses would be duplicated on each Student and the same Teachers repeated time and time again within each Course Document. And finally for people wanting information relative to Teachers (we probably shouldn't even use the word "relative" there should we):

var TeacherSchema = new Schema({
    Name: String,
    DOB: Date,
    NI_Number: String,

    // Note: Courses is an Array of JSON Documents INSIDE the Teacher Document
    Courses: [{
        Name: String,
        Academic_Year: Number, // Number again, as Mongoose has no Integer type

        // Note: Students is an Array of JSON Documents INSIDE the Courses Array
        Students: [{
           Name: String,
           Gender: Boolean,
           DOB: Date
        }]
    }]
});
        

And this time, we have lots of duplicated Course Documents and even more duplicated Student Documents within each Course Document! This is just getting silly now!

Yet that is Redundancy for you: lots of repetition and duplication so that you don't need to go away and fetch other Data from elsewhere. I know this has been long-winded, and I am sorry I have had to put you through this painful, repetitive process, but there is method to my apparent madness. I needed to demonstrate the ground rules we are working with here; this is ACTUALLY what you are meant to do with MongoDb and NoSQL, believe it or not. So the data volume can become huge, with lots and lots of the same data all over the place, and I'm pretty damned sure that's not what DBMs and MDMs were thinking when they coined the phrase "Big Data". You save the Documents as you wish them to be pulled and used, with no further manipulation, hence no SQL JOINs. So there are one or two issues I have with this approach to Db design and maintenance:

  1. The Collections are 100% Redundant - no more work needs to be done to acquire a Course, a Teacher or a Student. All the data is there for you, ready to use, relative to the Document type that is required. But it takes a lot of disk space and maintenance to keep "The Truth", and also a lot of bandwidth when acquiring long lists of these types of data structures.
  2. We need the different Document types because we can't JOIN Collections in real-time (according to NoSQL stipulations), and it also depends on what the usage requirements are for the aforementioned "Usability" for the end User and "Back-Office BAU Activity" and "Timely MI" for the Business. So we need the Collections ready to go, dependent upon those various Business needs. The Business needs may change and grow, so we could end up designing yet more Document (Schema) types that would all also need maintaining moving forward. This is like spending all day in meetings trying to work out why nothing is getting done! We are spending more time manipulating and restructuring existing data than we are developing more inclusive and engaging systems! Apologies, I'll leave my conclusions to the end from now on, but a little bit of venting is required occasionally so as not to explode with frustration.
  3. The Documents themselves can end up being very BIG, cumbersome and overkill if all I wanted was a list of the Course Names, for example. MongoDb attempts to deal with this variance in Document size by allocating an average-sized space on disk for each new Document, according to what it has handled up to now. So if you happen to have ONE Document that is unusually large (say, one with a lot of Students), then all other Documents will be allocated unusually large(r) areas on disk, even though those spaces will NEVER be populated with anything, ever, at all. What a waste of space! Literally.
  4. The Redundancy (repetition of the same data in multiple places) means that if, say, a Course Name changes, a Teacher's NI Number was incorrect, or a Student's (fluid) Gender has changed this morning (but may be something different again this afternoon), you have to go through all the available Collections, Documents and Sub-Arrays and update every single instance of the corrected Course or Teacher or Student - or ALL of them, actually. Prior knowledge of WHERE those duplicated Documents are likely to be found is also necessary, so as to ensure you always update ALL of them EVERY time. For example, if you create a new Collection for, say, Exams, your Data Manager would need to ensure the CrUDs now happened to that Collection as well. If they forgot, or didn't know, or there was an interruption in the process, then you are suddenly and catastrophically hurled backwards, hair on fire, into a world where there are Documents for the same Entity (Student, Teacher, Course, Exam) holding different variations of "The Truth", and that is never a nice world to find yourself burning alive in. This is exhaustive work and a drain on resources - on the processor and system, on the database size, and on the Developers as well. Also, bear in mind that if the Document that has been updated is an element in an Array, then the entire Array needs to be updated on the Document it is embedded within, so you are now passing huge amounts of data back and forth between the database and your codebase (more than likely node.js), just to maintain a Student's ever-changing Gender, a Teacher's NI_Number or a Course Name change. This should be a simple CrUD task for a menial piece of information, but it has now affected the entire database and hence its overall integrity. Redundant? Hardly. Your database and codebase servers have been made to work very hard indeed, as a matter of fact, and for very little actual gain!
  5. Interrogating any of the Documents or Collections for specific items such as a specific Teacher can result in "out of memory" or "request timeout" errors because the Documents have to be interrogated one at a time (based on there being no Document Definitions or Schemas within MongoDb itself - it doesn't know the structure of each Document until it opens it up and has a peek - I cover this under Key-Value Pairs next). This makes real-time MI more and more difficult the more data has built up in your Collections because MongoDb is having to check the structure before even applying your interrogation or MI calculation logic to the Document. In short - Interrogation becomes truly exhaustive, resource hungry, slow and will ultimately result in an "out of memory" or "request timeout" error. Granted, our School example probably would get away with it for a few years being relatively (there's that word again) low volume, but as you can see, this risk would go exponential with Audit or Logging systems.
  6. Which Document holds "The Truth" for any Course, Student or Teacher? If I am referring to a Student, but that Student Document is held in its entirety, multiple times, in StudentSchema, CourseSchema and TeacherSchema, then which one takes precedence when the more than likely "out of memory" or "request timeout" occurs at run-time and a roll-back has to happen? Which one can be regarded as "The Truth" at the expense of all the others? And good luck ascertaining how far your original CrUD event got prior to the crash as well; I'd like to see you pick your way through that particular mess!
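To illustrate issue 4 above, here is a small in-memory sketch (the collection names and the renameTeacherEverywhere helper are hypothetical, invented for this example) of the update fan-out: one menial change to a Teacher's Name has to be chased through every Collection that embeds a copy, and missing any one of them silently forks "The Truth":

```javascript
// Denormalised collections, as described above: the same Teacher document
// is embedded in Courses and (inside Courses) in Student documents too.
const coursesCollection = [
  { Name: 'Maths', Teacher: { NI_Number: 'AB123456C', Name: 'Mr. Smith' } },
];
const studentsCollection = [
  {
    Name: 'Alice',
    Courses: [
      { Name: 'Maths', Teacher: { NI_Number: 'AB123456C', Name: 'Mr. Smith' } },
    ],
  },
];

// Every embedded copy must be found and rewritten; miss one and the
// collections disagree about "The Truth" for that Teacher.
function renameTeacherEverywhere(niNumber, newName) {
  let touched = 0;
  for (const c of coursesCollection) {
    if (c.Teacher.NI_Number === niNumber) { c.Teacher.Name = newName; touched++; }
  }
  for (const s of studentsCollection) {
    for (const c of s.Courses) {
      if (c.Teacher.NI_Number === niNumber) { c.Teacher.Name = newName; touched++; }
    }
  }
  return touched; // number of duplicated copies that had to be rewritten
}
```

Even in this toy case one rename touches two copies; add an Exams collection and the function itself has to be amended, which is precisely the maintenance trap described above.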


Schema vs Documents and Key-Value Pairs

As can be seen from the previous section, it would still be good practice to generate a Schema for your data, even though you are managing a Document based system that is Schema-Less such as MongoDb. The reason for this is that as a Designer, Developer and Business, you need to know what data is available to you in order to make good use of it. I have heard of Businesses that have pumped literally tons of data into Schema-Less Document Collections for future use, but then later found (as I could have told them beforehand), that the data is useless because they don't know what they have and therefore have no idea how to use it!

So, my ranting over re Schema-Less systems being better off still using Schemas (based on the simple, obvious requirement for a Business to manage its data properly), let's now look at the differences between RDBMS and Document-driven NoSQL Dbs with regard to the way they store their data. This will be a simple overview, as I don't want to over-complicate things for general readers and this article is a long one already (please stick with me, these are very valid points made by someone who has learned the hard way).

RDBMS (SQL) Schemas

So here would be the very basic requirements for the School Db using an RDBMS, using the same data types as the NoSQL examples given above. The four tables amount to: Teachers (ID INTEGER, Name VARCHAR(50), DOB DATE, NI_Number VARCHAR(9)); Courses (ID INTEGER, Name VARCHAR(50), Academic_Year INTEGER, Teacher_ID INTEGER); Students (ID INTEGER, Name VARCHAR(50), Gender BIT, DOB DATE); and Course_Students (Course_ID INTEGER, Student_ID INTEGER).

All the columns have been reduced to their bare minimum so as to aid storage size, acquisition and download speeds. This would in effect become the "Single Truth", because if any one item is changed at any time, all other areas that refer to that value are updated instantaneously. As stated earlier, this is a very basic example with no audit trail or roll-back.

Now let's look at what the data size on disk would be for each table's record; we will assume 64 bits for an INTEGER, 8 bits per VARCHAR character and 64 bits for a DATE.

  • Teachers = 64 + (8 x 50) + 64 + (8 x 9) = 600 bits.
  • Courses = 64 + (8 x 50) + 64 + 64 = 592 bits.
  • Students = 64 + (8 x 50) + 1 + 64 = 529 bits.
  • Course_Students = 64 + 64 = 128 bits.
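Those per-record figures can be sanity-checked with a few lines of JavaScript mirroring the same assumptions (64-bit INTEGER and DATE, 8 bits per VARCHAR character, 1 bit for the Gender flag):

```javascript
// Bit sizes assumed in the article's calculation.
const INT = 64;   // INTEGER
const DATE = 64;  // DATE
const CHAR = 8;   // one VARCHAR character

const teachersRow = INT + CHAR * 50 + DATE + CHAR * 9; // ID, Name(50), DOB, NI_Number(9)
const coursesRow  = INT + CHAR * 50 + INT + INT;        // ID, Name(50), Academic_Year, Teacher_ID
const studentsRow = INT + CHAR * 50 + 1 + DATE;         // ID, Name(50), Gender (1 bit), DOB
const mappingRow  = INT + INT;                          // Course_ID, Student_ID
```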

There isn't much that needs saying about RDBMS Schemas: they are succinct, and the Schema definitions themselves do not take up any space on the records, because the space is allocated according to each record having the same columns every single time (varying with the VARCHAR field values, granted, but they will never be larger, and that would just confuse things right now for no benefit whatsoever).

That said, note how I can use Column Names of variable / long length at zero expense to the Db size on disk or its efficiency during manipulations. The data is aligned relative to its position in the Definition, so what that space is actually "called" is completely irrelevant and is only a human reference ultimately. The only downside to long Column Names in SQL is when using those Names in JOINS and STORED PROCEDURES etc., the commands themselves can get cumbersome if you are being truly stupid with your naming conventions - there is no need to go to the absolute Nth degree with regards to this on your Db tables - just make them understandable and not so vague no-one knows what they are. They do not affect the actual size of the Db on disk or in memory, this is the key point here.

NoSQL Documents and Key-Value Pairs

NoSQL systems, and Document-driven Dbs specifically, don't store a Schema Definition for their Collections. This means that a Document of any shape and size, with any set of Key-Value Pairs, can be within (or not), including Arrays (or not) and Sub-Documents (or not). This can cause problems if you are expecting a Key-Value Pair and it is not there because the Document structure is different from what you are expecting, so there is already a coding overhead in having to check for an item being present before trying to manipulate it. That is not my actual point here, but the issue makes an appearance again when attempting to interrogate MongoDb NoSQL for the very same reason: "if exists" conditions everywhere, stealing yet more of my precious processor time! While on this point, it is worth noting that interrogating Arrays within Documents is also difficult, so if you are trying to Sum or Count or List on conditional Array elements (or just specific Array elements) - good luck with that! (The alternative in SQL would be a separate mapping table just for the Array list items, which is much easier to interrogate and manipulate, along with lists of sub-sets.)
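Here is a minimal sketch of that "if exists" tax, using plain JavaScript objects standing in for Documents (the data is invented for illustration): because two Documents in the same Collection can have different shapes, even a simple Count has to check each one before touching it:

```javascript
// Two documents from the same collection with different shapes -- perfectly
// legal in a schema-less store, so every read must check before it touches.
const docs = [
  { Name: 'Maths', Students: [{ Name: 'Alice' }, { Name: 'Bob' }] },
  { Name: 'Knitting' }, // no Students key at all
];

// Counting students means an existence check on every single document.
function countStudents(collection) {
  let total = 0;
  for (const doc of collection) {
    if (Array.isArray(doc.Students)) { // the "if exists" tax, paid per document
      total += doc.Students.length;
    }
  }
  return total;
}
```

Drop the `Array.isArray` guard and the second document throws at run-time; with a schema, the check would be unnecessary by construction.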

To make things a little clearer, a Key-Value Pair is where the Key e.g. "Name" is stored with the Value e.g. "Andrew D. Foster", so a Document containing a single Key-Val Pair would look like this:

{
   Name: "Andrew D. Foster"
}
        

I won't go into great detail here, but Key-Vals are actually special cases of Array elements based on the JSON model (or is it the other way round, that Arrays are special cases of Key-Val Pairs?). The Document structure is not defined anywhere, and so the Data within a JSON Document needs to define itself ON EVERY SINGLE DOCUMENT IN THE COLLECTION. Let me make this VERY clear - every single Document that you save to a Collection stores within itself not only the actual Data, but its Definition as well. I learned this little beaut a little too late while learning MongoDb on the job (a POC that went into Production) and realised that, in light of this requirement and bandwidth/volumes, you should probably use single-letter Keys to save on disk space once you have millions of records, and for efficiency on interrogation and retrieval. Having long(er) Key names actually means you're making the overall Document size bigger, ergo more for your bandwidth and processors to handle! But as I'm sure you can appreciate, it would then be difficult and slower to manage and maintain Documents if all you had in each was a series of element Keys sequentially named "a", "b", "c", "d", "e", with a dictionary file to identify what each element ACTUALLY IS! Can you tell that I'm not a great fan of NoSQL or MongoDb yet? It's still not given me any benefits; all I am feeling is a sense of loss, greater expense and that deep, empty feeling that something has gone wrong, very wrong, or will do soon.
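The cost of long Key names can be seen by serialising the same record twice. This is a rough sketch using JSON string lengths as a stand-in for BSON sizes (the real on-disk figures will differ, but the per-document Key overhead is the point):

```javascript
// The same record with descriptive keys and with single-letter keys.
// Because every document carries its own keys, the difference is paid
// again on every single document in the collection.
const verbose = { Name: 'Andrew D. Foster', Academic_Year: 2016 };
const terse   = { a: 'Andrew D. Foster', b: 2016 };

const verboseBytes = JSON.stringify(verbose).length; // key names stored in full
const terseBytes   = JSON.stringify(terse).length;   // same values, shorter keys
const savedPerDoc  = verboseBytes - terseBytes;      // overhead per document
```

Multiply `savedPerDoc` by millions of Documents and the trade-off between readable Keys and disk/bandwidth cost becomes very real, which is exactly the dilemma described above.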

So, that said, let's look at the example CourseSchema basics and assume that we have already implemented a method of limiting the String types to 50 characters:

  • "_id" (8 x 3) + Value (8 x 32) = 280 bits (a mandatory field you have very little control over; these sometimes insert themselves on embedded Documents too, so as to identify related / duplicated Documents that have come from elsewhere).
  • "Name" (8 x 4) + Value (8 x 50) = 432 bits.
  • "Academic_Year" (8 x 13) + Value (64) = 168 bits.
  • 280 + 432 + 168 = 880 bits.


For the purposes of comparing apples with apples, I am not going to include the CourseSchema's Teacher and Students elements in this data size calculation, but as can be seen already, 880 bits per Document is roughly one and a half times the 592-bit SQL equivalent. Bear in mind that in NoSQL MongoDb, the Course data is also on the StudentSchema (as many times as each Student has taken a Course) AND the TeacherSchema (as many times as the Teacher gives a Course), so I'm sure you can appreciate that the database size on disk will begin to explode and really slow down your system at higher volumes! Not to mention the already-discussed allocation of average-sized Document spaces when creating new Documents. Oh boy! Whose idea was it to try and compete with SQL?


Big Data vs BIG Data

Business level SQL systems such as Oracle and MS SQL can handle billions of records and also handle multiple connections with the right hardware and infrastructure, so that particular argument for NoSQL being better than SQL is a Straw Man for them to beat their very tiny drums upon.

Given the level of duplication of the same data, the storage of sometimes long Key-Val Pairs and the pre-allocation of space, the size of the data on disk can explode for no good reason when you consider the alternative of Schema-driven SQL. This also extends to issues such as Boolean values "seeming" to be stored as literal "true" and "false" values rather than well-optimised 0s and 1s, which can be interrogated far quicker later on.

There seems to me to be a big difference between Big Data in the sense of storing billions of records, and the size of the data on disk exploding exponentially, i.e. being BIG in the sense of being BLOATED for no good reason. This is inefficient use of resources right across the board, IMHO; "small-fast-reliable" is slowly being eroded by "bigger-slower-don't ask, it keeps crashing and I can't pull anything beyond 20,000 records".

Big Data is Good, lots of useful and usable information that I can get to logically and quickly is of benefit to all in any Business.

BIG FAT BLOATED OVERWEIGHT Data, that keeps repeating on you - IS NOT!


Conclusion

Database Design Considerations (Relativity vs Redundancy)

Redundant Systems are by design NOT storing a "Single Truth" for a Business. By definition, NoSQL Document driven databases REQUIRE duplication and repetition of the exact same data for different uses and so there is an incredibly high risk of missing elements and a slowing of production while data is simply being "maintained" by the data management team. IMHO, this is an unacceptable resource overhead of using NoSQL when considering that SQL, when designed properly, is by definition THE "Single Truth" as it springs right out of the stalls like a young heifer in springtime.

Schema vs Documents and Key-Value Pairs

Unstructured Data is useless. Anyone that advocates dumping different type and shape Documents into the same Collection and maintaining several versions of the same information just because it gets used differently across a business, should be taken outside, handed their last cigarette and promptly shot - they are clearly being paid far too much for what they do and people should not be listening to them. The overhead on all areas of a business - resources, development, disk space, efficiency, accuracy, management, delivery, production... are all jeopardised when work is being repeated over and over again for different areas of the same business.

Big Data vs BIG Data

Large amounts of information are great for a business, be it historical data to compare against, easily accessible pre-calculated statistics or just sheer numbers of Users. But there is a Big difference between Big Data as in Volume, and BIG FAT BLOATED Data, where the Structure requires it to be so and the Data quickly begins to look like a beached whale!

The fact that the data itself becomes overweight and bloated by having to store the same Data Definitions on every single Document in the Collection is a pointless and unnecessary overhead as well. To reiterate an earlier point, just because disk space is cheap nowadays does not mean we should be filling it up with garbage that adds to processor time when manipulating and interrogating.

My final point, on data size and specifically regarding MI, is very important. Because MongoDb doesn't know the Data Definitions of any of the Documents in any given Collection (not helped by the pre-allocation of average-sized Documents), you can very quickly start experiencing "out of memory" or "request timeout" errors when trying to retrieve long lists or perform MI on large or complex Collections (and with no pagination), especially if what you are seeking is embedded in Sub-Documents or Arrays. This only became apparent to me personally when the volumes hit critical mass, which is why I am raising the alarm now for anyone thinking of using MongoDb or NoSQL in the near future. You need to be very clear that you have done your due diligence on MASSIVE data sets using real-world MI situations such as complex conditional Sums and long Lists - not just basic Counts!

"Small-Fast-Reliable" not "Bulky-Slower-Can We Get back To You?", which is a phrase you will hear regularly from your MongoDb team. Fancy a bet? :-)

Andrew D. Foster.

Structured Query Language rather than Sequential Query Language, but good post and an interesting read. Worth considering. I'd be interested to see patterns/correlations in the choice of DB vs the choice of language and if that's likely to be due to them being the right tools for the job or if it's because one is more commonly used with the other.

Great Article Andrew. Impressed with the real world comparisons you put forward. Personally, I’d go with the structured approach every time. Seems like the UI would need to carry the majority of the logic with the NoSQL route, which would otherwise be carried out on the data layer. I dare say that there's a place for NoSQL, although not one I've encountered which would make me choose it over the proper SQL approach.

Great article Sheff, really enjoyed reading this and agree RDBMS isn't broken, but NoSQL DBs have a place; it's just about understanding what that place is and in what context. I've spent quite a lot of time contemplating polyglot database architecture, using Mongo for reads and SQL for writes. As much as I like this approach, I'm still to find the right solution to deliver on it. That said, there is no reason why different applications can't use the most appropriate db type, as you would your programming language.

For sure bud. Just remember though, it can only be counted as a mistake if you don't learn from it. So... Lesson Learned!
