Document Databases vs Relational

In my discussions with customers, I am spending more time these days on the benefits and drawbacks of a document database (NoSQL - database like Azure Cosmos DB, or MongoDB) vs more traditional relational databases (eg SQL Server or Oracle). Having “grown up” with traditional relational databases, making the switch took some getting used to. Note this paper isn’t meant to be an exhaustive analysis…rather, the goal is to hit the high points of why a developer would consider making the switch, and when/where to use which database type.

Document Databases from a relational perspective

Back in the 90’s, I remember a friend of mine who was an Oracle DBA taking me under his wing to teach me relational database theory, Codd’s 3rd normal form, etc.  Data entities, Tables, primary keys, foreign keys, indexing, and the plethora of related technologies that literally transformed the world. It was a magical time where any problem could be solved via solid relational model that reflected the reality of the world around us. 

However, in the early 2000’s, the world of data was changing. I remember delivering an “Intro to XML” session to a thousand developers while at Microsoft. It was a new “hierarchical world” where data no longer needed to be modeled in rows and columns. I created a document, and inside the document were tags that described data (or Metadata). The idea of “ragged hierarchies” (inconsistent hierarchy where an attribute may or may not be present) promised to change how data was exchanged between organizations. 

However, XML is SO verbose. For a single data value (say 7-10 characters), it often required 4-5 times that number of characters to describe it. Yes, it compresses down nicely because of the massive duplication of characters, but it was still a huge pain to work with. Thus, in the late 2000’s, JavaScript Object Notation (JSON) was born. There are many primers on using this technology if you are interested in learning more about JSON. However, the big benefit of it was simply it solved much of the verbosity issue of XML. You could pack more data into a file with much fewer characters…really important when sharing large amounts of data.

Exhibit 1 – JSON vs XML verbosity example

It wasn’t long before someone decided to make a database out of JSON. A company called 10gen built an open source database which eventually became known as MongoDB. Like many technologies of the day, they figured out how to hash data to spread across multiple servers using a concept called “sharding”, or simply splitting large datasets across different server/storage partitions. This allowed for massive scale limited only by the number of commodity hardware servers you are able to throw at a database problem. 

This “shared nothing” approach with sharding is unlike the relational world where databases were typically limited by a single server controlling a single file system partition, usually defined by a shared Storage Area Network (SAN).  This limited the relational database’s ability to scale. In addition, the cost of scaling is not linear. As your storage/IOPS (Input Output per Second) increases, your database solution cost hit an exponential cost curve until you finally hit a theoretical limit where the server could simply no longer handle any more storage or query load.  Shared storage models simply fail in the world of Platform as a Service (PaaS) hosted database solutions at massive scale because of their inherent limitations.

It is important to note that relational database have changed with the times. The 2 most common relational databases (SQL Server and Oracle) both now have shared nothing architecture designs for those scenarios where massive scale is required. However, I have found the cost of these solutions dramatically increases when compared to the more commodity database offerings out there.  If you have a legacy application on a relational platform, you are stuck with the database support from your vendor, and a shared storage architecture may be your only option. However, if you have control over your application, and you are willing to make the jump, document databases are absolutely worth looking at…especially if you need massive scale.

Document DB…er, Cosmos DB

Circa 2014, Microsoft released it’s first document database on Azure called “Document DB”. It was a database designed “as a service” where developers could store JSON records inside single documents as files, and query them using SQL, LINQ, Javascript, or MongoDB query language. The concept of Stored Procedures and Triggers are also supported. The cost for running Cosmos DB is based on “RU’s” or “Request Units”. An RU measures the complexity of the query you are passing to the Cosmos DB. The more complex the query, the more you pay in RU’s. Of course, since this is a service, Microsoft handles the storage, tuning, backups, and all the other operational aspects of the service. Microsoft renamed Document DB to Cosmos DB in 2017, and added the ability to encrypt data to the service, along with automatic geographic redundancy and backup.

Strengths and weaknesses of document databases

So, what exactly are the strengths and weaknesses of Cosmos DB?  Here is my take on the biggest strengths:

1.      Scalability – There simply is no substitute for scalability with a document database. There is a reason why Microsoft uses Cosmos DB as a back-end database for Skype. The concept here is that you can store Petabytes of data at relatively low costs (depending on how hard you query the data). There is tremendous peace of mind knowing that your application will scale to the moon at a linear cost.

2.      Schema simplification – Let’s face it, relational schemas can get REALLY complex. To do a correct entity model requires many different tables to solve a single problem. A document-based database allows for a schema-less structure where you can add values on the fly based on whether a data value is present or not. This has the net effect of reducing the number of documents types needed to reflect what an application requires. It isn’t uncommon to have only 4-5 document types for a relatively complex application vs dozens of tables in a relational schema.

3.      Application simplification – Depending on the data requirements of your app, data retrieval can be quite simple. Need to retrieve an organization, grab the whole document and bind it to your Web client UX! No need to do complex joins across numerous tables, or write crazy complex stored procedures like you do in a relational world. Simply fetch the document, and go. Plus, using a SQL syntax makes it relatively easy for a relational developer to switch gears to Cosmos DB.

4.      Operational simplification – In the days of On-Premise databases, you hired DBA’s to manage the servers, the databases, the backups, maintain the database, etc. What a PAIN! Those days are gone with Cosmos DB. Yes, you still need to understand what is happening beneath the covers, but there is simply no comparison to the level of training and effort required to administrate a Cosmos DB database.

5.      Database security – Need to ensure your application is HIPAA compliant? Cosmos DB has you covered with encryption. This shift in 2017 to encrypting documents was a huge step forward in ensuring your data is secure at all times. Even better, you can lock down the IP addresses of incoming requests to insure only your application code is accessed from servers/services you control.

6.      Database consistency – Microsoft offers 4 consistency levels: strong, bounded-staleness, session, and eventual. If your application deals with highly stable data (not much editing/updating of documents…which is by far the most common use case), Cosmos DB is a wonderful database to use. Consistency can be enforced at a stored procedure level, and can usually be managed in a way where RU overhead can be limited and controlled. 

7.      Dynamic tuning – no need to tune or index the database. Microsoft takes care of this for you with some basic techniques for overriding if needed.

So you may be saying to yourself, why would I ever do anything else other than Cosmos DB?  I agree…it is a very compelling platform. However, it does have a few critical “gotchas” to look out for like:

1.      Read access is COSTLY – Cosmos DB does NOT do a good job of returning large numbers of records at a time. Query times on large results can take a long time, and consume tremendous numbers of RU’s. It isn’t uncommon for operations staff to see shockingly high costs for applications that require huge numbers of records to be returned in one query.

2.      Read access is crippled by “continuation tokens” – The maximum size of a returned document is 2MB. If you request more data than this, a continuation token is returned letting the application know that if you want more data, you have to do another query. This means lots and lots of RU’s, which can add up very quickly in terms of cost if you are returning massive numbers of records at a time.

3.      No joins – Want to do a join between 2 documents (eg customers and orders), sorry…you are out of luck. This can make stored procedures unnecessarily complex and painful, or even put the burden on your application developer to figure out how to pull back only the documents you need.

4.      Pagination – Again, common theme here…the more data you return, the more pain you will feel. This is especially true when writing applications that page through lots of data at a time. This is still unreasonably hard and slow with Cosmos DB.

5.      Lack of “under the covers” visibility – Want to see a query plan? Pound sand. Want to have insights into how you can optimize queries? Eh…not so much. Cosmos DB can be very much a “black box” in terms of what it is doing, and how it is doing it.

6.      Schema enforcement – Remember the days of relational databases ensuring referential integrity? Yeah…not so much with Cosmos DB. You can easily create applications that push corrupted data if you aren’t careful. Schema-less is a double-edged sword. It is wonderful at allowing a developer to do whatever they want. Unfortunately, it allows a developer to do “whatever they want”, and accidentally violate schema conventions.

7.      Schema design – It is SO hard to rethink database schema designs with Cosmos DB. Your natural approach is to model documents after tables and joins. Avoid this at all costs. The trick to designing schema in a document database is figure out the granularity of how much to store in a single document based on what the application needs. If a lot of data is written into a single document, it makes your queries fast, easy, and low RU. However, you have to consider consistency. If more than one user wants to update that single document, you have to worry about what to do if 2 users try to save to the same document at the same time. Bottom line, if data is immutable (never changes), write big fat chunky documents. If the data is unstable/constantly changing, break things down into lots of little documents to make sure nothing gets inadvertently overwritten.  And again, always remember to think about how your application consumes the data. Try to pull back as few documents as you can in one call to simplify things. This might mean duplication of data in documents, which although is a huge no-no in relational land, it may make sense to do in a document schema.

8.      Reporting – Yes this is 2 sides of the same coin, but doing reporting off of a document database can be painful and costly…again because of the pain and cost of reading lots of data at a time. Consider pushing data out to a relational database from your document database for reporting/BI purposes.

Wrapping it up

 So, when does it makes sense to do a document database? Here are some general rules to follow:

1.      If your application is read intensive, or if you need to do complex comparisons of LOTS of records in a single query, relational may be your better bet. It will execute faster, and save you money.

2.      If your application is write intensive, and reads are limited to a few records at a time, a document database might be best for you. For example, chat applications, applications with extensive logging, etc are really great document database solutions.

3.      If your application has unpredictable schema, a document database can provide many benefits with schema-less architecture.

4.      If your application requires MASSIVE scale at consistent costs, document databases are the way to go.

5.      If your application has a large requirement for ACID transactions (transactional consistency – lots of contention for the same records at the same time by different users), a relational database may be best for you.

Bottom line, a database is a tool…like any other. The key is choosing the best tool for the job. Document databases are an amazing breakthrough in many critical areas, and should absolutely be considered by any relational database developer with the right use case.

Best of luck with your database project!

PS - interesting work-around

As indicated, Document databases (specifically Cosmos DB) has significant read limitations in terms of performance and cost. If this is the only problem you have, but it appears the document database works for everything else, look into caching JSON data into Azure BLOBS, and reading the blobs directly from Azure storage in your application. This way, you only read the data out of Cosmos DB once, and read it from Azure BLOB’s many times. This works REALLY well if your data is immutable, or relatively stable. You may even consider a “sweeping function” that runs every hour or so, and looks for records that may have been updated. If you find those records, write the full JSON BLOB out to Azure Storage, and then pull the data from your application middle tier. This will give you the best of all worlds!

Why share this article?

Mike Graber is a senior Microsoft Azure Architect with significant development experience in both applications and databases. Please reach out to me any time if you are interested in hiring us for assistance with your cloud project!

Michael.Graber@ceappdev.com




  

To view or add a comment, sign in

More articles by Michael Graber

  • The Magic of Azure Functions

    One of the more interesting aspects of using different components of Azure for building out solutions is that there are…

  • Computational Loads - Serverless Architecture

    Introduction Today, there is huge buzz in the developer community on serverless architectures. In general, serverless…

  • 7 reasons why Electric Cars are in our future

    I believe in 10 years, the majority of cars on the road will be electric cars. My 7 reasons for this belief are:…

    1 Comment
  • 5 traits of successful people

    Over the years, I have compiled a list of traits that I find all successful people have. I try my best to live up to…

  • Database Hypercluster Virtual Private Cloud Architecture

    Having observed MANY IT organization’s database architecture, I believe most can cut their database licensing spending…

  • Future of the auto industry

    So lately, I have been intrigued with the announced capabilities of autonomous cars, and what it will mean to humanity…

  • Business case Enterprise IT infrastructure migration to public cloud

    So, you have read all of the IT trade press discussions about moving to the cloud, and are now considering taking that…

    1 Comment
  • Enterprise database migration to the public cloud

    Technical feasibility of moving databases to the public cloud Cloud…Cloud…Cloud. The hype around moving IT workloads to…

    2 Comments

Explore content categories