Moving a Customer Database from SQL to noSQL
I’m not from Oklahoma but I’m a big fan of “Show Me”. I’ve read numerous articles about noSQL databases and their uses, advantages, disadvantages, economics and scalability. All of these discussions are necessary when evaluating a shift in data, software and system architecting. However, there comes a point in time when a use case is required to make that final push towards a decision.
My use case is specifically addresses moving a customer database from a set of MySQL tables to a set MongoDB collections. The MySQL tables are assumed to be using the InnoDB engine meaning they have referential constraints and integrity. The information architecture will include customer name, address, order, product and payment information.
The Relational System
I chose to do a customer based system because it is a common model that touches upon several aspects of data design. A customer database can be a very focused model or one that is expansive and incorporates accounting, supply chain management and customer relationship management. In this example I will restrict the model to customers, products, orders and payment methods.
This should look familiar and certain business assumptions have been made. Those assumptions are:
- Orders are shipped as a single unit to a single location
- Any one Product can appear in multiple categories
- Any one Product can be sourced from multiple suppliers
- Customers can have multiple addresses
- Customers can have multiple Payment options
The noSQL Approach
There are two common mistakes made at this point: one, just trying to replicate the relational tables into collections and two, completely scrapping any notion of the relationships between the data points. Just because you are choosing to use a non-relational database does not mean your data doesn’t have relations. The relations still exist but can now be handled in more scalable and flexible constructs.
I plan to focus on two aspects of the implementation: Product Category and Payment Methods.
Product Category
The important point to remember here is the only thing that stays the same is change. A basic technique used an RDBM is the use of what I call “set” tables. The sole purpose of these tables is to allow flexibility in label naming conventions while insulating the internal working of the system from these labels that may change over time. This means that that every label has a corresponding ID that is used by the system to identify it. The label itself is used for UI purposes and can be modified without affecting functionality. There is absolutely no reason to abandon this technique when moving to a schema-less database.
Payment Info
Many years ago I remember coming across the C programming union and its’ seemly limitless possibilities. It allowed one variable to morph into any number of things and an array of unions could encapsulated a variety of information into a collection. It’s not surprising then that MongoDB refers to tables as collections and the individual elements of that collection can be anything you want.
When you think about a payment method these days it could be anything from a PayPal/Venmo Account to a Credit Card to the old school checking account. Each has its’ own set of data points. MongoDB can easily accommodate this due to the union like property of their collections being able to contain documents of different structures. Simply said, each document could be something completely different which will require sharing only a document type field which implies the contents of the document. I a very theoretical sense a single MongoDB Collection could encapsulate an entire database with this approach, equating table names with document types. While this is perfectly fine in theory it becomes a mess in practice.
MongoDB offers another option in the form of a DBref. This is about as close as MongoDB will come to relational data. As an example, if a user has a credit card as one of their payment methods the order or customer document will contain a DBref to a payment information document. The DBref, actually a small object, includes not only the MongoDB ID of the related document but also an identifier specifying the collection to find it. In this way you can have one collection for credit cards, another for PayPal and a third for checking accounts, etc. In a relational system you usually have two choices; one is to use this same multiple table approach which requires a mapping table that contains both the user’s id and the id of the payment method with an additional field that identifies its type which can be used to trigger the code to use the correct table. The second option is to have a table that can encapsulate all possible combinations of data for different payment types. My hesitation to take this approach is that you end up with a sparsely populated table and over time it becomes difficult to integrate new payment types.
To facilitate the use of this type of relation, we’ve created a function within our system to automatically de-reference DBRef fields within a document. With proper safeguards in place to control the depth of de-referencing, it defaults to one level, and the option of identifying the fields that you want de-referenced, it becomes very easy to pull a single data structure for use within the code that is sourced from multiple collections within the database.
Conclusion
It is important to stress that moving a system from a RDBM system to a document based approach is not a simple recreating the tables or a total abandonment of the relations that exist within your data. Instead it is approaching the solution from a different fresh perspective which will allow you to capitalize on the strengths of document based data storage engines while still preserving the natural relations that occur within your dataset.