Choose the right database for your job
Data is a strategic asset for every organization. As data continues to grow exponentially, databases are becoming increasingly crucial to understanding data and converting it into valuable insights. Provisioning, operating, scaling, and managing on-premises databases is tedious, time-consuming, and expensive. IT leaders end up spending more time on infrastructure management than on innovating and building new applications.
Moving on-premises data to managed databases built for the cloud can help you reduce time and costs. Once your databases are in the cloud, you can innovate and build new applications faster while getting deeper and more valuable insights. Migrating to the cloud is the first step toward entering the era of purpose-built databases. But once in the cloud, how do we know which type of database to use for which function? That is what we will cover in the rest of this blog.
Why think beyond Relational?
You might be wondering why you need to consider all these purpose-built databases when your current system works fine with the relational model. Relational databases were designed for tabular data with consistent structure and a fixed schema. They work for problems that are well defined at the outset. Traditional applications like ERP, CRM, and e-commerce need relational databases to log transactions and store structured data, typically in GBs and occasionally TBs.
While the relational database model is still essential, the relational approach alone will not work for today's world. With the rapid growth of data, not just in volume and velocity but also in variety, complexity, and interconnectedness, the needs of databases have changed. Many new applications that have social, mobile, IoT, and global access requirements cannot function properly on a relational database alone. These applications need databases that can store TBs to PBs of new types of data, provide access to data with millisecond latency, process millions of requests per second, and scale to support millions of users anywhere in the world.
To create applications that meet these demands, developers must choose among a number of purpose-built database models and understand which type is the right tool for which job.
Different database types
Now that we know why to consider different database types, let us jump into the types themselves: their advantages, their disadvantages, and where to use each.
1. Relational Databases
In relational database management systems (RDBMS), data is stored in a tabular form of columns and rows, and data is queried using the Structured Query Language (SQL). Each column of a table represents an attribute, each row in a table represents a record, and each field in a table represents a data value. Most developers are already familiar with the relational model, and it is easy to learn. An example of relational database design is shown below.
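To make the rows, columns, and SQL described above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables (customers, orders) and their columns are made up for illustration; they are not from any particular system.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# A join combines rows from both tables through the shared customer id.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()
print(totals)
```

Each row in customers is one record, each column one attribute, and the join answers a question that spans both tables without duplicating customer data in every order.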
For organizations that need to store predictable, structured data with a finite number of individuals or applications accessing it, a relational database is still the best option. A few use cases are
- Enterprise resource planning (ERP).
- Customer relationship management (CRM).
- Financial institutions, where transactional integrity is critical.
- Data Warehousing
Advantages
- Works well with structured data.
- Supports ACID transactional consistency and joins.
- Comes with built-in data integrity.
- Ensures data accuracy and consistency.
- Enforces relationships between tables through constraints.
- Supports flexible indexing on the columns you query.
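The ACID and data-integrity points above can be sketched with sqlite3 as well. This is only an illustration under assumed names: the accounts table, the CHECK rule, and the transfer helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the integrity rule: a balance may never go negative.
conn.execute("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30)         # valid transfer
rejected = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(ok, rejected, balances)
```

In the rejected transfer the credit to the destination account has already run when the debit violates the constraint, so the rollback is what keeps the two accounts consistent.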
Not Designed For
An RDBMS is not designed for semi-structured or sparse data, and scaling relational databases is very hard. If you run a serverless backend on AWS Lambda against a relational database, the database limits Lambda's scaling capabilities. I came across this beautiful article on How RDS and Lambda can play together, worth knowing.
AWS References
Amazon Relational Database Service (Amazon RDS) makes it easier to set up, operate, and scale a relational database in the cloud. Aurora Serverless is my favorite.
2. Key-value databases
A key-value database stores data as a collection of key-value pairs in which one or more keys serve as a unique identifier for the index. The schema is flexible for attributes that are not indexed, and may even be sparse. Values can be anything, ranging from simple numbers to compound strings or complete JSON documents. Key-value stores lend themselves well to sharded horizontal scaling, allowing for consistent read and write performance free from the constraints of vertically scaling any single node. For operational needs where the access patterns are known and fully indexed, key-value databases can provide consistent low-latency performance at any scale.
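The sharding idea above can be sketched in a few lines. This toy store is an assumption-heavy illustration, not how DynamoDB or any real product is implemented: the class name, the fixed shard count, and the hash-modulo placement are all made up for the example.

```python
import hashlib
import json

class ShardedKVStore:
    """Toy key-value store that spreads keys across a fixed number of shards."""

    def __init__(self, num_shards=4):
        # Each dict stands in for a separate node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        # Values are opaque to the store; here they are serialized as JSON.
        self.shards[self._shard_for(key)][key] = json.dumps(value)

    def get(self, key):
        raw = self.shards[self._shard_for(key)].get(key)
        return None if raw is None else json.loads(raw)

store = ShardedKVStore()
store.put("cart:42", {"items": ["book", "pen"], "total": 17.5})
store.put("pref:42", "dark-mode")
cart = store.get("cart:42")
print(cart)
```

Because a read or write touches only the shard that owns the key, adding shards adds capacity without making any single lookup slower. Real systems use more careful placement schemes so that data does not have to move wholesale when the shard count changes.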
A few use cases are
- Real-time bidding.
- Shopping cart.
- Product catalog.
- Customer preferences.
- IoT data collection.
- Sorted order or activity histories
Advantages
- Simple data format makes write and read operations fast.
- Value can be anything, including JSON, flexible schemas.
- Scaling decoupled from the CPU load of any single node, resulting in consistent low latency regardless of throughput requirements.
- Schema flexibility allows for sparse storage and direct developer ownership.
Not Designed For
- Not optimized for lookups by value. Such lookups require scanning the whole collection or maintaining separate index entries.
- Optimized only for data with a single key and value. Storing multiple values under one key requires the application to serialize and parse them.
- Ad-hoc or analytical access patterns.
- Synchronous aggregations or referential integrity
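The lookup limitation in the list above is easy to see with a plain dictionary standing in for the store. The keys, attributes, and helper names below are hypothetical.

```python
from collections import defaultdict

# A plain dict stands in for the key-value store.
users = {
    "user:1": {"name": "Asha", "city": "Pune"},
    "user:2": {"name": "Ben", "city": "Oslo"},
    "user:3": {"name": "Chen", "city": "Pune"},
}

def find_by_city_scan(store, city):
    """Lookup by value: every entry has to be inspected."""
    return sorted(key for key, value in store.items() if value["city"] == city)

def build_city_index(store):
    """Separate index mapping city -> keys, which must be kept in sync on writes."""
    index = defaultdict(set)
    for key, value in store.items():
        index[value["city"]].add(key)
    return index

by_key = users["user:2"]                      # direct lookup by key
scanned = find_by_city_scan(users, "Pune")    # full scan
city_index = build_city_index(users)
indexed = sorted(city_index["Pune"])          # lookup through the extra index
print(by_key, scanned, indexed)
```

The scan and the index return the same keys; the difference is that the scan's cost grows with the size of the collection, while the index moves that cost to write time.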
AWS References
Amazon DynamoDB: Fully managed (serverless) key-value database that delivers single-digit millisecond performance at any scale. Multi-region, multi-master database with built-in security, eventual and strong consistency of reads, ACID-compliant transactional operations across one or more items, backup and restore, and in-memory caching. Highest levels of availability and scaling elasticity.
I recommend watching this AWS re:Invent video on Advanced Design Patterns for DynamoDB. I also found Alex DeBrie's site useful for DynamoDB.
3. Document Database
In document databases, data is stored in JSON-like documents and JSON documents are first-class objects within the database. These databases make it easier for developers to store and query data by using the same document-model format developers use in their application code. MongoDB is a document database.
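As a rough sketch of what "the same document-model format developers use in their application code" means, the example below keeps JSON-like documents in a Python list and queries them by a nested field. The get_path and find helpers and the dotted-path syntax are assumptions for illustration; they are not the MongoDB or DocumentDB API.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.city' into a nested document."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def find(collection, dotted_path, expected):
    """Return the documents whose field at dotted_path equals expected."""
    return [doc for doc in collection if get_path(doc, dotted_path) == expected]

# Documents in one collection do not need to share the same fields.
profiles = [
    {"_id": 1, "name": "Asha", "address": {"city": "Pune"}, "tags": ["admin"]},
    {"_id": 2, "name": "Ben", "address": {"city": "Oslo"}},
    {"_id": 3, "name": "Chen", "prefs": {"theme": "dark"}},
]

in_pune = find(profiles, "address.city", "Pune")
dark_theme = find(profiles, "prefs.theme", "dark")
print(in_pune, dark_theme)
```

The documents are ordinary dictionaries, so the application can store and read them without mapping them to rows first. A real document database adds indexes so that such queries do not scan every document.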
A few use cases are
- Catalogs.
- Content management systems.
- User profiles/personalization.
Advantages
- Flexible, semi-structured, and hierarchical.
- Adjustable to application needs as databases evolve.
- Flexible schema for simple hierarchical and semi-structured data.
- Powerful indexing for fast querying.
- Naturally maps documents to objects in object-oriented code.
- Easily moves data to a persistence layer.
- Expressive query languages built for documents.
- Capable of ad-hoc queries and aggregations across documents.
Not Designed For
Explicitly defined relations between different pieces of data.
AWS References
Amazon DocumentDB: Fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. Designed from the ground up for mission-critical performance, scalability, and availability.
If you are wondering what the difference is between DynamoDB and DocumentDB, this blog might help you.
4. In-memory databases
An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism.
With the rise of real-time applications, in-memory databases are growing in popularity. The approach has been popularized by open-source software for memory caching, which can speed up dynamic databases by caching data to decrease access latency, increase throughput, and take load off the main databases.
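The caching use described above is often implemented as a "cache-aside" read path: check memory first and fall back to the main database on a miss. The sketch below is a simplified, single-process illustration; TTLCache, slow_db_lookup, and get_user are hypothetical names, and a real deployment would use a shared store such as Redis or Memcached instead of a local dictionary.

```python
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired entries are dropped on access
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

db_calls = []

def slow_db_lookup(user_id):
    """Stand-in for a query against the main database."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(cache, user_id):
    """Cache-aside read: try memory first, fall back to the database."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    value = slow_db_lookup(user_id)
    cache.set(user_id, value)
    return value

cache = TTLCache(ttl_seconds=60)
first = get_user(cache, 7)   # miss: goes to the database
second = get_user(cache, 7)  # hit: served from memory
print(first, second, db_calls)
```

The expiry time bounds how stale a cached value can get, which is the usual trade-off when the cache sits in front of a database that remains the source of truth.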
A few use cases are
- Caching.
- Session store.
- Leaderboards.
- Geospatial services.
- Pub/sub.
- Real-time streaming.
Advantages
- Sub-millisecond latency.
- Can perform millions of operations per second.
- Significant performance gains when compared to disk-based alternatives.
- Simpler instruction set.
- Support for rich command set (Redis).
- Works with any type of database, relational or non-relational, or even storage services.
Not Designed For
Persisting data to disk all the time.
AWS References
Amazon ElastiCache for Redis: Blazing fast, fully managed in-memory data store compatible with Redis. Provides sub-millisecond latency to power internet-scale, real-time applications.
Amazon ElastiCache for Memcached: Fully managed, in-memory key-value store service compatible with Memcached. Can be used as a cache or a data store. Delivers the performance, ease-of-use, and simplicity of Memcached.
Refer to Memcached vs Redis for a good understanding of the two services.
5. Graph databases
In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph.
Graph databases are a type of NoSQL database designed to make it easy to build and run applications that work with highly connected datasets. In a graph data model, relationships are first-class citizens, i.e. they are represented directly. Using specialized graph languages, like SPARQL or Gremlin, allows you to easily build queries that efficiently navigate highly connected datasets.
In graph databases, data is stored in the form of nodes, edges, and properties:
- Nodes are equivalent to records in a relational database system.
- Edges represent relationships that connect nodes.
- Properties are additional information added to the nodes.
In RDF graphs, the concepts of nodes, edges, and properties are represented as resources with Internationalized Resource Identifiers (IRIs).
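The nodes, edges, and properties described above can be sketched with plain Python structures, along with the kind of relationship traversal that graph query languages express more compactly. The sample data and the neighbors and within_hops helpers are hypothetical; this is not Gremlin or SPARQL.

```python
from collections import deque

# Nodes carry properties; edges are (source, relation, destination) triples.
nodes = {
    "alice": {"label": "Person", "city": "Pune"},
    "bob": {"label": "Person", "city": "Oslo"},
    "carol": {"label": "Person", "city": "Lima"},
    "dave": {"label": "Person", "city": "Pune"},
    "erin": {"label": "Person", "city": "Oslo"},
}
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "FOLLOWS", "dave"),
    ("alice", "FOLLOWS", "erin"),
]

def neighbors(node_id, relation):
    """Nodes reached from node_id by one edge of the given relation."""
    return [dst for src, rel, dst in edges if src == node_id and rel == relation]

def within_hops(start, relation, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors(node, relation):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                queue.append((nxt, depth + 1))
    return reached

two_hops = within_hops("alice", "FOLLOWS", 2)
print(two_hops)
```

A "people you may know" style recommendation is essentially this traversal. In a relational schema the same question would need one self-join per hop, which is why relationship-heavy queries tend to favor a graph model.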
A few use cases are
- Fraud detection.
- Social networking.
- Recommendation engines.
- Knowledge graphs.
- Data lineage.
Advantages
- Ability to make frequent schema changes.
- Quickly make relationships between many different types of data.
- Real-time query response time.
- Superior performance for querying related data–big or small.
- Meets more intelligent data activation requirements.
- Explicit semantics for each query—no hidden assumptions.
- Flexible online schema environment.
Not Designed For
- Applications that do not traverse or query relationships
- Processing high volumes of transactions
- Handling queries that span the entire database.
AWS References
Amazon Neptune: Fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets.
Neo4j is an open-source alternative to Neptune. To see how they differ, refer to Neo4J vs Neptune.
6. Time series databases
Time series databases are optimized for time-stamped or time series data. Time series data is very different from other data workloads in that it typically arrives in time order, the data is append-only, and queries are almost always over a time interval. Examples of such data include server metrics, application performance monitoring, network data, sensor data, events, clicks, trades in a market, and many other types of analytics.
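The three properties above (time-ordered arrival, append-only writes, interval queries) are what make a time series store simple to optimize. Here is a minimal sketch, assuming integer timestamps in seconds; the TimeSeries class and its method names are made up for illustration.

```python
import bisect
from collections import defaultdict

class TimeSeries:
    """Append-only series of (timestamp, value) points kept in time order."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, ts, value):
        # Points are only ever added at the end, never updated in place.
        if self.timestamps and ts < self.timestamps[-1]:
            raise ValueError("points must arrive in time order")
        self.timestamps.append(ts)
        self.values.append(value)

    def range(self, start, end):
        """Points with start <= ts < end, located by binary search."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_left(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

    def bucket_average(self, start, end, width):
        """Average value per fixed-width time bucket inside [start, end)."""
        sums = defaultdict(lambda: [0.0, 0])
        for ts, value in self.range(start, end):
            bucket = start + ((ts - start) // width) * width
            sums[bucket][0] += value
            sums[bucket][1] += 1
        return {b: total / count for b, (total, count) in sorted(sums.items())}

cpu = TimeSeries()
for ts, value in [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0), (120, 80.0)]:
    cpu.append(ts, value)

window = cpu.range(30, 120)
per_minute = cpu.bucket_average(0, 120, 60)
print(window, per_minute)
```

Because the data is already sorted by time, an interval query is two binary searches and a slice, and aggregating into time buckets is a single pass over that slice.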
A few use cases are
- DevOps.
- Application monitoring.
- Industrial telemetry.
- IoT applications.
Advantages
- Ideal for measurements or events that are tracked, monitored, and aggregated over time.
- High scalability for quickly accumulating time series data.
- Robust usability for many functions, including data retention policies, continuous queries, and flexible time aggregations.
Not Designed For
Data that is not time-ordered, such as documents, catalogs, and customer profiles.
AWS References
Amazon Timestream: Scalable, fully managed, fast time series database service for IoT and operational applications. Enables storage and analysis of trillions of events per day at 1/10th the cost of relational databases.
7. Wide column databases
Wide column databases are massively scalable. They are good for applications that require fast performance and store large amounts of data. In a wide column database, tables have schemas, and because joins are not available you should define each table schema to match your query patterns.
A few use cases are
- High-scale industrial applications for equipment maintenance, fleet management, and route optimization.
- Data logs.
- Geographic data
Advantages
- Scalable
- Flexible
Not Designed For
Running queries across multiple tables by using joins.
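Since joins are off the table, a wide column table is usually designed around one query. The sketch below borrows the partition-key and clustering-key idea from Cassandra-style tables, which is an assumption on my part rather than something covered above; WideColumnTable and the fleet data are hypothetical.

```python
import bisect
from collections import defaultdict

class WideColumnTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, columns), sorted by clustering key
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, **columns):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        position = bisect.bisect_right(keys, clustering_key)
        rows.insert(position, (clustering_key, columns))

    def select(self, partition_key, start=None, end=None):
        """Read one partition, optionally restricted to start <= clustering key < end."""
        rows = self._partitions.get(partition_key, [])
        return [
            (ck, cols)
            for ck, cols in rows
            if (start is None or ck >= start) and (end is None or ck < end)
        ]

# Table modeled for a single query: "readings for one truck, ordered by time".
readings_by_truck = WideColumnTable()
readings_by_truck.insert("truck-1", 100, lat=18.52, lon=73.85)
readings_by_truck.insert("truck-1", 50, lat=18.50, fuel=0.8)  # rows may have different columns
readings_by_truck.insert("truck-2", 70, lat=59.91, lon=10.75)

truck1 = readings_by_truck.select("truck-1")
recent = readings_by_truck.select("truck-1", start=60)
print(truck1, recent)
```

A second query pattern, such as "all trucks seen in a region", would get its own table holding the same data keyed differently. That duplication is the price of avoiding joins.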
AWS References
Amazon Keyspaces (for Apache Cassandra): Scalable, highly available, and managed Apache Cassandra–compatible database service that allows you to use your existing Cassandra Query Language (CQL) code and tools. Serverless solution for building apps that can serve thousands of requests per second with virtually unlimited throughput and storage.
For further insight into DynamoDB and Keyspaces, refer to this blog.
Conclusion
The world has changed, and the one-size-fits-all approach of using relational databases as the only store for your applications no longer works. Today’s leading developers are breaking complex applications into microservices and then picking the right purpose-built databases for the right jobs. This ensures that their applications are well architected and scale effectively. Relational databases still play an important role in application design and functionality, but purpose-built database models are designed from the ground up to perform the specific functions modern applications require—quickly and efficiently.