Data Replication and Partitioning

The goal of partitioning is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). There are various reasons why you might want to distribute a database across multiple machines —

  • Scalability: If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
  • Fault tolerance/high availability: If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.
  • Latency: If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world.

Doing this requires choosing a partitioning scheme that is appropriate to your data, and rebalancing the partitions when nodes are added to or removed from the cluster. There are two common ways data is distributed across multiple nodes—

  • Replication: Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy — if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance.
  • Partitioning: Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes (also known as sharding).

Replication

There are many trade-offs to consider with replication: for example, whether to use synchronous or asynchronous replication, and how to handle failed replicas. Those are often configuration options in databases, and although the details vary by database, the general principles are similar across many different implementations. All of the difficulty in replication lies in handling changes to replicated data. There are three popular algorithms for replicating changes between nodes — single-leader, multi-leader, and leaderless replication.

Single-Leader Replication

Each node that stores a copy of the database is called a replica. With multiple replicas, a question inevitably arises: how do we ensure that all the data ends up on all the replicas? Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data. The most common solution for this is called leader-based replication (also known as active/passive or master–slave replication).

One of the replicas is designated the leader (also known as master or primary). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage. The other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys). Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader. And from time to time, you need to set up new followers — perhaps to increase the number of replicas, or to replace failed nodes.
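
As a rough illustration of that flow, here is a minimal sketch in Python (with made-up Leader and Follower classes, not any real database's API): the leader applies each write locally, appends it to a replication log, and ships unapplied entries to its followers, which apply them in the same order.

```python
class Follower:
    def __init__(self):
        self.data = {}          # local copy of the database
        self.applied_up_to = 0  # how far into the leader's log we have applied

    def apply(self, entries):
        # Apply changes strictly in the order the leader processed them.
        for offset, key, value in entries:
            self.data[key] = value
            self.applied_up_to = offset


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []           # replication log / change stream
        self.followers = followers

    def write(self, key, value):
        # 1. Apply the write to local storage.
        self.data[key] = value
        # 2. Append it to the replication log.
        self.log.append((len(self.log) + 1, key, value))
        # 3. Ship any log entries each follower has not yet applied.
        for f in self.followers:
            f.apply(self.log[f.applied_up_to:])


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:42", {"name": "Alice"})
assert all(f.data == leader.data for f in followers)
```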

This mode of replication is a built-in feature of many relational databases, such as PostgreSQL (since version 9.0), MySQL, Oracle Data Guard, and SQL Server’s AlwaysOn Availability Groups. It is also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso. Finally, leader-based replication is not restricted to databases: distributed message brokers such as Kafka and RabbitMQ’s highly available queues also use it. Some network filesystems and replicated block devices such as DRBD are similar.

Multi-Leader Replication

Leader-based replication has one major downside: there is only one leader, and all writes must go through it. If you can’t connect to the leader for any reason, for example due to a network interruption between you and the leader, you can’t write to the database. A natural extension of the leader-based replication model is to allow more than one node to accept writes. Imagine you have a database with replicas in several different datacenters (perhaps so that you can tolerate failure of an entire datacenter, or perhaps in order to be closer to your users). With a normal leader-based replication setup, the leader has to be in one of the datacenters, and all writes must go through that datacenter. In a multi-leader configuration, you can have a leader in each datacenter. Within each datacenter, regular leader-follower replication is used; between datacenters, each datacenter’s leader replicates its changes to the leaders in the other datacenters.

Some databases support multi-leader configurations by default, but it is also often implemented with external tools, such as Tungsten Replicator for MySQL, BDR for PostgreSQL, and GoldenGate for Oracle. Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different data centers, and those write conflicts must be resolved.
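
One common (though lossy) way to resolve such conflicts is last-write-wins: each write carries a timestamp and, when two datacenters have modified the same record concurrently, the version with the later timestamp is kept. A minimal sketch, purely illustrative and not tied to any particular database:

```python
def last_write_wins(local, remote):
    """Pick one of two conflicting versions of the same record.

    Each version is a (timestamp, value) pair; the later timestamp wins.
    Note that the losing write is silently discarded, which is why
    last-write-wins is a lossy conflict-resolution strategy.
    """
    return local if local[0] >= remote[0] else remote


# Two datacenters accepted concurrent writes to the same key:
dc1_version = (1_700_000_010, {"title": "A"})
dc2_version = (1_700_000_012, {"title": "B"})
print(last_write_wins(dc1_version, dc2_version))  # keeps {'title': 'B'}
```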

A replication topology describes the communication paths along which writes are propagated from one node to another. The most general topology is all-to-all, in which every leader sends its writes to every other leader. However, more restricted topologies are also used: for example, MySQL by default supports only a circular topology, in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology has the shape of a star: one designated root node forwards writes to all of the other nodes. The star topology can be generalized to a tree.
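
In circular and star topologies, a node must not forward a write it has already processed, or the change would loop forever; for this reason each node is given a unique identifier, and writes are tagged with the identifiers of the nodes they have passed through. The sketch below illustrates that forwarding rule with hypothetical node objects arranged in a ring:

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.data = {}
        self.next = None  # the node this one forwards writes to (circular topology)

    def receive(self, key, value, seen):
        if self.node_id in seen:
            return  # the write has already travelled the full circle; stop forwarding
        self.data[key] = value
        self.next.receive(key, value, seen | {self.node_id})


# Build a ring of three leaders: a -> b -> c -> a
a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next, c.next = b, c, a

a.receive("k", "v", set())  # a write originating at node a
assert a.data == b.data == c.data == {"k": "v"}
```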

Partitioning

Replication keeps multiple copies of the same data on different nodes, which works well as long as the whole dataset can fit on each node. But for very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions, a process also known as sharding. Normally, partitions are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one partition. The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster. Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL, and more recently rediscovered by NoSQL databases and Hadoop-based data warehouses. What we call a partition here is called a shard in MongoDB, Elasticsearch, and SolrCloud; it’s known as a region in HBase, a tablet in Bigtable, a vnode in Cassandra and Riak, and a vBucket in Couchbase.

Partitioning by Key Range

One way of partitioning is to assign a continuous range of keys (from some minimum to some maximum) to each partition, like the volumes of a paper encyclopedia. If you know the boundaries between the ranges, you can easily determine which partition contains a given key. If you also know which partition is assigned to which node, then you can make your request directly to the appropriate node (or, in the case of the encyclopedia, pick the correct book off the shelf). This partitioning strategy is used by Bigtable, its open source equivalent HBase, RethinkDB, and MongoDB before version 2.4. Within each partition, we can keep keys in sorted order. This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query. For example, consider an application that stores data from a network of sensors, where the key is the timestamp of the measurement (year-month-day-hour-minute-second). Range scans are very useful in this case, because they let you easily fetch, say, all the readings from a particular month.
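
As a small illustration of key-range routing (the partition boundaries here are hypothetical), a sorted list of boundary keys plus a binary search is enough to find the partition that owns a given key:

```python
import bisect

# Exclusive upper-bound key of each partition, in sorted order; for example,
# partition 0 holds keys < "2024-04", partition 1 holds keys < "2024-07", and so on.
boundaries = ["2024-04", "2024-07", "2024-10", "~"]  # "~" sorts after any date string

def partition_for(key: str) -> int:
    # Binary-search for the first boundary greater than the key.
    return bisect.bisect_right(boundaries, key)

print(partition_for("2024-03-15T08:00:00"))  # -> 0
print(partition_for("2024-08-01T12:30:00"))  # -> 2
```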

However, the downside of key range partitioning is that certain access patterns can lead to hot spots. If the key is a timestamp, then the partitions correspond to ranges of time — e.g., one partition per day. Unfortunately, because we write data from the sensors to the database as the measurements happen, all the writes end up going to the same partition (the one for today), so that partition can be overloaded with writes while others sit idle.

Partitioning by Hash of Key

Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed: a 32-bit hash function, for example, returns a seemingly random number between 0 and 2³² − 1 for any input string, and even if the input strings are very similar, their hashes are evenly distributed across that range of numbers. For partitioning purposes, the hash function need not be cryptographically strong: for example, Cassandra and MongoDB use MD5, and Voldemort uses the Fowler–Noll–Vo function. Many programming languages have simple hash functions built in (as they are used for hash tables), but they may not be suitable for partitioning: for example, in Java’s Object.hashCode() and Ruby’s Object#hash, the same key may have a different hash value in different processes.
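
A minimal sketch of hash partitioning, using MD5 only because the text mentions it; the simple modulo assignment shown here is the most basic scheme, whereas systems such as Cassandra typically assign each partition a contiguous range of hash values:

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    # Hash the key to a large, uniformly distributed integer,
    # then map that integer onto one of the partitions.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

# Even very similar keys (e.g. consecutive sensor timestamps) spread across partitions.
for key in ["2024-08-01T12:00:00", "2024-08-01T12:00:01", "2024-08-01T12:00:02"]:
    print(key, "->", partition_for(key))
```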

Partitioning by Secondary Indexes (by Document)

The partitioning schemes discussed so far rely on a key-value data model. If records are only ever accessed via their primary key, we can determine the partition from that key and use it to route read and write requests to the partition responsible for that key. The situation becomes more complicated if secondary indexes are involved. A secondary index usually doesn’t identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user 123, find all articles containing the word hogwash, find all cars whose color is red, and so on. Secondary indexes are the bread and butter of relational databases, and they are common in document databases too.

Imagine, for example, that you are operating a website for selling used cars. You want to let users search for cars, allowing them to filter by color and by make, so you need a secondary index on color and make (in a document database these would be fields; in a relational database they would be columns). If you have declared the index, the database can perform the indexing automatically. For example, whenever a red car is added to the database, the database partition automatically adds it to the list of document IDs for the index entry color:red. In this indexing approach, each partition is completely separate — each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions. Whenever you need to write to the database — to add, remove, or update a document — you only need to deal with the partition that contains the document ID that you are writing. For that reason, a document-partitioned index is also known as a local index (as opposed to a global index).
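
A minimal sketch of a document-partitioned (local) index, using made-up in-memory structures rather than a real database: each partition indexes only its own documents, so a write touches a single partition, but a query on the secondary index must scatter/gather across all of them.

```python
from collections import defaultdict

class Partition:
    def __init__(self):
        self.docs = {}                       # doc_id -> document
        self.color_index = defaultdict(set)  # local secondary index: color -> doc_ids

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.color_index[doc["color"]].add(doc_id)


partitions = [Partition(), Partition()]

def write(doc_id, doc):
    # A write only touches the partition that owns the document ID.
    partitions[doc_id % len(partitions)].put(doc_id, doc)

def find_by_color(color):
    # A secondary-index read must query every partition and combine
    # the results (scatter/gather).
    return [p.docs[i] for p in partitions for i in p.color_index.get(color, ())]

write(191, {"color": "red", "make": "Honda"})
write(306, {"color": "red", "make": "Ford"})
print(find_by_color("red"))  # results gathered from both partitions
```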

Partitioning by Secondary Indexes (by Term)

Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning. A global index must also be partitioned, but it can be partitioned differently from the primary key index. For example, red cars from all partitions appear under color:red in the index, but the index is partitioned so that colors starting with the letters a to r appear in partition 0 and colors starting with s to z appear in partition 1. The index on the make of car is partitioned similarly (with the partition boundary being between f and h). We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index. Here, a term would be color:red, for example. The name term comes from full-text indexes (a particular kind of secondary index), where the terms are all the words that occur in a document.

As before, we can partition the index by the term itself, or using a hash of the term. Partitioning by the term itself can be useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning on a hash of the term gives a more even distribution of load. The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient — rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
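
A minimal sketch of a term-partitioned (global) index, again with hypothetical in-memory structures and a single a-to-r / s-to-z boundary for simplicity: index entries are routed by term, so a read contacts only one index partition, while indexing a single document may touch several.

```python
from collections import defaultdict

# Two index partitions: terms whose value starts with a to r live on partition 0,
# terms starting with s to z live on partition 1 (a deliberately simplified boundary).
index_partitions = [defaultdict(set), defaultdict(set)]

def index_partition_for(value: str) -> int:
    return 0 if value[0].lower() <= "r" else 1

def index_document(doc_id, doc):
    # Indexing one document may touch several index partitions,
    # one for each term that appears in the document.
    for field in ("color", "make"):
        term = f"{field}:{doc[field]}"
        index_partitions[index_partition_for(doc[field])][term].add(doc_id)

def find(field, value):
    # A read contacts only the single partition that owns the term.
    return index_partitions[index_partition_for(value)][f"{field}:{value}"]

index_document(306, {"color": "red", "make": "Ford"})
index_document(768, {"color": "silver", "make": "Ford"})
print(find("color", "red"))   # {306}, served entirely from index partition 0
print(find("make", "Ford"))   # {306, 768}, also from index partition 0
```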

References

Kleppmann, Martin. Designing Data-Intensive Applications. O’Reilly Media, 2017.
