Sharding in MongoDB

  • The word "shard" means "a small part of a whole". Sharding is therefore the process of dividing a large dataset into smaller parts. A database shard is a horizontal partition of the data in a database. The system keeps the shards on physically or logically separate hardware to spread the load and to improve performance and manageability.



What are the possible ways to partition the data horizontally, either logically or physically?




Let's try with the following sample data.

Assumption: Each row is treated as a document and each column as an attribute.

[Image: sample employee data, with attributes such as employeeId, department, and salary]


This can be achieved in the following three ways.



Zones Sharding: The first way is to shard on a categorical attribute of the documents, such as the department. This is very useful when we mostly retrieve data by category. For example, if country is an essential attribute of the documents in a collection, most queries will include the country name in their filter criteria along with other attributes.
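To make the idea concrete, here is a minimal Python sketch of category-based routing. It is only an illustration of the concept, not MongoDB's implementation: the shard names, the zone map, the fallback shard, and the sample employees are all made up for this example.

```python
# Hypothetical zone map: each department (a categorical attribute) is pinned to one shard.
ZONE_TO_SHARD = {
    "Engineering": "shard-a",
    "Sales": "shard-b",
    "HR": "shard-c",
}
DEFAULT_SHARD = "shard-a"  # assumption: fallback for departments without a zone


def route_by_zone(document: dict) -> str:
    """Return the shard that should hold the document, based on its department."""
    return ZONE_TO_SHARD.get(document.get("department"), DEFAULT_SHARD)


# Made-up sample documents, used only to exercise the routing function.
employees = [
    {"employeeId": 1, "department": "Engineering", "salary": 18000},
    {"employeeId": 2, "department": "Sales", "salary": 12000},
    {"employeeId": 3, "department": "HR", "salary": 16500},
]

# Map each employeeId to the shard chosen for it.
placement = {e["employeeId"]: route_by_zone(e) for e in employees}
print(placement)
```

A query that filters on department only needs to visit the one shard that holds that department.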

Ranged Sharding: This kind of sharding is based on ranges of an attribute's value, which may be discrete or continuous. In our example, every document has a salary. If we treat anyone earning more than 16000 as rich, then queries that filter on "rich" or "not rich", along with other attributes, map directly onto a salary range.
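A small Python sketch of range-based routing, using the 16000 threshold from the example. The shard names and the single boundary are assumptions chosen for illustration; a real deployment would typically have many more ranges.

```python
from bisect import bisect_left

# Inclusive upper bounds of every range except the last one.
# With a single boundary, salary <= 16000 is "not rich" and salary > 16000 is "rich".
RANGE_UPPER_BOUNDS = [16000]
RANGE_SHARDS = ["shard-not-rich", "shard-rich"]  # hypothetical shard names


def route_by_range(salary: int) -> str:
    """Return the shard whose salary range contains the given salary."""
    # bisect_left counts how many boundaries are strictly below the salary,
    # which is exactly the index of the range the salary falls into.
    return RANGE_SHARDS[bisect_left(RANGE_UPPER_BOUNDS, salary)]


print(route_by_range(12000))
print(route_by_range(16000))
print(route_by_range(16001))
```

Because neighbouring salaries land on the same shard, a range query such as "salary greater than 16000" only has to visit the shards covering that range.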

Hashed Sharding: In hashed sharding, the system applies a hash function to an attribute and distributes the data according to partitions of the hash value. In our example, employeeId is neither something we query by range nor a categorical attribute with a finite set of values, yet it is an essential part of our filter criteria along with other attributes.
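The following Python sketch shows the general idea of hash-based routing on employeeId. It is a simplification under stated assumptions: the number of shards is made up, and the modulo step stands in for the way a real system would assign ranges of hash values to shards.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3  # assumption: a small cluster for illustration


def route_by_hash(employee_id: int) -> int:
    """Return a shard index derived from a hash of the employeeId."""
    digest = hashlib.md5(str(employee_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned 64-bit integer.
    hash_value = int.from_bytes(digest[:8], "big")
    # Simplification: modulo instead of mapping hash ranges to shards.
    return hash_value % NUM_SHARDS


# Consecutive ids end up spread across shards rather than clustered together.
distribution = Counter(route_by_hash(i) for i in range(1000))
print(distribution)
```

The trade-off is that a lookup by exact employeeId goes to a single shard, while a range query over employeeId has to ask every shard.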




Advantages

 Increased read/write throughput: Multiple shards improve both read and write operation capacity.

 Increased storage capacity: Similarly, adding more shards increases the total storage capacity.

 High availability: Each shard is a replica set, so every piece of data is replicated. And because the data is distributed, even if an entire shard becomes unavailable, the database as a whole remains partially functional, serving reads and writes from the remaining shards.


Disadvantages

  Latency: Queries that must retrieve results from more than one shard can become very slow.

  Sorting issue: Data is indexed and sorted within a single shard, which gives optimal performance for local search and sorting. These local indexes do not help with cross-shard search or sorting queries, which can result in a slow response or no response at all.
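As a rough illustration of why cross-shard sorting costs more, the Python sketch below merges per-shard results that are each already sorted by salary. The shard names and documents are made up, and this is a sketch of the general scatter-gather pattern rather than a description of MongoDB's internals.

```python
import heapq
from itertools import islice

# Hypothetical per-shard results, each already sorted locally by salary.
shard_results = {
    "shard-a": [{"employeeId": 1, "salary": 12000}, {"employeeId": 4, "salary": 19000}],
    "shard-b": [{"employeeId": 2, "salary": 15000}, {"employeeId": 5, "salary": 16500}],
    "shard-c": [{"employeeId": 3, "salary": 11000}, {"employeeId": 6, "salary": 21000}],
}


def merged_sorted(results_by_shard: dict, key, limit: int) -> list:
    """Merge locally sorted shard results into one globally sorted list."""
    # heapq.merge lazily merges inputs that are already sorted by the same key.
    merged = heapq.merge(*results_by_shard.values(), key=key)
    return list(islice(merged, limit))


lowest_paid = merged_sorted(shard_results, key=lambda doc: doc["salary"], limit=4)
print([doc["employeeId"] for doc in lowest_paid])
```

Even in this small case, every shard has to answer before the first globally sorted row can be returned, so the slowest shard sets the response time.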

  Inconsistency and non-durability: A set of servers has more complex failure modes than a single server, and this often results in systems that do not guarantee cross-shard consistency or durability.

Conclusion

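A point of caution: we should evaluate current and upcoming use cases before choosing a sharding strategy, because changing it later is expensive. Older MongoDB versions do not allow re-sharding at all, and although MongoDB 5.0 and later can reshard a collection, it is a resource-intensive operation. I hope this article gives you a fair understanding of sharding.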


Your suggestions/comments are most welcome :).

Good Article Sandeep!!! Very well explained

Spot on , concept explained well with simple example.

Nice Article 🇮🇳 Sandeep Rawat , one more point I would like to add here is, while designing partitioning, data archiving requirements should be considered as well. Otherwise these become a cause of concern for performance.
