System Design Interviews - Hash based data distribution and Intro to Consistent Hashing

System Design Interviews - Hash based data distribution and Intro to Consistent Hashing

In this article, I will be covering following topics -

  • Hash based distribution of data among the database servers.
  • Limitations of simple hash based data distribution in case of addition/removal of database server.
  • Introduction to consistent hashing and how it can help counter the limitations of simple hash based data distribution.

Suppose we have 4 database servers namely S0, S1, S2, S3. Now we want to distribute set of usernames = ["Raj", "Kartik", "Kaivalya", "Karan", "Isha", "Megha"] among them.

One of the simple approach here is to use simple hash based data distribution , in this approach the server number assigned to a username will be computed as follows -

server number = hashfunc(username) % number_of_servers

Note - Hash function used should be deterministic i.e. For a specific information value, a hash function should continuously provide the same hash value.

Let us suppose following are the values given by our hashfunc for each username -

No alt text provided for this image

Now the distribution of usernames among servers will be as follows -

No alt text provided for this image

As our hash function used is deterministic, we can easily retrieve the data from correct server by using the same formulae

server number = hashfunc(username) % number_of_servers .

Question - But what if we scale up(add another database server) or scale down(remove/failure a database server) ?

Case 1 - Server goes down/removed

Lets suppose one of the server goes down, now number_of_servers = 3.

No alt text provided for this image


Case 2 - Adding a new database server S4 so now number_of_servers = 5.

No alt text provided for this image

Disadvantages -

  • You can observe that addition/removal of new server led to drastic data redistribution which is highly inefficient as in worst case all the data/usernames may need to be redistributed and that can be millions of records.
  • In the meantime of redistribution, there will be massive read misses, rendering system non-available.

Here comes the consistent hashing in picture.

Suppose the servers are placed like the orientation below -

For now just ignore what this ring is and how servers are placed on this ring and how data is distributed. Just note that -

No alt text provided for this image

Raj, Kartik are placed at S1.

Isha is placed at S2.

Karan, Kaivalya is placed at S3.

Megha is placed at S0.

Removal/Failure of Server

Now suppose server S1 goes down so what you will do is you will retrieve data placed at S1 and put the data on clockwise next server that is S2 in this case. So now it will look like follows -

No alt text provided for this image

So now Raj, Kartik, Isha are present on server S2, data on all other servers are untouched only S1 data is redistributed.

Addition of server

Now suppose in the initial configuration of 4 servers, one more server S4 is added b/w S2 and S3. So now username "kaivalya" will be stored to S4(clockwise next to s2) instead of S3.

No alt text provided for this image

So consistent hashing prevents drastic redistribution of data compared to simple hash based approach.

In next article we will take an example and see consistent hashing in depth-

1) How consistent hash ring is formed?

2) How data is stored in consistent hashing?

3) How replication are stored in consistent hashing?

4) What happens when we scale up/scale down?

5) How to distribute data in balanced way?

Until then if you like the article, put a like, repost the article and subscribe my newsletter.

Cheers

Kartik Sapra

#systemdesign #programming #amazon #microsoft #google #softwaredevelopement #softwaredev #faang #maang #techinterview #systemarchitect #systemarchitecture #interviewpreparation #technicalwriter #technicalwriting

To view or add a comment, sign in

More articles by Kartik S.

Others also viewed

Explore content categories