Raft - A Distributed Consensus Algorithm, and Its Application in Kafka as KRaft
Leslie Lamport famously described a distributed system as "one in which the failure of a computer you didn't even know existed can render your own computer unusable."
Making a system distributed helps increase availability, improve geographic proximity to users (and hence performance), and improve fault tolerance.
It also comes with its own headaches, chief among them coordination. Communication among nodes in a distributed system often involves broadcasting messages from one node to the others, which is fundamental to achieving coordination and consistency.
There are different types of broadcast algorithms - best-effort, reliable, FIFO, causal, and total order broadcast. Of these, total order broadcast provides the strongest ordering guarantee: every node delivers every message in the same order. It is commonly implemented with a leader-based mechanism, in which a single node sequences the messages sent out to the nodes in the system.
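The leader-based mechanism behind total order broadcast can be sketched in a few lines. This is an illustrative model, not Kafka's implementation: the leader stamps each message with a monotonically increasing sequence number, and followers buffer out-of-order messages until they can deliver strictly in sequence order.

```python
from dataclasses import dataclass, field

@dataclass
class Leader:
    """Sequencer: stamps every message with the next sequence number."""
    next_seq: int = 0

    def sequence(self, msg: str) -> tuple[int, str]:
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped

@dataclass
class Follower:
    """Delivers messages strictly in sequence order, buffering gaps."""
    expected: int = 0
    delivered: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)

    def receive(self, seq: int, msg: str) -> None:
        self.pending[seq] = msg
        # Deliver every consecutive message we now have
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

leader = Leader()
follower = Follower()
m1 = leader.sequence("a")
m2 = leader.sequence("b")
follower.receive(*m2)   # arrives out of order -> buffered
follower.receive(*m1)   # gap filled -> both delivered in order
print(follower.delivered)  # ['a', 'b']
```

Because all nodes apply the same sequence numbers, they all observe the same total order, which is exactly the guarantee a leader gives the system.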
One caveat, though: how do we manage the system when the leader node fails, whether through a crash or a loss of liveness?
In such cases, leader election comes into the picture
There are many consensus algorithms for effectively selecting a leader among a group of nodes - Paxos, Multi-Paxos, ZooKeeper Atomic Broadcast (ZAB), and, more recently, Raft.
In this article, we'll try to understand Raft as a consensus algorithm, and then take a look at KRaft - the new control plane implementation in Kafka - and how it uses Raft to elect the leader of its metadata partition.
Characteristics of the Raft consensus algorithm
As shown in the state transition diagram for Raft:
Start state - Follower. Every node begins a term as a follower, and also comes back up as a follower after crash recovery.
Intermediary state - Candidate. When a follower stops hearing from a leader, it transitions to the candidate state and asks the other nodes for votes.
After receiving votes from a quorum (a majority of nodes), it becomes the leader of the system and can then decide the order in which messages are consumed or delivered.
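The three transitions above can be sketched as a small state machine. This is a minimal, illustrative model of Raft's role transitions (method and field names are my own), not a full implementation:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    """Minimal sketch of Raft role transitions."""

    def __init__(self) -> None:
        self.role = Role.FOLLOWER   # every node starts (and restarts) as a follower
        self.term = 0

    def election_timeout(self) -> None:
        # No heartbeat from a leader: start a new term as a candidate
        self.term += 1
        self.role = Role.CANDIDATE

    def win_election(self, votes: int, cluster_size: int) -> None:
        # A candidate becomes leader only with a majority of votes
        if self.role is Role.CANDIDATE and votes > cluster_size // 2:
            self.role = Role.LEADER

    def see_higher_term(self, term: int) -> None:
        # Any node that observes a higher term steps back to follower
        if term > self.term:
            self.term = term
            self.role = Role.FOLLOWER

node = RaftNode()
node.election_timeout()          # follower -> candidate for term 1
node.win_election(3, 5)          # 3 of 5 votes is a majority
print(node.role)                 # Role.LEADER
```

Note the last transition: even a leader falls back to follower the moment it sees a higher term, which is how Raft prevents two leaders from coexisting across terms.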
Kafka as a case study for Raft
Kafka is a distributed log used for asynchronous message processing. Its architecture involves two major components - the control plane and the data plane.
Control Plane - Responsible for managing metadata for the Kafka cluster.
Data Plane - Responsible for replicating and managing data for the Kafka cluster.
In the new control plane implementation, called KRaft, Kafka removes the ZooKeeper dependency and performs metadata management using an internal single-partition topic called __cluster_metadata.
A small set of brokers in a Kafka cluster acts as control plane nodes - called controllers, forming a controller pool. The one controller that is the leader of this single-partition topic is called the active controller.
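For concreteness, a dedicated KRaft controller is typically configured with properties along these lines (`process.roles`, `controller.quorum.voters`, and `controller.listener.names` are real Kafka configuration keys; the hostnames and ids here are placeholders):

```properties
# Run this node as a dedicated controller (a combined node would use "broker,controller")
process.roles=controller
node.id=1
# The voter set for the metadata quorum, as id@host:port for each controller
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
listeners=CONTROLLER://controller1:9093
controller.listener.names=CONTROLLER
```

The controllers listed in `controller.quorum.voters` are exactly the nodes that replicate __cluster_metadata and vote in its leader elections.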
Note - there is no in-sync replica (ISR) maintenance for this single partition, since ISRs for data topics are themselves decided by the metadata plane; the metadata partition relies on Raft instead.
Hence, a leader crash is a crucial scenario: when the active controller fails, a voting mechanism is needed to decide the new leader.
In KRaft
Raft's notion of a term maps to the controller epoch: each election bumps the epoch, and messages from stale epochs are rejected.
On leader failure, a follower controller transitions to the candidate state and sends a VoteRequest (leader election request) to the other controllers.
The image below compares KRaft's VoteRequest with the voting requests of the original Raft algorithm:
When a candidate receives vote grants from a majority of the n nodes in the cluster - at least (n/2) + 1 of them, counting itself - it broadcasts its election as leader to the cluster.
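The vote-granting and quorum-counting logic can be sketched as follows. This is an illustrative model of the epoch-based voting rule (names are my own, and the real protocol also checks how up to date the candidate's log is, which is omitted here):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that forms a quorum: floor(n/2) + 1."""
    return cluster_size // 2 + 1

class Voter:
    """A controller's vote-granting rule: at most one vote per epoch,
    and never for a candidate from a stale epoch."""

    def __init__(self) -> None:
        self.current_epoch = 0
        self.voted_for = None

    def handle_vote_request(self, candidate_id: str, candidate_epoch: int) -> bool:
        if candidate_epoch < self.current_epoch:
            return False                  # stale candidate, reject
        if candidate_epoch > self.current_epoch:
            self.current_epoch = candidate_epoch
            self.voted_for = None         # new epoch, no vote cast yet
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True                   # grant the vote
        return False                      # already voted for someone else

# 5-node cluster: the candidate votes for itself, then polls the other 4
voters = [Voter() for _ in range(4)]
grants = 1 + sum(v.handle_vote_request("c1", 1) for v in voters)
print(grants >= majority(5))  # True -> "c1" may announce leadership
```

The one-vote-per-epoch rule is what makes a split vote safe: two candidates in the same epoch cannot both reach a majority, so at most one leader is elected per epoch.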
The addition of KRaft brings a significant improvement to Kafka's design: it removes a redundant external component (ZooKeeper) and makes the ecosystem more self-consistent with Kafka's own abstractions, gaining durability from the use of an internal topic.
This article tried to cover the implementation semantics at a high level, with the help of references I came across in a few courses.