Raft - A Distributed Consensus Algorithm, and Its Application in Kafka as KRaft
Leslie Lamport famously described a distributed system as "one in which the failure of a computer you didn't even know existed can render your own computer unusable."
Making a system distributed helps increase availability, improve geographic proximity to users (and hence performance), and improve fault tolerance.
It also comes with its own headaches, chief among them coordination. Communication among nodes in a distributed system often involves broadcasting messages from one node to the others, which is fundamental to achieving coordination and consistency.
There are different types of broadcast algorithms - best-effort, reliable, FIFO, causal, and total order broadcast. Of these, total order broadcast provides the strongest ordering guarantee: every node delivers every message in the same order. It is commonly implemented with a leader-based mechanism, in which a single node sequences the messages sent out to the nodes in the system.
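The leader-based mechanism behind total order broadcast can be sketched in a few lines. This is an illustrative model, not Kafka's implementation: the leader stamps each message with a monotonically increasing sequence number, and followers buffer out-of-order messages until they can deliver strictly in sequence order.

```python
from dataclasses import dataclass, field

@dataclass
class Leader:
    """Sequencer: stamps every message with the next sequence number."""
    next_seq: int = 0

    def sequence(self, msg: str) -> tuple[int, str]:
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped

@dataclass
class Follower:
    """Delivers messages strictly in sequence order, buffering gaps."""
    expected: int = 0
    delivered: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)

    def receive(self, seq: int, msg: str) -> None:
        self.pending[seq] = msg
        # Deliver every consecutive message we now have
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

leader = Leader()
follower = Follower()
m1 = leader.sequence("a")
m2 = leader.sequence("b")
follower.receive(*m2)   # arrives out of order -> buffered
follower.receive(*m1)   # gap filled -> both delivered in order
print(follower.delivered)  # ['a', 'b']
```

Because all nodes apply the same sequence numbers, they all observe the same total order, which is exactly the guarantee a leader gives the system.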
One caveat, though: how do we manage the system when the leader node fails, whether through a crash or a loss of liveness?
In such cases, leader election comes into the picture
There are many consensus algorithms for effectively selecting a leader among a group of nodes - Paxos, Multi-Paxos, ZooKeeper Atomic Broadcast (ZAB), and, more recently, Raft.
In this article, we'll try to understand Raft as a consensus algorithm, and then take a look at KRaft - the new control plane implementation in Kafka - and how it uses Raft to elect the leader of its metadata partition.
Characteristics of the Raft consensus algorithm
As shown in the state transition diagram for Raft:
Start state - Follower. Every node begins a term as a follower, and also comes back up as a follower after crash recovery.
Intermediary state - Candidate. When a follower stops hearing from a leader, it transitions to the candidate state and asks the other nodes for votes.
After receiving votes from a quorum (a majority of nodes), it becomes the leader of the system and can then decide the order in which messages are consumed or delivered.
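The three transitions above can be sketched as a small state machine. This is a minimal, illustrative model of Raft's role transitions (method and field names are my own), not a full implementation:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    """Minimal sketch of Raft role transitions."""

    def __init__(self) -> None:
        self.role = Role.FOLLOWER   # every node starts (and restarts) as a follower
        self.term = 0

    def election_timeout(self) -> None:
        # No heartbeat from a leader: start a new term as a candidate
        self.term += 1
        self.role = Role.CANDIDATE

    def win_election(self, votes: int, cluster_size: int) -> None:
        # A candidate becomes leader only with a majority of votes
        if self.role is Role.CANDIDATE and votes > cluster_size // 2:
            self.role = Role.LEADER

    def see_higher_term(self, term: int) -> None:
        # Any node that observes a higher term steps back to follower
        if term > self.term:
            self.term = term
            self.role = Role.FOLLOWER

node = RaftNode()
node.election_timeout()          # follower -> candidate for term 1
node.win_election(3, 5)          # 3 of 5 votes is a majority
print(node.role)                 # Role.LEADER
```

Note the last transition: even a leader falls back to follower the moment it sees a higher term, which is how Raft prevents two leaders from coexisting across terms.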
Kafka as a case study for Raft
Kafka is a distributed log used for asynchronous message processing. Its architecture involves two major components - the control plane and the data plane.
Control Plane - Responsible for managing metadata for the Kafka cluster.
Data Plane - Responsible for replicating and managing data for the Kafka cluster.
In the new control plane implementation, called KRaft, Kafka removes the ZooKeeper dependency and performs metadata management using an internal single-partition topic called __cluster_metadata.
A small set of brokers in a Kafka cluster acts as control plane nodes - called controllers, forming a controller pool. The one controller that is the leader of this single-partition topic is called the active controller.
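For concreteness, a dedicated KRaft controller is typically configured with properties along these lines (`process.roles`, `controller.quorum.voters`, and `controller.listener.names` are real Kafka configuration keys; the hostnames and ids here are placeholders):

```properties
# Run this node as a dedicated controller (a combined node would use "broker,controller")
process.roles=controller
node.id=1
# The voter set for the metadata quorum, as id@host:port for each controller
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
listeners=CONTROLLER://controller1:9093
controller.listener.names=CONTROLLER
```

The controllers listed in `controller.quorum.voters` are exactly the nodes that replicate __cluster_metadata and vote in its leader elections.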
Note - there is no in-sync replica (ISR) maintenance for this single partition, since ISRs for data topics are themselves decided by the metadata plane; the metadata partition relies on Raft instead.
Hence, a leader crash is a crucial scenario: when the active controller fails, a voting mechanism is needed to decide the new leader.
In KRaft
Raft's notion of a term maps to the controller epoch: each election bumps the epoch, and messages from stale epochs are rejected.
On leader failure, a follower controller transitions to the candidate state and sends a VoteRequest (leader election request) to the other controllers.
The image below compares KRaft's VoteRequest with the voting requests of the original Raft algorithm:
When a candidate receives vote grants from a majority of the n nodes in the cluster - at least (n/2) + 1 of them, counting itself - it broadcasts its election as leader to the cluster.
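The vote-granting and quorum-counting logic can be sketched as follows. This is an illustrative model of the epoch-based voting rule (names are my own, and the real protocol also checks how up to date the candidate's log is, which is omitted here):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that forms a quorum: floor(n/2) + 1."""
    return cluster_size // 2 + 1

class Voter:
    """A controller's vote-granting rule: at most one vote per epoch,
    and never for a candidate from a stale epoch."""

    def __init__(self) -> None:
        self.current_epoch = 0
        self.voted_for = None

    def handle_vote_request(self, candidate_id: str, candidate_epoch: int) -> bool:
        if candidate_epoch < self.current_epoch:
            return False                  # stale candidate, reject
        if candidate_epoch > self.current_epoch:
            self.current_epoch = candidate_epoch
            self.voted_for = None         # new epoch, no vote cast yet
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True                   # grant the vote
        return False                      # already voted for someone else

# 5-node cluster: the candidate votes for itself, then polls the other 4
voters = [Voter() for _ in range(4)]
grants = 1 + sum(v.handle_vote_request("c1", 1) for v in voters)
print(grants >= majority(5))  # True -> "c1" may announce leadership
```

The one-vote-per-epoch rule is what makes a split vote safe: two candidates in the same epoch cannot both reach a majority, so at most one leader is elected per epoch.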
The addition of KRaft brings a significant improvement to Kafka's design: it removes a redundant external component (ZooKeeper) and makes the ecosystem more self-consistent with Kafka's own abstractions, gaining durability from the use of an internal topic.
This article tried to cover the implementation semantics at a high level, with the help of references I came across in a few courses.