If you are working with modern, large-scale distributed systems, Apache Kafka is almost unavoidable. Originally detailed in a 2011 paper by engineers at LinkedIn, Kafka fundamentally changed how we handle massive streams of data.
Here is a breakdown of Kafka’s architecture from a system design perspective.
1. The Problem Kafka Solved
Before Kafka, data integration was a point-to-point mess. Every data source (databases, app logs) was directly wired to every destination (data warehouses, monitoring dashboards). Traditional message brokers (like RabbitMQ) delete messages once they are consumed and track per-message delivery state on the broker, a model that buckles under the massive throughput required for web-scale event tracking.
The Solution: a unified, high-throughput, decoupled platform built on a fundamentally different abstraction: the Distributed Commit Log. Kafka appends immutable records sequentially to the end of a log file and leans on the OS page cache, making disk reads and writes extremely fast.
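As a mental model, the commit log fits in a few lines of Java. This is an illustration of the abstraction only, not Kafka's actual storage code:

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only commit log (illustration only): records are immutable,
// appended at the end, and addressed by a monotonically increasing offset.
public class CommitLog {
    private final List<byte[]> records = new ArrayList<>();

    // Append returns the offset assigned to the new record.
    long append(byte[] record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads never mutate the log; each consumer just remembers its own offset.
    byte[] read(long offset) {
        return records.get((int) offset);
    }

    public static void main(String[] args) {
        CommitLog log = new CommitLog();
        long first = log.append("user-42 clicked /home".getBytes());
        long second = log.append("user-42 clicked /cart".getBytes());
        System.out.println(first + ", " + second); // 0, 1: strictly sequential
    }
}
```

The real log is persisted to disk as segment files, but the contract is the same: writes only append, reads are by offset, and nothing is modified in place.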
2. Core Architecture: The Basics
Kafka's scalability relies on a few core primitives:
- Topics & Partitions: A topic (e.g., user_clicks) is the logical category for data; physically, it is split into Partitions, which are distributed across different servers (brokers). Data inside a partition is strictly ordered, and each record is assigned a sequential ID called an offset.
- Producers & Consumers: Producers write data to topics (often hashing a key like user_id to ensure related events hit the same partition). Consumers read data, tracking their place using the offset.
- Consumer Groups: This is how Kafka scales consumption. A consumer group is a team of consumers reading a topic. The Golden Rule: within a single group, each partition is assigned to exactly one consumer. If you add more consumers than partitions, the extra consumers sit idle. (A minimal producer/consumer sketch follows this list.)
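Here is a minimal sketch using Kafka's Java client. The broker address and the group ID clicks-dashboard are assumptions for illustration; the topic name user_clicks comes from the example above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickPipeline {
    public static void main(String[] args) {
        // Producer: keying by user_id means all of a user's events hash
        // to the same partition, preserving per-user ordering.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("user_clicks", "user-42", "clicked:/home"));
        }

        // Consumer: every instance started with the same group.id joins one
        // group, and Kafka assigns each partition to exactly one of them.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "clicks-dashboard"); // hypothetical consumer group
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("user_clicks"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}
```

Starting a second copy of this consumer with the same group.id triggers a rebalance that splits the partitions between the two instances, which is the entire scaling mechanism.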
3. Data Management: Compaction and Tombstones
Unlike traditional queues, Kafka retains messages for a configured retention period (e.g., 7 days) or until a partition exceeds a size limit (e.g., 50 GB), after which the oldest log segments are deleted.
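For illustration, here is roughly how those limits map to topic configs, set here via the Java AdminClient (the broker address and the exact values are assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateRetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Time- and size-based retention: whichever limit is hit first
            // triggers deletion of the oldest log segments.
            NewTopic topic = new NewTopic("user_clicks", 3, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",     // 7 days
                            "retention.bytes", "53687091200" // 50 GB per partition
                    ));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```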
For scenarios where you only care about the latest state (like streaming database updates), Kafka uses Log Compaction. A background thread removes older records that share the same key as a newer record. To delete a record entirely, producers send a Tombstone—a message with the target key but a null payload. The compactor uses this as a delete marker to eventually scrub the key from the system.
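Here is a short sketch of what a tombstone looks like from the producer side, assuming a hypothetical user_profiles topic created with cleanup.policy=compact:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Assumes "user_profiles" was created with cleanup.policy=compact,
        // so compaction keeps only the latest record per key.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // Normal update: the latest value for the key survives compaction.
            producer.send(new ProducerRecord<>("user_profiles", "user-42", "{\"plan\":\"pro\"}"));
            // Tombstone: same key, null value. The compactor treats this as
            // a delete marker and eventually scrubs the key from the log.
            producer.send(new ProducerRecord<>("user_profiles", "user-42", null));
        }
    }
}
```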
4. High Availability and Fault Tolerance
Kafka survives broker crashes through replication, configured by a Replication Factor.
- Leader and Followers: One broker is the Leader for a partition, handling all reads and writes. The others are Followers, passively replicating the data.
- The ISR (In-Sync Replicas): Kafka tracks which followers are actively keeping up with the leader. Only brokers in the ISR are eligible to become the new leader if the current one crashes, preventing data loss. (A producer-side sketch of how these guarantees surface in configuration follows this list.)
- The Shift to KRaft: Historically, Kafka relied on an external system, Apache ZooKeeper, to manage metadata and leader elections. This became a bottleneck at scale. Modern Kafka uses KRaft (Kafka Raft), integrating a consensus protocol directly into Kafka. Metadata is now stored in an internal Kafka topic, allowing near-instant failover and a design that targets millions of partitions.
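On the client side, replication guarantees surface in producer configuration. A sketch, assuming the topic was created with replication factor 3 (as in the earlier AdminClient example) and min.insync.replicas=2 set on the topic or broker:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducer {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // acks=all: the leader waits for every in-sync replica before
        // acknowledging. Combined with min.insync.replicas=2 on the topic,
        // an acked write is guaranteed to exist on at least two brokers.
        p.put("acks", "all");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("user_clicks", "user-42", "clicked:/checkout"));
        }
    }
}
```

With this setup, a write fails outright if fewer than two replicas are alive: a deliberate trade of availability for durability.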
5. Exactly-Once Semantics (EOS)
In a distributed system, network timeouts are inevitable. If a producer's message is written, but the network drops the acknowledgment (ack), the producer must retry, potentially creating duplicate data.
Kafka solves this through three mechanisms (a combined code sketch follows the list):
- The Idempotent Producer: Kafka assigns producers a unique ID and requires them to send sequence numbers with messages. If a broker receives a duplicate sequence number due to a retry, it silently drops the duplicate but sends back a successful ack.
- Transactions: For complex "Read-Process-Write" loops, Kafka uses a Transaction Coordinator and a Transaction Log (similar to a two-phase commit). It writes a "Commit Marker" to the topic only when the entire transaction succeeds.
- Read-Committed Consumers: Consumers configured with isolation.level=read_committed will not read past an open transaction; they wait for the commit or abort marker before those records are delivered.
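Putting the three pieces together, here is a sketch of a transactional read-process-write loop with the Java client. The orders and processed_orders topics, the group ID, and the transactional ID are illustrative, not from the original post:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoop {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        pp.put("transactional.id", "orders-processor-1"); // stable ID; implies idempotence
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "orders-processor");
        cp.put("enable.auto.commit", "false");       // offsets commit inside the transaction
        cp.put("isolation.level", "read_committed"); // never deliver uncommitted records
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            producer.initTransactions();             // registers with the coordinator
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("processed_orders", r.key(), r.value()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1)); // next offset to read
                    }
                    // Commit consumed offsets atomically with the output writes,
                    // so "read" and "write" succeed or fail together.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();    // writes the commit marker
                } catch (KafkaException e) {
                    producer.abortTransaction();     // read_committed consumers skip these
                }
            }
        }
    }
}
```

The failure path is simplified: a producer fenced by a newer instance using the same transactional.id should close rather than abort, but the shape of the loop is the important part.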
Kafka’s design is a masterclass in leveraging simple file system mechanics and clever distributed consensus to solve incredibly complex data problems at scale.
- Kreps, J., Narkhede, N., & Rao, J. (2011). "Kafka: a Distributed Messaging System for Log Processing." Proceedings of the 6th International Workshop on Networking Meets Databases (NetDB '11).
- Mehta, A., & Gustafson, J. (2017). "KIP-98 - Exactly Once Delivery and Transactional Messaging." Apache Kafka Wiki.
- McCabe, C. (2019). "KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum." Apache Kafka Wiki.
- Apache Software Foundation. (2024). "Apache Kafka Documentation." Apache Kafka Official Website.
- Sen, G. System Design Course.
Really clean breakdown of Kafka's core concepts. The part about Exactly-Once Semantics is what most tutorials skip completely. Duplicate messages caused by network retries are a real production problem, and most engineers only discover it after something goes wrong in their payment or order system. Idempotent producers solve it cleanly without any extra complexity on the consumer side.