APACHE KAFKA
What is Apache Kafka:
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming applications. It uses non-destructive consumer semantics (reading a message does not remove it from the log) and is designed to handle large volumes of data with high-throughput, low-latency, fault-tolerant messaging.
In simple words – Kafka is a messaging system that can receive, store, and emit a large number of messages in a short amount of time in a highly reliable manner.
What problems does Kafka solve:
Apache Kafka addresses scalability, throughput, and fault-tolerance challenges through a distributed architecture built on efficient partitioning and replication. It minimizes latency for real-time data processing and offers durable storage, allowing organizations to handle high volumes of data with low latency and to keep operating in the face of failures. Additionally, Kafka provides built-in stream processing capabilities (Kafka Streams), simplifying real-time data analytics and enabling seamless integration into modern data architectures.
Components of Kafka:
Producer: Applications that send data (messages) to Kafka topics. Producers publish records to one or more Kafka topics.
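As an illustration, here is a minimal producer sketch using Kafka's Java client. The broker address localhost:9092, the topic name "orders", and the key/value strings are assumptions for the example, not part of the original article:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one record to the (assumed) "orders" topic and flush before exiting.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "table-7", "burger"));
            producer.flush();
        }
    }
}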
Consumer: Applications that read data from Kafka topics. Consumers subscribe to one or more topics and process the records produced to those topics.
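A matching consumer sketch, under the same assumptions (broker at localhost:9092, topic "orders"; the group id "order-processors" is also made up for the example):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-processors");        // consumers sharing this id form one group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // assumed topic
            while (true) {
                // poll() fetches the next batch of records assigned to this consumer
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s = %s%n", record.key(), record.value());
                }
            }
        }
    }
}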
Broker: Kafka runs as a cluster of one or more servers, each of which is called a broker. Brokers are responsible for storing and serving the data, as well as handling client requests.
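You can inspect the brokers in a cluster with Kafka's AdminClient; a small sketch, again assuming a broker at localhost:9092:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;
import java.util.Properties;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() returns futures; get() blocks until metadata arrives
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("Broker %d at %s:%d%n", broker.id(), broker.host(), broker.port());
            }
        }
    }
}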
Topic: A category or feed name to which records are published by producers. Topics in Kafka are divided into partitions for scalability.
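Topics can be created with the AdminClient as well. In this sketch the topic name "orders", 3 partitions, and a replication factor of 2 are example values chosen for illustration:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 2 for redundancy (example values)
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}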
Partition: Each topic can be split into partitions, which are ordered and immutable sequences of records. Partitions allow data to be distributed across multiple brokers.
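Continuing the producer sketch above (the producer object and "orders" topic are the same assumptions): with the default partitioner, records carrying the same key are hashed to the same partition, which preserves per-key ordering.

// Same key ("table-7") -> same partition, so this table's orders stay in order.
producer.send(new ProducerRecord<>("orders", "table-7", "burger"));
producer.send(new ProducerRecord<>("orders", "table-7", "fries"));
// A different key may hash to a different partition.
producer.send(new ProducerRecord<>("orders", "table-9", "salad"));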
Offset: Each record within a partition is assigned a unique identifier called an offset, which represents its position in the partition.
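Inside the poll loop of the consumer sketch above, each record exposes its partition and offset directly:

for (ConsumerRecord<String, String> record : records) {
    // The offset is the record's position within its partition.
    System.out.printf("partition=%d offset=%d value=%s%n",
            record.partition(), record.offset(), record.value());
}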
Replica: Kafka maintains redundancy and fault tolerance through replicas. Each partition can have multiple replicas, which are copies of the data stored on different brokers.
Consumer Group: Consumers are organized into consumer groups. Each consumer group contains one or more consumers that jointly consume all the partitions of the topics they subscribe to; within a group, each partition is read by exactly one consumer at a time.
ZooKeeper: Kafka traditionally uses ZooKeeper for managing and coordinating the Kafka brokers. It stores metadata about the cluster, such as broker information, topic configuration, and partition assignment. (Newer Kafka versions can replace ZooKeeper with the built-in KRaft consensus mode.)
Connectors: Kafka Connect is a framework for connecting Kafka with external systems such as databases, message queues, and file systems. Connectors are plugins that enable the integration of Kafka with various data sources and sinks.
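As a sketch, Kafka ships with a simple file source connector that can be run with the standalone Connect worker. The file path and topic name below are assumptions adapted from the sample config that ships with Kafka:

# file-source.properties (example; Kafka ships a similar sample config)
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/orders.txt
topic=orders

This would be launched with: bin/connect-standalone.sh config/connect-standalone.properties file-source.properties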
Understanding Kafka with a simple example:
Here is a simple explanation of Kafka using a restaurant analogy:
Understanding how Kafka works at a high level:
1. A restaurant has a menu with sections. Similarly, Kafka has topics and partitions.
2. A dine-in customer walks in and the server hands them the menu. The customer reads through the menu and places order(s), which the server notes down under an order number. The server noting down the order is similar to a consumer subscribing to a particular topic/partition in Kafka.
3. Chefs prepare dishes that belong to a menu section and place them on the kitchen counter. Similarly, producers publish messages that belong to a topic/partition and store them on a broker (a Kafka instance).
4. The server checks the kitchen counter to see which dishes are ready and serves them to the customer. Once all the ordered dishes are served, the server marks the order number as completed. This is equivalent to the consumer committing its offset after a batch of messages has been processed and delivered (see the sketch after this section).
The reliability of the above process is taken care of by replicas, ZooKeeper, consumer groups, and connectors.
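To tie the analogy together, here is a minimal "server" sketch with manual offset commits: the server (consumer) serves the ready dishes and only then marks the order numbers (offsets) as completed. The broker address, topic name, and group id are assumptions as before:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderServer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "servers");                 // the wait staff form one consumer group
        props.put("enable.auto.commit", "false");         // we mark orders complete ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // watch the kitchen counter
            while (true) {
                ConsumerRecords<String, String> dishes = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> dish : dishes) {
                    System.out.printf("Serving %s (order #%d)%n", dish.value(), dish.offset());
                }
                if (!dishes.isEmpty()) {
                    consumer.commitSync(); // mark this batch of orders as completed
                }
            }
        }
    }
}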