Getting started with Apache Kafka
To start with, let us understand what Apache Kafka is.
Apache Kafka is a distributed streaming platform with three core capabilities:
- Messaging System
- Store Stream with fault tolerance
- Process the stream data
Let us take a moment to understand each of them.
Messaging System:
It is a message bus built for high-ingress data. It gives applications access to published data and, if needed, lets them replay it, i.e. applications can process, persist, and re-process streamed data.
We can divide a large project into small microservices and use Kafka to communicate between them.
Store Stream with fault tolerance:
Since Kafka is a distributed system, the data can be replicated across different brokers using the replication factor. If we set the replication factor to 3, we can tolerate 2 broker failures; in general, the cluster tolerates replication_factor - 1 failures.
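The formula above is simple arithmetic; a quick shell sketch (3 is just the example value used here):

```shell
# Tolerable broker failures = replication_factor - 1
replication_factor=3
tolerated_failures=$((replication_factor - 1))
echo "With replication_factor=${replication_factor}, we can tolerate ${tolerated_failures} broker failure(s)"
```

So a replication factor of 1 means no fault tolerance at all, which is why production topics typically use 3.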
Process the stream data:
Kafka provides a Streams API for data processing, which I will cover in a future article.
Basic concept of Kafka:
Topic: A unique name for a feed of records
Record: The smallest unit of data, made up of a key, a value, and a timestamp
Partition: An ordered, immutable sequence of records
Offset: A sequential ID assigned to each record within a partition
Broker: A node in the distributed system that forms the Kafka cluster
Broker ID: A unique identifier assigned to each broker
There is a lot more, such as leader, group.id, replication factor, etc., which I will cover in another article.
Installation and configuration of a 3-node cluster. Since I am using a Mac I will show the commands on macOS; they should not differ much on Linux.
- wget http://apachemirror.wuchna.com/kafka/2.5.0/kafka_2.12-2.5.0.tgz (the URL will differ depending on version you want to install)
- tar -xvf kafka_2.12-2.5.0.tgz
- cd kafka_2.12-2.5.0
- bin/zookeeper-server-start.sh config/zookeeper.properties
Create 3 copies of the Kafka server configuration by copying config/server.properties to config/server1.properties and config/server2.properties, then in each copy:
- change broker.id (increment the integer by 1)
- change the listener port (increment by 1, e.g. 9093 and 9094)
- also good to change log.dirs so each broker writes to its own directory
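The three edits above can be scripted with sed. A minimal sketch, working on a stripped-down stand-in for config/server.properties (the real file has many more settings, but these three lines are the ones we touch):

```shell
# Stand-in for config/server.properties with the stock defaults
cat > server.properties <<'EOF'
broker.id=0
listeners=PLAINTEXT://:9092
log.dirs=/tmp/kafka-logs
EOF

# Derive server1.properties and server2.properties:
# bump broker.id, the listener port, and the log directory for each copy
for i in 1 2; do
  sed -e "s/^broker.id=0/broker.id=${i}/" \
      -e "s/:9092/:$((9092 + i))/" \
      -e "s#/tmp/kafka-logs#/tmp/kafka-logs-${i}#" \
      server.properties > "server${i}.properties"
done

# Show the per-broker differences
grep -HE 'broker.id|listeners|log.dirs' server1.properties server2.properties
```

Each broker needs its own id, port, and log directory because all three processes run on the same machine here; on separate machines only broker.id would have to differ.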
Then start all 3 servers as follows:
- bin/kafka-server-start.sh config/server.properties
- bin/kafka-server-start.sh config/server1.properties
- bin/kafka-server-start.sh config/server2.properties
Cool, we have the servers up and running. I recommend looking at all the scripts in the bin folder; they are the best tools for managing your Kafka cluster.
Let's create a topic:
We can use the tool bin/kafka-topics.sh to create, list, describe, etc. the topics.
Command: bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic fault_tolerated_topic --partitions 3 --replication-factor 3
--create is the option to create a topic, and --topic <name of the feed> names it
--partitions tells how many partitions the data will be split across
--replication-factor defines the level of fault tolerance
Awesome, we just created our first topic. Let's see its details with the help of the --describe option:
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic fault_tolerated_topic
Topic: fault_tolerated_topic PartitionCount: 3 ReplicationFactor: 3 Configs:
Topic: fault_tolerated_topic Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
Topic: fault_tolerated_topic Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
Topic: fault_tolerated_topic Partition: 2 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
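Reading that output: each partition has one leader broker, a list of replicas, and the Isr (in-sync replicas) set. A quick way to see how leadership is spread across brokers, here parsing the sample output shown above:

```shell
# The --describe output from above, saved for parsing
cat > describe.txt <<'EOF'
Topic: fault_tolerated_topic Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
Topic: fault_tolerated_topic Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
Topic: fault_tolerated_topic Partition: 2 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
EOF

# For each partition line, pull out the value following "Partition:" and "Leader:"
awk '{ for (i = 1; i <= NF; i++) {
         if ($i == "Partition:") part = $(i+1);
         if ($i == "Leader:") leader = $(i+1);
       }
       print "partition " part " is led by broker " leader }' describe.txt
```

Notice that each of the three brokers leads exactly one partition; Kafka spreads leadership so that no single broker carries all the read/write load for a topic.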
All right, that is a lot to grasp. I will continue from here in my next article on Kafka; for the time being, keep playing with your Kafka setup :)