Kafka for Data Engineers
Kafka is a prominent queuing system and one of the most widely used technologies in streaming solutions. In most of my streaming use cases at Walmart, I use Kafka either to consume records from it or to produce records to it.
But what makes Kafka so widely used, and how is it different from a traditional queuing system?
In my working experience, Kafka has two big advantages.
Using Kafka is a good way to eliminate impedance mismatches between producers and consumers. Each one can proceed at its own pace without impacting other parts of the ecosystem, making it robust enterprise infrastructure software. Interactions via Kafka are not point-to-point, which allows true decoupling of systems. It follows the publisher-subscriber model rather than a simple producer-consumer model.
Here the producer can keep producing data at its own pace, and different consumers can keep consuming by subscribing to the topic.
We can retain the data in the Kafka cluster for a certain period of time.
Data or records in the Kafka cluster are not lost even after some consumers have consumed those records.
We can always add new consumers or remove old consumers.
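The decoupling described above can be sketched with a toy in-memory log. This is not real Kafka and the class names are made up for illustration; the point is that each consumer tracks its own offset, so records survive being read and a consumer added later still sees everything retained:

```python
class ToyTopic:
    """A toy append-only log mimicking one Kafka topic partition."""
    def __init__(self):
        self.records = []          # retained records; reading never removes them

    def produce(self, record):
        self.records.append(record)

class ToyConsumer:
    """Each consumer keeps its own offset, independent of all others."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0            # this consumer's position in the log

    def poll(self):
        batch = self.topic.records[self.offset:]
        self.offset = len(self.topic.records)
        return batch

topic = ToyTopic()
topic.produce("order-1")
topic.produce("order-2")

fast = ToyConsumer(topic)
print(fast.poll())                 # ['order-1', 'order-2']

# A consumer added later still sees all retained records,
# without the producer or the other consumer noticing.
late = ToyConsumer(topic)
print(late.poll())                 # ['order-1', 'order-2']
```

Because reads only advance a per-consumer offset, removing or adding consumers never touches the data itself, which is exactly the decoupling Kafka's pub-sub model gives you.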
So it is this ability to retain data while decoupling the producer and consumer that was a major breakthrough in the big data industry.
But, Ankur, before understanding the advantages, we should understand what Kafka is. Am I right😅?
Yup, you are right. Let me try to break it down for you.
According to Confluent (Kafka's biggest service provider):
Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale. Originally created to handle real-time data feeds at LinkedIn in 2011, Kafka quickly evolved from a messaging queue to a full-fledged event streaming platform capable of handling over 1 million messages per second, or trillions of messages per day.
Ohh again Ankur, too many technical terms😁. How do Data Engineers use Kafka? Will you please explain it in simpler terms?
Ohh ok. I suppose you have read my previous articles on Spark Streaming, where I used a console or terminal to produce streaming data for demo purposes. If you have not read them, I highly recommend you do.
Please find the links below.
So in our previous example, we were trying to read data from a socket using a terminal for learning purposes.
Here, buffering capability means that if a producer produces data faster than a consumer can consume it, the system should be able to hold or buffer that data.
Kafka provides this buffering capability as a queuing system: it can retain data for a configurable period of time, with a default of 7 days in most setups.
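A rough sketch of that time-based retention, in plain Python (a simplification for illustration: real Kafka deletes whole log segments rather than individual records, and the window is controlled by the topic config `retention.ms`, which defaults to 7 days):

```python
import time

RETENTION_SECONDS = 7 * 24 * 60 * 60   # mirrors Kafka's default 7-day retention

class RetainedLog:
    """Toy log that keeps records until they age past the retention window."""
    def __init__(self, retention=RETENTION_SECONDS):
        self.retention = retention
        self.entries = []                # list of (timestamp, record)

    def produce(self, record, now=None):
        now = time.time() if now is None else now
        self.entries.append((now, record))

    def purge(self, now=None):
        """Drop records older than the retention window.

        Real Kafka does this lazily, per log segment, in the background.
        """
        now = time.time() if now is None else now
        cutoff = now - self.retention
        self.entries = [(t, r) for (t, r) in self.entries if t >= cutoff]

log = RetainedLog()
log.produce("old", now=0)
log.produce("fresh", now=604800)         # produced exactly 7 days later
log.purge(now=604801)                    # one second past "old"'s window
print([r for _, r in log.entries])       # ['fresh']
```

The key property is that purging depends only on age, not on whether anyone has consumed a record, which is why consumers can re-read data within the retention window.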
Kafka helps decouple the producer and consumer, which means they can work independently. They do not need to be in sync at all times, but it is good practice to keep the consumer close to the producer's pace, because otherwise consumer lag builds up for the use case.
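Consumer lag is easy to reason about numerically: for each partition it is the log-end offset (where the producer has written up to) minus the offset the consumer group has committed. A small sketch with made-up offsets for a hypothetical 3-partition topic:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = latest offset in the log minus the committed offset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Hypothetical offsets for a 3-partition topic.
log_end = {0: 1500, 1: 1490, 2: 1510}      # where producers have written to
committed = {0: 1500, 1: 1200, 2: 1505}    # where the consumer group has read to

print(consumer_lag(log_end, committed))    # {0: 0, 1: 290, 2: 5}
```

In practice you would read these numbers from tooling such as `kafka-consumer-groups.sh` rather than compute them by hand; a steadily growing lag is the signal that the consumer is falling behind the producer.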
Kafka also has stream-processing capabilities of its own, but we will not talk about them right now. For now, we are using Spark Streaming or Spark Structured Streaming for all our processing work.
So, we have now understood that Kafka is a publisher-subscriber system. Let me draw one diagram for you.
You can see the publisher-subscriber flow in the diagram above.
We have read a lot about Kafka's definition and architecture so far. Now let's look at some important Kafka terms that a Data Engineer should understand.
I hope you are now clear about the basic fundamentals of Kafka. Let's meet in the next section, where we will try to cover the following things.
I have already written an article on managing disasters for Kafka. Please go through this link if you are planning to implement Disaster Recovery for your Kafka-based system and push uptime above 99%.
Feel free to subscribe to my YouTube channel, i.e. The Big Data Show. I might upload a more detailed discussion of the above concepts in the coming days.
More so, thank you for the most precious gift you can give me as a writer, i.e. your time.
The number of partitions for a topic can be changed after creation, but it can only be increased, never decreased.
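One consequence worth noting: Kafka assigns a keyed record to a partition by hashing the key modulo the partition count (the real default partitioner uses murmur2; the crc32 below is just a deterministic stand-in for illustration). Increasing the partition count can therefore route an existing key to a different partition, breaking per-key ordering for records produced after the change:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner (which uses murmur2, not crc32).
    return zlib.crc32(key.encode()) % num_partitions

key = "user-42"
before = partition_for(key, 3)   # partition while the topic has 3 partitions
after = partition_for(key, 6)    # partition after increasing to 6

# While the partition count is fixed, the same key always lands
# on the same partition, which is what preserves per-key ordering.
print(before, after)
```

This is why partition counts are usually chosen with headroom up front: the change is one-way, and even the allowed direction (increasing) can reshuffle where keys land.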