Introduction to Apache Kafka
This article is aimed at project leads, developers, admins, and architects who want to understand what Kafka is, what problems it solves, and whether it is a fit for their organization.
I am Debu. I have been working in the field of Big Data for about five years now. I started my career at Accenture on a data-warehousing team. After that, I was involved in Big Data initiatives using Hadoop at Bank of America.
While working at BoA, it became apparent that relational databases would not be able to provide a viable solution for handling Big Data with its three Vs (volume, velocity, and variety). That's when I started working on integrating other systems like Hadoop with my team's production systems.
Now I work at V12 Group Inc. as the Project Lead on their Data Science initiative and am architecting their batch and real-time data processing pipeline.
So what is Kafka?
Kafka is a widely popular distributed messaging system, and in my opinion an amazing one. There are other popular messaging systems on the market, such as RabbitMQ and ActiveMQ. In this article we will discuss Kafka's advantages, why we need it, and the problems it can be used to solve.
Let’s take a look at the origin of Kafka:
Jay Kreps, now the CEO of Confluent.io, once had a project to get a working Hadoop setup at LinkedIn. Lacking experience in the area at the time, he budgeted a few weeks for getting data in and out of the cluster and the rest of the time for implementing fancy algorithms. He then found that getting data in and out of the Hadoop cluster was the more challenging problem.
I have faced a similar situation in my projects again and again. At the enterprise level, data sits in many different sources:
- OLTP
- Web Servers
- Applications
- Monitoring Systems
- Search Systems
- Click Streams
- Ops Metrics
- Security systems
and so on, and companies want to pull all this data into Hadoop. To achieve this, we would have to write custom code to get data out of the OLTP database or use Sqoop, and for the web servers we might set up FTP transfers or use Flume. There is a different plugin or connector for getting data from each source into Hadoop. To me the need seems obvious, because ingesting data from various sources into Hadoop is the most important part of starting a Big Data initiative in an organization, but to many businesses it looks like a daunting and confusing task.
On top of that, the clickstream tracking system may be exchanging data with the OLTP system, while the monitoring system exchanges data with both the clickstream tracking system and the security system. This point-to-point mesh makes it very complicated to integrate a new information source.
This integration burden is a key limiting factor in the adoption of Hadoop and other Big Data technologies in organizations.
Using Kafka, we can greatly simplify this cross-system data collection and exchange.
Kafka provides a simple, robust, fast, highly available, and highly scalable publish-subscribe (pub-sub) messaging system, which in turn helps solve the bigger problem of communication and integration among an organization's various data systems.
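To make the pub-sub model concrete, here is a minimal sketch of publishing an event with Kafka's Java producer client. The broker address (localhost:9092) and the topic name (user-events) are assumptions for illustration, not anything from a real setup.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources flushes and closes the producer on exit.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to a hypothetical "user-events" topic; any
            // number of independent consumers can read it later.
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged_in"));
        }
    }
}
```

Because the broker persists every message, the publisher never needs to know who will read it or how many readers there are; that decoupling is what unwinds the point-to-point mesh described above.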
Kafka has some cool features:
1. Support for a practically unlimited number of ad-hoc consumers
For example, if a data scientist or analyst is in the initial stages of analyzing the organization's data, they can create an ad-hoc consumer, subscribe to any Kafka topic, consume data, and then stop the consumer, all without affecting Kafka's performance. It is a very powerful feature (see the consumer sketch after this list).
2. Kafka supports a large number of consumers
Kafka can support a practically unlimited number of consumers without any performance overhead, which is a big challenge for traditional messaging systems.
3. Kafka supports batch data consumption
If we have a process that wants to consume aggregated data on an hourly or daily basis, this is easily achievable with Kafka, whereas traditional messaging systems typically do not support it. This feature makes integrating Kafka with the Hadoop framework natural, since Hadoop is also a batch processing framework.
4. Kafka guarantees very high availability
Kafka internally replicates data across its brokers, so even if one broker goes down, the data remains available without any significant performance hit. It is one of Kafka's distinguishing features and very useful when dealing with high volumes of data.
5. Kafka supports very high message throughput
A small cluster of three Kafka brokers can support on the order of a million messages per second, which is critical when handling clickstreams and other data sources with very high message rates.
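As a sketch of the ad-hoc consumer idea from feature 1 above, here is roughly what a throwaway consumer looks like with the Java consumer client (version 2.0 or later for the Duration-based poll). The topic name and group id here are hypothetical; each ad-hoc consumer only needs its own group id to get its own copy of the stream.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AdHocConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "adhoc-analysis");           // hypothetical group id
        props.put("auto.offset.reset", "earliest");        // start from the oldest data
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events")); // assumed topic
            // Poll for a little while, print what we see, then simply stop.
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        } // Closing the consumer has no effect on the brokers or other consumers.
    }
}
```

When the analysis is done, the process simply exits; the brokers and every other consumer carry on unaffected, which is why spinning up such throwaway readers is cheap.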
Now let’s discuss some use cases of Kafka:
1. Collecting server metrics
In some use cases we might have hundreds or thousands of nodes and want to gather metrics from them, such as CPU load, disk I/O, and queue length. We can collect all these metrics into Kafka, and then multiple monitoring systems can consume the data from it.
2. As a messaging system
If we want one application to notify another about an event, the first application can publish the event to Kafka, and the other can subscribe to the relevant topic and consume it.
3. Click Stream tracking
If we have a web application and want to track user activity (clicks, page searches, etc.) and analyze it in another system, we can send all these events to Kafka, and Kafka will take care of replicating them among its brokers with very high performance.
4. Auditing application activity
If we want to audit user activity within an application and perform analysis on it later, we can send that activity to Kafka.
5. Stream Processing
If we have a continuous stream of data coming in from sensors and want to process events in real time, then Kafka can easily integrate with Apache Storm and Spark Streaming.
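Heavier transformations usually go to Storm or Spark Streaming, but a bare consumer poll loop already gives a flavor of real-time processing. Below is a hypothetical sketch that keeps a running count of click events per page; the clickstream topic name and the assumption that each record's key is the clicked page URL are mine, purely for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ClickCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "click-counter");           // hypothetical group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> clicksPerPage = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("clickstream")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assume the record key is the page URL that was clicked.
                    clicksPerPage.merge(record.key(), 1L, Long::sum);
                }
                if (!records.isEmpty()) {
                    System.out.println("running counts: " + clicksPerPage);
                }
            }
        }
    }
}
```

A real deployment would hand this kind of aggregation to Storm or Spark Streaming for fault tolerance and scaling, but the consumption pattern is the same.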
So far we have seen what Kafka is and where we can use it. Now let's talk about its limitations.
1. Kafka does not support encryption or authentication as of now.
2. Kafka does not support data transformations right now. We can connect it to systems like Storm to achieve this.
3. Kafka is not a solution that runs out of the box on its own. We need to write custom producer and consumer code to fit our use case, as in the sketches above.
In this article, we discussed Kafka's key features and some use cases for it in your organization. I hope it helps you decide whether Kafka can solve your enterprise problems and is worth the investment of development time.
Personally, I find Kafka to be an impressive, reliable, and robust distributed messaging system that can simplify the development of your organization's Big Data solutions.
For anyone who wants to learn more about Kafka and other Big Data technologies through step-by-step implementation examples, I will be posting video tutorials on my website www.debusinha.com in the coming days.