Apache Kafka

What?

Apache Kafka is a distributed streaming platform and data pipeline that enables real-time data ingestion and messaging. It runs on clusters of machines to provide scalability, availability, throughput, and performance.

It allows you to build data pipelines in which producers publish data to topics (divided into partitions), and consumers read the data from the appropriate partitions.

Let us understand the key terminology:

  • producer - an application that sends messages to Kafka 
  • message - a simple array of bytes, as far as Kafka is concerned 
  • consumer - an application that reads and processes messages from Kafka 
  • cluster - a group of machines/nodes that form the Kafka server, each node running an instance of the broker 
  • topic - a unique name for a Kafka stream 
  • partition - Kafka creates multiple partitions for a topic and stores each partition on a node/machine 
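To see how these pieces fit together, here is a minimal sketch (plain Python, not a real Kafka client) of how a keyed message maps onto one of a topic's partitions. Kafka's default partitioner hashes the message key (murmur2 in the Java client); CRC32 stands in for it here purely for illustration.

```python
# Illustrative sketch only: deterministic key-to-partition mapping.
# A real client library would do this for you.
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index deterministically."""
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land on the same partition,
# which is what preserves per-key ordering in Kafka.
assert pick_partition(b"user-42", 6) == pick_partition(b"user-42", 6)
```

Because the mapping is deterministic, all events for a given key are appended to the same partition log in order.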


Why Kafka over Messaging?

Kafka offers better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message/event processing applications.

Streaming vs. Message Platform 

Stream 

  • Messages/events are persisted for a specific amount of time, based on the retention-period configuration
  • Any number of consumers can pull the messages any number of times
  • Supports the partitioned consumer pattern; each consumer app maintains the counter/offset from which it starts reading on the partition
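The consumer-managed offset described above can be sketched in a few lines of plain Python (an illustration, not the real consumer API): the log retains every message for the retention period, and each consumer tracks its own read position, so consumers read the same data independently and can even rewind.

```python
# Conceptual sketch: a retained partition log plus per-consumer offsets.
log = ["m0", "m1", "m2", "m3"]           # messages kept for the retention period

class OffsetConsumer:
    def __init__(self):
        self.offset = 0                   # next position this consumer reads

    def poll(self):
        if self.offset < len(log):
            msg = log[self.offset]
            self.offset += 1              # the consumer advances its own offset
            return msg
        return None

a, b = OffsetConsumer(), OffsetConsumer()
assert a.poll() == "m0" and a.poll() == "m1"
assert b.poll() == "m0"                   # b reads independently of a
a.offset = 0                              # rewind: messages are still there
assert a.poll() == "m0"
```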

Message 

  • Messages are not persisted; a message is deleted once read by a consumer/worker app
  • Once a message is read by any one consumer, it is no longer available to other consumers
  • Supports the competing consumer pattern, where consumers compete to read messages from the broker

Use cases/Scenarios  

Messaging - compared to existing enterprise messaging tools, Kafka provides high throughput, low latency, and persistence of messages 

Clickstream analysis - website activities such as views and searches are published to topics, and the subscribers can consume the data 

Log aggregation - Kafka abstracts away the log files and provides the logs as a stream of messages 

Event sourcing - Kafka is a very good backend for event-sourcing-based applications 

Stream processing - Kafka Streams processes real-time data, aggregating and transforming it for further consumption
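The kind of aggregation a Kafka Streams topology performs can be sketched as a running per-key count. Real Kafka Streams code would use the KStream/KTable DSL in Java; plain Python stands in here just to show the shape of the computation.

```python
# Hedged sketch of a stateful per-key aggregation over a stream of events.
from collections import defaultdict

def aggregate(events):
    """Count events per key, like a grouped count over a stream."""
    counts = defaultdict(int)
    for key, _value in events:
        counts[key] += 1        # running state, updated per event
    return dict(counts)

clicks = [("page-a", "view"), ("page-b", "view"), ("page-a", "view")]
assert aggregate(clicks) == {"page-a": 2, "page-b": 1}
```

In a real topology this state would live in a partitioned, fault-tolerant store, with the aggregated results published back to a topic for downstream consumers.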

Key Benefits:

  • Kafka can handle millions of events per second.
  • Clustering enables high throughput and availability.
  • Geo-replication across clusters enables fail-over.
  • Fault-tolerant, performant and scalable.

Challenges:

  • Kafka setup and configuration are a bit complex, but this can be overcome by using managed Kafka services.
  • It lacks a full set of monitoring and management tools.
