Re-define data integration with Kafka Stream

The basic concept of ETL has been around in the industry for many years, but as the volume of exchanged data grows it is slowly being pushed to its limits, demanding more real-time and scalable integration. Kafka Streams helps tackle most of these problems in a very innovative way with minimal development effort. Nothing beats an example, so let's take a use-case I recently worked on.

Use-Case: Events are streamed into the system at up to 50,000 per second. These need to be joined with a dimensional dataset. The result of that first join should then be available to be joined with a second stream of data, and the final result needs to be stored in Mongo and Elastic Cache. To make it interesting, dimensional data flows in from several sub-systems as well as SQL Server, Dynamo, S3, etc.

Let's try to solve it the old way. We would need to load the fact and dimension data into some data layer like HDFS, then have multiple jobs operate on this data set to aggregate, join and create intermediate tables. Spark jobs can make this much faster, but it takes real effort to synchronize them so the result is effectively real-time. This infrastructure is also not easy to scale because of the number of different frameworks and stacks involved. I have struggled with this in the past, and though it is highly effective for handling petabytes of data, I am not really a fan when it comes to real-time streams. The bottom line is that there is a good amount of development and operational cost, and we should keep in mind the rule of thumb - any design which adds operational burden is a bad design.

This is roughly what that design looks like. Of course, the tech stack can be anything - for example, the database can be replaced by Hadoop or SQL Server, while the jobs can be Spark or SQL jobs on top of SSAS.

Let's see how Kafka Streams solves this problem. Again, the key is not to store the data but to process it as streams.

First, Kafka is a real-time, scalable and distributed streaming platform, which allows any amount of data to be streamed in real time. You can refer to this blog to understand more about Kafka. A Kafka Streams application works on top of the Kafka broker and lets developers write fault-tolerant, scalable micro-services. It processes data from multiple topics as it arrives, lets you define state stores, apply various aggregations, joins and transformations, and finally write the processed data to one or more output topics.
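
To make that concrete, here is a minimal sketch of what such an application could look like using the DSL. The topic names (events, dimensions, enriched-events) and the plain-string payloads are my own assumptions for illustration, not part of the original use-case; a real application would pick proper serdes and keys.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class EventEnrichmentApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enrichment-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Fact events arriving at high volume
        KStream<String, String> events = builder.stream("events");

        // Dimension data replicated to every instance as a global table
        GlobalKTable<String, String> dimensions = builder.globalTable("dimensions");

        // Join each event with its dimension record and write the result out
        events
            .join(dimensions,
                  (eventKey, eventValue) -> eventKey,                     // map event to dimension key
                  (eventValue, dimValue) -> eventValue + "|" + dimValue)  // merge the two payloads
            .to("enriched-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Once this class is running, the join happens continuously as records arrive; there is no batch job to schedule.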

This can be segregated into four parts -

  1. First, we need to get all the data into Kafka. There are multiple ways to stream data into Kafka, but Kafka Connect is usually preferred because of its out-of-the-box distributed, scalable and extensible nature. There are several connectors already available, such as SQL, Mongo, Dynamo, S3, Elastic, Salesforce, etc., which makes this the easiest way to get your data into Kafka. One can also post data directly, and it is very easy to write a connector since several clients are already available.
  2. Once data is in Kafka, we need to process it using Kafka Streams. This app will read all the data from the fact/dimension topics and apply aggregations, joins and transformations, which is done by defining multiple processors. There are two ways to achieve this - the Processor API and the DSL. The Processor API provides low-level control, while the DSL is an abstraction layer on top of it. Most of the time the DSL suffices, but if you need finer control, the Processor API is a very powerful library that can easily be extended. A short DSL sketch of the second join follows this list.
  3. If you need to maintain state, you can use a state store. Let's say we have event1 from one topic, but we need to wait for event2 from a different topic before we can apply the join. Such cases can be solved by storing event1 in a state store until event2 arrives. There are several flavours of state store available and we can even write our own. They come in two types, which I will come back to later.
  4. Once we have the final data, it can be written to an output topic. This topic can then be streamed through Kafka Connect again to store the data in the final data stores. For example, the output topic can be used to build a searchable index in Elastic Cache, reporting data in Mongo, dashboard data in a Redis stream, etc. You can also run another Kafka Streams application on top of this topic to process the data further if needed.
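
As a rough sketch of how steps 2 and 3 fit together, the snippet below joins the output of the first join with the second event stream inside a one-minute window; Kafka Streams buffers both sides in windowed state stores behind the scenes until a matching key arrives. The topic names, window size and string payloads are assumptions of mine, not details from the original use-case, and default string serdes (as configured in the earlier sketch) are assumed.

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class SecondJoinTopology {

    // Builds the part of the topology that joins the first join's output
    // with a second stream of events arriving on another topic.
    static void build(StreamsBuilder builder) {
        KStream<String, String> enriched = builder.stream("enriched-events");
        KStream<String, String> secondStream = builder.stream("second-events");

        enriched
            // Records from both streams are kept in windowed state stores
            // until a matching key shows up on the other side within the window.
            .join(secondStream,
                  (left, right) -> left + "|" + right,
                  JoinWindows.of(Duration.ofMinutes(1)))
            .to("final-output");
    }
}
```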

The design will look somewhat like this. To me, this is a much cleaner implementation.

The best part of a Kafka Streams application is that it is a micro-service, which means we can run several instances of it and the input data will be divided between them. Scaling is as easy as adding a new Docker container; Kafka takes care of distributing the data across instances out of the box. It's as easy as it gets.

State Store

A state store allows you to keep state for the stream. People who have used a cache like Redis in the past might be able to relate to it, but it is slightly more complicated than that. I asked myself why not use a centralized Redis cache accessible by all instances. Let's say there is a dedicated Redis cluster and all the K-Stream applications use it for reads and writes. Since Redis is not hosted in the same container, every lookup becomes a remote call over the network, and with a huge data inflow those repeated network round trips cost a lot of performance and eventually become a bottleneck. This matters a great deal if we are talking real time. The problem is solved by using a state store bundled locally with your K-Stream application. There are two types currently available:

  1. Global State Store - The same copy of the data is maintained by all the K-Stream application instances. This state store comes with its own processor and source, but no sink. Any data written to this source is read by the processors of each K-Stream instance, which can then apply any transformation, if required, and create a local copy by writing it to the store. So essentially, when data is written to the Kafka topic, every K-Stream application instance reads it and creates a local copy.
  2. Local State Store - Each K-Stream app instance keeps its own copy of the data in its local state store. This is useful when the data is partitioned and you know that each app instance will always deal with one subset of the data.

The best part is that all the state stores are backed by Kafka topics, so they are fault-tolerant.
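
In DSL terms, the difference boils down to building a GlobalKTable versus a KTable. The sketch below is only illustrative; the topic names and the string serdes are assumptions on my part.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class StateStoreFlavours {

    static void build(StreamsBuilder builder) {
        // Global state store: every application instance materializes a full
        // local copy of the "dimensions" topic.
        GlobalKTable<String, String> globalDims = builder.globalTable(
                "dimensions",
                Materialized.with(Serdes.String(), Serdes.String()));

        // Local state store: each instance materializes only the partitions of
        // "user-profiles" that it has been assigned.
        KTable<String, String> profiles = builder.table(
                "user-profiles",
                Materialized.with(Serdes.String(), Serdes.String()));
    }
}
```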

SerDe

This is another concept that is somewhat hard to wrap your mind around. When you write data into and out of Kafka, it needs to be serialized/deserialized. Avro is the most widely used library for this; it lets you define a schema that all producers and consumers can use to serialize and deserialize data when writing to and reading from Kafka. However, I am not a fan of this approach for reading from and writing to a state store: since the store is local to the application, much like IPC sharing, the control lies entirely with the application, and one can use whatever format is most efficient. I have used protobuf and Gson for this.
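
As an illustration, a small Gson-backed serde for state stores could look roughly like the following. This class is a sketch of my own (targeting recent Kafka client versions, where serializers and deserializers can be lambdas), not something shipped with Kafka.

```java
import com.google.gson.Gson;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

public class GsonSerde<T> implements Serde<T> {

    private static final Gson GSON = new Gson();
    private final Class<T> type;

    public GsonSerde(Class<T> type) {
        this.type = type;
    }

    @Override
    public Serializer<T> serializer() {
        // Serialize the object to its JSON representation as UTF-8 bytes
        return (topic, data) ->
                data == null ? null : GSON.toJson(data).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Deserializer<T> deserializer() {
        // Parse the UTF-8 JSON bytes back into the target type
        return (topic, bytes) ->
                bytes == null ? null : GSON.fromJson(new String(bytes, StandardCharsets.UTF_8), type);
    }
}
```

A serde like this can then be plugged into a state store via Materialized.with(...), while the topics themselves keep using Avro.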

This example is minuscule compared to what is possible with the Kafka stack. I enjoy every bit of it and I strongly urge the developer community to start exploring it and re-define the integration layer. Lastly, there are several ways of solving this problem and there might be better ways than how I have approached it. Let me know how you have solved it in case you have used a different approach.
