2-Way Replication With Apache Kafka

Anyone who works, or has ever worked, with Apache Kafka™ knows that replicating data between Data Centers or between different clusters is not very straightforward. Of course, you can use the bundled MirrorMaker, but this tool has some limitations, the first one being that it is unidirectional, i.e., you can only synchronize your data from a "Master" Data Center to a "Slave". In this article, I present a "simple" solution that allows you to replicate your data between Data Centers in both directions. But first, a definition of Apache Kafka™.

From the Apache Kafka site definition:

Apache Kafka™ is a distributed streaming platform

Ok, this is a very simplistic definition, but you can find a lot more information on the Apache Kafka website. Kafka can be defined in many ways; it is basically an event logging system, based on a Publish-Subscribe architecture, but it can serve a large number of Use Cases, and it is not the goal of this article to cover them. LinkedIn itself uses Apache Kafka heavily for most of its main operations, and it is the most famous Use Case mentioned on the Kafka website as well.

For the purpose of this article, what is important to know is that:

  1. Apache Kafka uses Topics to publish data
  2. It is managed by Apache Zookeeper, which is shipped along with Kafka
  3. A Zookeeper Cluster is called "Zookeeper Ensemble"
  4. An Apache Kafka Cluster is managed by a Zookeeper Ensemble
  5. And last, but definitely NOT least, the recommendation is that you keep a Kafka Cluster and its Zookeeper Ensemble within the same Data Center, mainly because of network latency between Data Centers. And here is where the theme of this article kicks in.
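To make point 1 above concrete, here is how a Topic might be created and listed with the command-line tools shipped with Kafka. The broker address and the Topic name ORDERS are placeholder examples, not anything from this article's setup:

```shell
# Create a Topic named ORDERS on the local cluster
# (on older Kafka versions, use --zookeeper localhost:2181
#  instead of --bootstrap-server)
bin/kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic ORDERS \
  --partitions 3 \
  --replication-factor 2

# List the Topics managed by this cluster
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```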

So, naturally, one question arises: "what happens if I need to replicate the data between Data Centers?". If you had a "cross-Data Center" cluster / ensemble you would not need to worry about replication, since Zookeeper and Kafka would take care of it. But, since this is not recommended, I provide a new solution for this problem in this article.

At first glance, you could use MirrorMaker to solve this problem. MirrorMaker is a tool that comes with Apache Kafka out of the box. Its purpose is to replicate data between Data Centers, and its function and setup are not really complex: it sets a listener on your original Topic in one Data Center and publishes every incoming message to the destination Topic in the other Data Center.

Simple and clear, but it brings one issue: you NEED to have one Data Center working as "Master" and the other as "Slave". It means that replication occurs ONLY from Master to Slave. Any publication made on Slave will NOT be replicated to Master. MirrorMaker works like this by design.

For many cases, this architecture will work, but there are cases where you need to provide a 2-way replication, which cannot be achieved with MirrorMaker. You may even think that it would be a simple matter of setting up a MirrorMaker on the "Slave" Data Center, and problem solved! But it is not so simple: if you do this, you end up in a loop, since the messages published on "Slave" would be published to "Master", which in turn would be picked up by the "Master" MirrorMaker and published back to "Slave", which would be read by the "Slave" MirrorMaker again and sent to "Master", and so on... It would never end! Therefore, if you need 2-way synchronization, I propose a new solution, which is also not very complex, but requires some setup to work and some things to consider.
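The runaway loop can be illustrated with a small simulation. This is not Kafka code, just a sketch of the message flow: each "round" represents both MirrorMakers forwarding whatever new messages appeared on the other side, and the delivery count for one single original message keeps growing forever:

```python
def simulate_naive_mirrors(rounds: int) -> int:
    """Count deliveries of ONE original message when a MirrorMaker
    runs in BOTH directions on the same Topic name."""
    # Messages on each side that the local MirrorMaker has not forwarded yet.
    new_on_master, new_on_slave = [], ["m1"]  # one message published on Slave
    deliveries = 1                            # the original publication
    for _ in range(rounds):
        # The Slave's MirrorMaker forwards Slave's new messages to Master,
        # and the Master's MirrorMaker forwards Master's new messages to Slave.
        to_master = list(new_on_slave)
        to_slave = list(new_on_master)
        new_on_master, new_on_slave = to_master, to_slave
        deliveries += len(to_master) + len(to_slave)
    return deliveries

# One message, yet every round of mirroring delivers it again:
print(simulate_naive_mirrors(0))   # 1 delivery (the original publish)
print(simulate_naive_mirrors(10))  # 11 deliveries, and still counting
```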

Basically, the solution is pretty simple: for each Topic <SOME_TOPIC> that you need to publish to, you will create an identical Topic named <SOME_TOPIC_REPLICA>. Then, your application needs to be developed in such a way that EVERY publish goes to the "REPLICA" Topic, while all subscriptions are made to the "real" Topic.
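The convention above can be sketched with a couple of small helpers. The "_REPLICA" suffix and the Kafka-style `produce`/`subscribe` methods on the injected clients are assumptions for illustration, not part of any specific library:

```python
REPLICA_SUFFIX = "_REPLICA"  # naming convention assumed in this article

def replica_topic(topic: str) -> str:
    """Topic the application PUBLISHES to."""
    return topic + REPLICA_SUFFIX

def real_topic(topic: str) -> str:
    """Topic the application SUBSCRIBES to."""
    if topic.endswith(REPLICA_SUFFIX):
        return topic[: -len(REPLICA_SUFFIX)]
    return topic

def publish(producer, topic: str, value: bytes) -> None:
    # Every publish goes to the REPLICA Topic; `producer` is any client
    # exposing a Kafka-style produce(topic, value=...) method.
    producer.produce(replica_topic(topic), value=value)

def subscribe(consumer, topics) -> None:
    # Every subscription goes to the "real" Topic.
    consumer.subscribe([real_topic(t) for t in topics])
```

Wrapping the naming convention like this keeps application code honest: no caller can accidentally publish to the "real" Topic directly.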

If you set up your application like this, it gets really simple to apply a 2-way replication - you cannot use MirrorMaker even for this solution, though, because MirrorMaker acts exclusively on Topics with the SAME name.

But all you need to do now is create a listener (which I'm calling here a "Custom MirrorMaker") that subscribes to <SOME_TOPIC_REPLICA> on one Data Center and publishes to <SOME_TOPIC> on BOTH the local and remote Data Centers. Then you set up the same component on the other Data Center. The diagram below shows an overview of how it would be set up.
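Here is a minimal sketch of what such a Custom MirrorMaker could look like, assuming the confluent-kafka Python client, the "_REPLICA" naming convention, and placeholder bootstrap addresses. The fan-out rule is kept as a pure function so it can be reasoned about separately from the clients:

```python
REPLICA_SUFFIX = "_REPLICA"

def fan_out(replica_topic, local_bootstrap, remote_bootstrap):
    """For a message read from <TOPIC>_REPLICA, return the (cluster, topic)
    pairs it must be published to: the 'real' Topic on BOTH Data Centers."""
    real = replica_topic[: -len(REPLICA_SUFFIX)]
    return [(local_bootstrap, real), (remote_bootstrap, real)]

def run_custom_mirror_maker(replica_topics, local_bootstrap, remote_bootstrap):
    # Lazy import so the routing logic above stays usable without the client.
    from confluent_kafka import Consumer, Producer  # assumed client library

    consumer = Consumer({
        "bootstrap.servers": local_bootstrap,
        "group.id": "custom-mirror-maker",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(list(replica_topics))
    producers = {b: Producer({"bootstrap.servers": b})
                 for b in (local_bootstrap, remote_bootstrap)}
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            for bootstrap, topic in fan_out(msg.topic(),
                                            local_bootstrap, remote_bootstrap):
                producers[bootstrap].produce(topic, value=msg.value(),
                                             key=msg.key())
            for p in producers.values():
                p.poll(0)  # serve delivery callbacks
    finally:
        consumer.close()
```

The same component is then deployed on the other Data Center, consuming that side's local "_REPLICA" Topics, so both directions are covered.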

[Image: diagram of the Custom MirrorMaker setup across both Data Centers]

Done! When you do this, all data that is published to the <SOME_TOPIC_REPLICA> Topic will be replicated to <SOME_TOPIC> on BOTH Data Centers, so that all your consumers get the messages, no matter whether they are on the local or remote Data Center, keeping the data in sync both ways.

Of course, you need to be VERY careful about several aspects of this solution, so in the next article I will go through the issues that may arise and propose a solution for each one.

If you have questions about this matter or if anything is unclear, please feel free to ask or send me a message.

In "part 2" of this article I describe a few possible issues with this solution and provide some ideas on how to solve them.

See you soon!

Good article! However, the new MirrorMaker 2 eases multi-Data-Center replication (one way or two ways) out of the box. With Kafka 2.7 it also replicates consumer offsets.

Marcio, thanks for the article. How can I synchronize consumers across both Data Centers if I want them to act as one consumer group, so that each message is processed only once? Thanks for the article again...


Marcio! You have to provide the second part of the article.


How did you implement the Custom MirrorMaker? Would you use Apache Storm?

