Change Data Capture (CDC): Lessons Learned Building a Solution from Scratch

Once upon a time, in a data pipeline not so far away

At Rivery, we are all about data, specifically managing the data pipeline lifecycle. So, it’s no surprise that part of what we need to do is to not only move data from point A to point B. What we also need to be able to do is to move updates and changes from point A to point B in the most efficient and fastest way possible.

So we went out looking for the best solution to help our customers take care of these updates and changes, i.e., with change data capture (CDC).

What we found is a well-regarded open-source CDC solution. But it just didn’t work for us. There were a number of hurdles that we had to overcome, and which ultimately made this option a non-starter.

If you’re looking to deploy a CDC platform, then it’s worthwhile to jot down the challenges we faced, and learn from our experience, so you can be more confident in the solution you choose and that it will support your needs.

But first, let’s take a quick look at why CDC is so important in the first place. And then we’ll get into why the platform we chose wasn’t optimal for our needs, and what we had to do to make CDC work for us and our customers.

Why CDC in the first place?

The need for replication speed

As data engineers and analysts, we need to be able to move data from our relational databases (point A), such as SQL Server or MySQL, to data warehouses, data lakes, or other target databases (point B).

Whenever there’s a change or update in the database, we also need to be able to sync between the two in as close to real time as possible, and with as little friction and complexity as possible.

Traditionally, doing this has meant batch data replication, executing once or several times a day. But batch data pulling requires additional computing, provides insufficient inputs on the history of deleted rows, and entails higher latencies.

When it comes to data replication, data engineers and analysts need a different approach.

In comes a different approach

To reduce the overhead, eliminate latency, and enable real-time analytics, organizations have moved away from batch or bulk load updating to incremental updating with change data capture.

Basically, CDC speeds up data processing by eliminating the need for full-scale database replication in the ETL/ELT pipeline, and creating an analytics database as a separate copy of the production database.

This process entails identifying and capturing any data change (i.e., insert, updates, deletes) to the database logs in real time at point A, using the database engine's native API and delivering changes to point B.

Because it only deals with changes to the logs, it eliminates the need for ongoing database replication using the database engine, thereby minimizing the resources required for ETL/ELT processes.

And since it deals with new database events as they occur, it enables real-time or near-real-time data movement.

As such CDC is ideal for near-real-time business intelligence as well as for cloud migrations.

The Debezium path to CDC

At first, we decided to go with Debezium. Among the different CDC tools out there, with 7K GitHub stars and 1.8K GitHub forks, Debezium is one of the most popular.

This is an open-source distributed platform that’s built on top of Apache Kafka, where you can:

“Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases.” (Debezium)

The advantages of Debezium are said to include:

Durability and speed
Scalability in handling large volumes of data
Incubating connectors for MySQL, SQL Server, Postgres, and Oracle
Community sink connectors for ElasticSearch
Sink connectors for MySQL and SQL Server
MongoDB connector maturity
Fault tolerance

And, of-course, the fact that it’s open-source was an additional advantage for us. So, we went with Debezium.

Our short affair with Debezium

We were feeling good. Optimistic. Debezium is known for being great. And we want nothing less than the best for our customers.

We implemented Debezium, which we were planning to manage on the backend, as our customers were to enjoy its benefits when CDC was required by processes executed by our platform on the front end.

But, sadly, Debezium just didn’t work for our needs. Some of the main issues we faced include:

Running after open source version updates

Embedding a core solution which is open source into your own product is challenging. You need to run over any new version, track the changes, and make sure your product is stable in light of all these changes, all while reviewing all other prioritizations, fixing velocity, and implementing security patches. So, while Debezium is a well-regarded solution for some cases, for us, the overhead from version upgrades, and the impact on our customers, made the effort greater than expected.

Error messages

With Debezium, as with any solution that’s built on Kafka Connect, error messages provide the full Java stack, without the option to search for the root cause underlying the message. This lack of visibility made remediation very difficult, and sharing with our customers what the real error was behind the message, quite challenging.

Scale

On Debezium and Kafka Connect all the pipelines run on the same connectors. The result is a round robin among the pipelines that makes scale complex and often out of reach as the number of sources grows.

Code transparency

Debezium uses Kafka Connect to move data into Kafka. The issue with Kafka Connect is that it doesn’t come with the required level of code transparency, which also complicated the maintenance of CDC processes for hundreds of different types of configurations that our customers have executed in their own databases.

Logging

Debezium logs are not aligned with our formatting, nor do they provide information that we needed. To illustrate, every row in the error stack is provided as one row in the logs. This is something that makes tracking very challenging. Some logs are even hidden in certain use cases, while in other cases there is over-logging.

Topologies support

While it does offer support for certain topologies, it doesn’t support the hundreds of different types of topologies that are used by our customers.

SSH tunnels

Whenever a customer organization was using SSH tunnels for connectivity, the solution couldn’t support it.

We needed (and ultimately created) a solution that would connect to our client’s databases using different topologies, such as SSH tunnel, SSL, and VPN tunnels. SSH tunnel is commonly used to connect remote databases in internal VPCs, for example, and isn’t embedded out of the box in Debezium.

Use cases

The biggest issue is that Debezium use cases couldn’t be tailored to our customers’ needs. We were getting the feedback that the open platform simply wasn’t working with their specific data types and tables, for example.

So, What to make sure you evaluate when considering a CDC platform?

Frequent version updates: will you be able to handle all the merges and heavy maintenance?
Error messages: does the platform deliver clarity for quick and efficient remediation?
Scale: does it work with all of your database types?
Code transparency: can you easily get into the code to make changes when needed?
Logging: do the log files provide the information you need?
Topologies support: does it support your topology?
SSH tunnels: does it support SSH connectivity and other connectivity best practices?
Use cases: can it handle all your use cases, tables, etc.?

Unfortunately for us, the answer to all of the above was ‘no.’ So, we decided to take things into our own hands, and develop our own CDC solution.

A new approach to CDC

Our goal was to create another solution, our own platform, which we could intuitively know how to maintain and debug. When the code is yours, you have access to it, you have visibility, you know how to handle any issue that might come up.

But building your own CDC platform is no simple task. It requires a lot of knowledge about so many different possible specifications that need to be applied to just about every database. It’s not just about knowing your own data, database, tables, and topology. It’s about knowing everything about them, and how to get data from point A to point B seamlessly, quickly, and efficiently.

But for us, and for our customers, the advantages to creating our own CDC platform, far outweighed the challenges.

So, what did we do? We mapped out all the issues we were having , aggregated insights, got our best and brightest together for the task, and built our own CDC platform as follows.

For the programming language we evaluated and considered Python and Go.

In our consideration we quickly realized that Python doesn’t support multithreading well enough, nor does it offer a robust solution for Kafka. Since multithreading is so important for processing many events at the same time (a pillar of CDC), and since our solution is on Kafka, we couldn’t go with Python.

So, we went with Go, which offers many advantages:

Multithreading out of the box (using Goroutines).
Task queues using Go channels for concurrency out of the box.
Great support for I/O.
Stability
Flexibility
Very Light.

The architecture

We built an architecture that drives high rates of CDC efficiency. For example, the consumer pushes data log entries to queues and everything is orchestrated by the manager, which also performs validation for MySQL and creates a status report if something doesn’t fit or connect.

This also drives great flexibility, enabling a new connector to be deployed in less than two weeks, as opposed to the three weeks that are typically required.

SSH connectivity and other topologies

To support our customers’ connectivity needs, we created our own SSH tunnel as an external service that we embedded as a side car. This new solution provided us with the ability to align with our clients’ connecting topologies, as well as to offer better, higher performance, and more stable solutions for their needs.

Centralized management

The consumer in this architecture can push data to Kafka. It’s a system that can get messages at a very high capacity, knowing how to bring a lot of different data types very quickly from a lot of different sources, all with the ease and the clarity of central management.

A transactional process against Kafka

Now, when it comes to pushing data to Kafka, the execution needs to be transactional through the Kafka API. So, we needed to create a transactional process that would collect all the data that we needed to push, but which would also let the user know where this collection started and to where exactly it would be pushed. And this is the hard part!

But this is exactly what we did. Having such visibility into the process helps you to avoid delays. Because when events interrupt the consuming, it’s hard to know where you started, where you stopped, and where you need to pick up again. And this can cause a lot of very problematic delays.

Going back and forth enables us to manage any change in our customers' databases and make sure we’re catching all of the messages quickly and easily.

The flip/flop functionality

Knowing where you started and where you left off is also key to the flip/flop functionality that we designed and implemented.

What I mean by that is that we created two queues instead of one, so that we could make sure that messages are accepted by Kafka while at the same time we could fetch more data from the source without causing any processes to be blocked.

So, whenever one queue would get filled up, the data coming in would go into the second empty queue instead of creating a backlog. Only once the full queue had emptied out, would we get back to using it for new data coming in.

This flip/flop functionality is only really possible when you know where you started, where you are, and where you left off, so you can go fetch only the data you need from the database, without any unnecessary replications.

The benefit of this approach is not only about avoiding delays. Although this was fantastic – where we reduced delays down to about 10 seconds vs. the 2-3 minutes that we had with Debezium when we opened it for scale.

Additional benefits include:

Eliminating the need for redundancy
Greater stability
Scalability out of the box

And the biggest benefit of all was that we could finally focus on the business logic and address our customers’ needs, instead of expending tons of energy on database replication.

No overhead. Just value.

In conclusion

So, if you need CDC to help you streamline, accelerate, and drive efficiency in synchronizing data, you could go with a popular platform, but it may not be able to meet all of your needs.

Or you could design your own – but think of the overhead. Would you really want to take all that on? Especially when we did all the heavy lifting already for you?

Bottom line, with the Rivery CDC solution you get some of your toughest challenges pre-solved:

A language that’s very light, and which supports multithreading out of the box, compatibility with multiple channels, I/O support, and stability

A flexible architecture that drives high rates of efficiency

Built-in SSH connectivity

A transactional process that brings visibility and minimizes delays

Flip/flop functionality with two queues which eliminates the need for redundancy, provides greater stability, and delivers scalability out of the box

Freeing up to focus on business logic

Change Data Capture (CDC): Lessons Learned Building a Solution from Scratch

Alon Reznik

Once upon a time, in a data pipeline not so far away

Why CDC in the first place?

The need for replication speed

In comes a different approach

The Debezium path to CDC

Our short affair with Debezium

Recommended by LinkedIn

So, What to make sure you evaluate when considering a CDC platform?

A new approach to CDC

The architecture

SSH connectivity and other topologies

Centralized management

A transactional process against Kafka

The flip/flop functionality

In conclusion

Others also viewed

Top 7 things to do to succeed in a Big Data project

Part 5: Requirements of a Unified Data Delivery Platform

"Topic: Data Abstraction-3 levels, Entity Relationship (ER) Model and Database Normalization"

How Apache Hudi Transformers Revolutionizes Data Transformation in Delta Streamer

The Role of Open Table Formats in Data Lakehouse

Data Architects could be tomorrow’s most prominent technologists and this is what they might be thinking about (Part 1): Data Integration

What is the Data Lakehouse and the Role of Apache Iceberg, Nessie and Dremio?

3 Reasons Data Engineers Should Embrace Apache Iceberg

Data Lineage challenge: A holistic solution

Architecting Data: The Dual Paradigms of Schema on Write and Schema on Read

Explore content categories