Toward Consumer-driven Pipelines
Produce it, and they will consume.
Prior to Kafka, a major architectural problem was "fan-out", where a service must notify N other services of some change or event. Pub/sub systems help here, since any number of consumers can come along and subscribe to a topic. Kafka does this while making relatively strong delivery guarantees, something you definitely don't want to implement within each service. The lovely thing about pub/sub is that data producers don't need to know about their consumers, which is good because they likely don't exist yet.
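As a concrete sketch of that fan-out pattern: in Kafka, each downstream service subscribes with its own consumer group, so every service gets a full copy of the stream and the producer never has to know how many consumers exist. The topic and group names below are illustrative, not anything real.

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class FanOutConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Each service picks a distinct group.id; Kafka delivers the whole
            // topic to every group, which is exactly the fan-out we want.
            props.put("group.id", args.length > 0 ? args[0] : "billing-service");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
                }
            }
        }
    }

Run the same program with "billing-service" and "email-service" as arguments and both instances receive every event: that is the whole trick.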
This naturally yields what I call "producer-driven architecture", wherein an organization's data landscape grows organically based on what is already being produced. New streams are built on top of old ones, and so on, with everything spreading outward forever. If you need some data, you are (probably) in luck! It's right here in this stream. If you can't find exactly what you need, you can easily derive it from some existing set of streams. The best part: now someone else can use that new stream too.
This model has proven extremely useful for the last decade, but I am concerned we're reaching a tipping point. Frequently, the original producer of some data hasn't picked the exact right schema, key, or partitioning scheme for every subsequent consumer's specific needs, forcing downstream teams to re-key, re-partition, or re-encode the stream into another form. Trivial data pipelines such as these are shockingly common; a sketch of one follows.
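Here is roughly what one of those trivial re-keying pipelines looks like as a Kafka Streams job. The topic names and the payload format are hypothetical, purely for illustration: the source topic is keyed by order ID, but some consumer needs it keyed by customer ID.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    import java.util.Properties;

    public class RekeyPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-rekey");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Assume values look like "customerId,restOfOrder" (a made-up format).
            KStream<String, String> orders = builder.stream("orders");
            orders
                // Swap the order-ID key for the customer ID embedded in the value...
                .selectKey((orderId, value) -> value.split(",", 2)[0])
                // ...and write a derived topic, partitioned by the new key.
                .to("orders-by-customer");

            new KafkaStreams(builder.build(), props).start();
        }
    }

A whole deployed application, on call and consuming resources forever, just to shuffle the same bytes under a different key.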
And of course not everything is in Kafka topics, so this problem extends into offline and online storage systems as well. A team may want some data in Kafka when it only exists in HDFS, so they stand up a reverse ETL job. Another team may need the same data with a different key, and so stands up another reverse ETL job. Ultimately, we end up with almost as many pipelines as we have topics.
Which brings us to today, where the "modern data stack" includes ETL, reverse ETL, CDC, connectors, mirror-makers, and materializers, all of which, by definition, can only transform the data you already have into different forms in different places. These are necessary only because the applications (consumers, services, jobs, etc.) that ultimately use this data need it in some place or in some form that doesn't already exist.
The worst offender, frankly, is CDC. These fundamentally complicated pipelines exist only because someone decided to stick data into an online database instead of into a Kafka topic. While that decision was almost certainly the best one at the time, it is probably not the best decision from any subsequent consumer's perspective.
Consume it, and they will produce?
Can we flip this on its head? I have been advocating for an inverse model which I call "consumer-driven architecture". In this model, consumers declare what data they need (a "view") instead of finding existing streams to consume, APIs to call, or tables to query. This enables the platform to materialize the view automatically, and potentially in different ways over time as the landscape changes.
In the happy case where producers and consumers happen to agree on everything, we wouldn't incur any additional cost from using materialized views, since there is then nothing to materialize. A view of a topic is just the topic, if it happens to have the right shape. But if it doesn't, we can automatically wire up a pipeline that delivers a derived topic. Under the hood, this may involve complex CDC and reverse ETL jobs spanning online and offline systems, but the consumer would just see a topic. Moreover, we wouldn't need to stand up these pipelines manually: the applications declare exactly what they need, and we can derive pipelines from that, as the sketch below illustrates.
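To make that concrete, here is a minimal sketch of what the consumer side might look like. None of this is a real API; ViewSpec, Platform.materialize, and the field names are hypothetical stand-ins for whatever declaration mechanism such a platform would actually expose.

    import java.util.List;

    /** Hypothetical descriptor a consumer hands to the platform. */
    record ViewSpec(String name, String source, String keyField, List<String> fields) {}

    /** Hypothetical platform stub: a real one would plan and deploy pipelines. */
    class Platform {
        static String materialize(ViewSpec spec) {
            // If a topic with the right shape already exists, the view is just
            // that topic; otherwise the platform would wire up a derivation
            // pipeline (re-keying, CDC, reverse ETL, ...) and return its output.
            return spec.name();
        }
    }

    public class ConsumerDrivenExample {
        public static void main(String[] args) {
            // The consumer says what it needs, not where it comes from.
            ViewSpec spec = new ViewSpec(
                "orders-by-customer",   // topic the consumer wants to read
                "orders",               // logical dataset; could live in Kafka, HDFS, or a DB
                "customerId",           // desired key
                List.of("customerId", "total", "updatedAt"));

            String topic = Platform.materialize(spec);
            System.out.println("consume from: " + topic);
        }
    }

The point of the stub is the contract: the consumer's code never changes if the platform later swaps a derivation pipeline for a direct topic, or vice versa.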
So that's the dream: a good data pipeline platform should not be producer-driven, but should instead look at what data is needed and deliver it accordingly. This would make it possible to refactor and optimize pipelines over time without consumers noticing, so long as we ensure views are materialized correctly. With sufficiently sophisticated automation, we should be able to construct such pipelines without any manual work at all.
Ryanne, I think you are on to something here. Some of what happens today reminds me of the brittle COM/DCOM days, where the coupling between producer and consumer does not stand the test of time as a business evolves. AMPS allows one to build a "View" that can be formed from multiple topics (joins, etc.) and then produce the output data in a given format, e.g. FIX from topic A, JSON from topic B, with JSON output. AMPS will soon release a "dynamic sow" capability which allows "views" (and sows and queues) to be created and destroyed on the fly, with proper seeding of existing data if desired. When coupled with SQL-92 content filtering and dynamic subscription aggregation, you get much closer to the promise of real-time streaming. https://www.crankuptheamps.com/
Something like https://materialize.com/ seems like a reasonable solution to begin with?