Toward Consumer-driven Pipelines
Produce it, and they will consume.
Prior to Kafka, a major architectural problem was "fan-out", where a service must notify N other services of some change or event. Pub/sub systems help here, since any number of consumers can come along and subscribe to a topic. Kafka does this while making relatively strong delivery guarantees, something you definitely don't want to implement within each service. The lovely thing about pub/sub is that data producers don't need to know about their consumers, which is good because they likely don't exist yet.
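As a concrete sketch of that fan-out pattern: in Kafka, each downstream service subscribes with its own consumer group, so every service gets a full copy of the stream and the producer never has to know how many consumers exist. The topic and group names below are illustrative, not anything real.

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class FanOutConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Each service picks a distinct group.id; Kafka delivers the whole
            // topic to every group, which is exactly the fan-out we want.
            props.put("group.id", args.length > 0 ? args[0] : "billing-service");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
                }
            }
        }
    }

Run the same program with "billing-service" and "email-service" as arguments and both instances receive every event: that is the whole trick.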
This naturally yields what I call "producer-driven architecture", wherein an organization's data landscape grows organically based on what is already being produced. New streams are built on top of old ones, and so on, with everything spreading outward forever. If you need some data, you are (probably) in luck! It's right here in this stream. If you can't find exactly what you need, you can easily derive it from some existing set of streams. The best part: now someone else can use that new stream too.
This model has proven extremely useful for the last decade, but I am concerned we're reaching a tipping point. Frequently, the original producer of some data hasn't picked the exact right schema, key, or partitioning scheme for every subsequent consumer's specific needs, forcing downstream teams to re-key, re-partition, or re-encode the stream into another form. Trivial data pipelines such as these are shockingly common; a sketch of one follows.
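Here is roughly what one of those trivial re-keying pipelines looks like as a Kafka Streams job. The topic names and the payload format are hypothetical, purely for illustration: the source topic is keyed by order ID, but some consumer needs it keyed by customer ID.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    import java.util.Properties;

    public class RekeyPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-rekey");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Assume values look like "customerId,restOfOrder" (a made-up format).
            KStream<String, String> orders = builder.stream("orders");
            orders
                // Swap the order-ID key for the customer ID embedded in the value...
                .selectKey((orderId, value) -> value.split(",", 2)[0])
                // ...and write a derived topic, partitioned by the new key.
                .to("orders-by-customer");

            new KafkaStreams(builder.build(), props).start();
        }
    }

A whole deployed application, on call and consuming resources forever, just to shuffle the same bytes under a different key.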
And of course not everything is in Kafka topics, so this problem extends into offline and online storage systems as well. A team may want some data in Kafka when it only exists in HDFS, so they stand up a reverse ETL job. Another team may need the same data with a different key, and so stands up another reverse ETL job. Ultimately, we end up with almost as many pipelines as we have topics.
Which brings us to today, where the "modern data stack" includes ETL, reverse ETL, CDC, connectors, mirror-makers, and materializers, all of which, by definition, can only transform the data you already have into different forms in different places. These are necessary only because the applications (consumers, services, jobs, etc.) that ultimately use this data need it in some place or in some form that doesn't already exist.
The worst offender, frankly, is CDC. These fundamentally complicated pipelines exist only because someone decided to stick data into an online database instead of into a Kafka topic. While that decision was almost certainly the best one at the time, it is probably not the best decision from any subsequent consumer's perspective.
Consume it, and they will produce?
Can we flip this on its head? I have been advocating for an inverse model which I call "consumer-driven architecture". In this model, consumers declare what data they need (a "view") instead of finding existing streams to consume, APIs to call, or tables to query. This enables the platform to materialize the view automatically, and potentially in different ways over time as the landscape changes.
In the happy case where producers and consumers happen to agree on everything, we wouldn't incur any additional cost from using materialized views, since there is then nothing to materialize. A view of a topic is just the topic, if it happens to have the right shape. But if it doesn't, we can automatically wire up a pipeline that delivers a derived topic. Under the hood, this may involve complex CDC and reverse ETL jobs spanning online and offline systems, but the consumer would just see a topic. Moreover, we wouldn't need to stand up these pipelines manually: the applications declare exactly what they need, and we can derive pipelines from that, as the sketch below illustrates.
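To make that concrete, here is a minimal sketch of what the consumer side might look like. None of this is a real API; ViewSpec, Platform.materialize, and the field names are hypothetical stand-ins for whatever declaration mechanism such a platform would actually expose.

    import java.util.List;

    /** Hypothetical descriptor a consumer hands to the platform. */
    record ViewSpec(String name, String source, String keyField, List<String> fields) {}

    /** Hypothetical platform stub: a real one would plan and deploy pipelines. */
    class Platform {
        static String materialize(ViewSpec spec) {
            // If a topic with the right shape already exists, the view is just
            // that topic; otherwise the platform would wire up a derivation
            // pipeline (re-keying, CDC, reverse ETL, ...) and return its output.
            return spec.name();
        }
    }

    public class ConsumerDrivenExample {
        public static void main(String[] args) {
            // The consumer says what it needs, not where it comes from.
            ViewSpec spec = new ViewSpec(
                "orders-by-customer",   // topic the consumer wants to read
                "orders",               // logical dataset; could live in Kafka, HDFS, or a DB
                "customerId",           // desired key
                List.of("customerId", "total", "updatedAt"));

            String topic = Platform.materialize(spec);
            System.out.println("consume from: " + topic);
        }
    }

The point of the stub is the contract: the consumer's code never changes if the platform later swaps a derivation pipeline for a direct topic, or vice versa.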
So that's the dream: a good data pipeline platform should not be producer-driven, but should instead look at what data is needed and deliver it accordingly. This would make it possible to refactor and optimize pipelines over time without consumers noticing, so long as we ensure views are materialized correctly. With sufficiently sophisticated automation, we should be able to construct such pipelines without any manual work at all.
Ryanne, I think you are on to something here. Some of what happens today reminds me of the brittle COM/DCOM days, where the coupling between producer and consumer does not stand the test of time as a business evolves. AMPS allows one to build a "View" that can be formed from multiple topics (joins, etc.) and then produce the output data in a given format, e.g. FIX from topic A, JSON from topic B, with JSON output. AMPS will soon release a "dynamic sow" capability which allows "views" (and sows and queues) to be created and destroyed on the fly, with proper seeding of existing data if desired. When coupled with SQL-92 content filtering and dynamic subscription aggregation, you get much closer to the promise of real-time streaming. https://www.crankuptheamps.com/
Something like https://materialize.com/ seems like a reasonable solution to begin with?