🚨 Data streaming is due for its Iceberg moment. In this new post, our CEO, Sijie Guo, and Principal Sales Engineer, David Kjerrumgaard, explore why tightly coupled streaming architectures like Apache Kafka are hitting their limits, and what the future looks like.
👉 https://hubs.ly/Q03tR4QS0

Drawing lessons from the lakehouse revolution (e.g., Apache Iceberg), they propose a three-layer architecture for streaming:
🔹 Data – Durable, scalable object storage
🔹 Metadata – Decoupled and authoritative
🔹 Protocol – Stateless, multi-interface routing

Streaming should be as modular, cloud-native, and vendor-neutral as modern analytics:
📉 Cut costs by up to 90%
⚙️ Evolve each layer independently
🔓 Enable open, multi-tool access

Projects like #ApachePulsar and #StreamNativeUrsa are already paving the way! 🌟

#StreamingData #Kafka #Iceberg #Lakehouse
Why streaming data needs a lakehouse revolution
More Relevant Posts
🔥 Building a Stream Processing Platform at OpenAI: insights from Current London 2025

Just watched an incredible session by Shuyi Chen from OpenAI, showcasing how they designed and scaled a powerful stream processing platform using Apache Flink and Apache Kafka. What stood out most was the engineering discipline behind reliability and scale. OpenAI's platform processes massive data streams in real time to support internal analytics, model monitoring, and system observability, all while maintaining low latency and strong consistency.

Key takeaways:
⚙️ Integration between Flink and Kafka for exactly-once delivery and resilience (the underlying primitive is sketched below)
🌍 Multi-region deployment for fault tolerance and global reach
🐍 Enhancements to PyFlink for production-grade performance
🔄 Operational lessons on monitoring, scaling, and cost control
🚀 Future roadmap for advanced streaming capabilities and self-service pipelines

It's a perfect example of how streaming data architectures are evolving beyond traditional ETL, becoming the backbone for intelligent, real-time systems. If you work with data engineering, cloud platforms, or AI infrastructure, this session is a must-watch.

👉 Watch here: https://lnkd.in/gbQWJzz7

#OpenAI #ApacheFlink #ApacheKafka #StreamProcessing #DataEngineering #CloudArchitecture #RealTimeData #BigData #Confluent #Current2025
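The exactly-once takeaway rests on Kafka's transactional protocol, which Flink's Kafka connector drives internally. As a rough illustration of that primitive only (not OpenAI's or Flink's actual code), here is a bare consume-transform-produce loop in Go using the confluent-kafka-go library; the broker address, topics, group id, and transactional id are all made up.

```go
// Sketch of the transactional consume-transform-produce loop behind
// exactly-once Kafka delivery. Illustrative names throughout.
package main

import (
	"context"
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	ctx := context.Background()

	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092",
		"group.id":           "etl",
		"isolation.level":    "read_committed", // only see committed transactions
		"enable.auto.commit": false,            // offsets ride inside the transaction
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	c.SubscribeTopics([]string{"events.in"}, nil)

	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"transactional.id":  "etl-worker-1", // stable id fences zombie producers
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()
	if err := p.InitTransactions(ctx); err != nil {
		log.Fatal(err)
	}

	outTopic := "events.out"
	for {
		msg, err := c.ReadMessage(-1)
		if err != nil {
			continue
		}
		p.BeginTransaction()
		p.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &outTopic, Partition: kafka.PartitionAny},
			Value:          msg.Value, // transform step elided
		}, nil)

		// Commit the consumed offset inside the same transaction, so the
		// read and the write become atomic: both happen, or neither does.
		meta, _ := c.GetConsumerGroupMetadata()
		next := msg.TopicPartition
		next.Offset++
		p.SendOffsetsToTransaction(ctx, []kafka.TopicPartition{next}, meta)

		if err := p.CommitTransaction(ctx); err != nil {
			p.AbortTransaction(ctx) // consumer re-reads; output is never visible
		}
	}
}
```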
Why are we still stuffing state into embedded RocksDB for Kafka Streams in 2025? 🤔

Responsive just walked through a real migration: replacing #RocksDB with ScyllaDB as the remote state store, turning stateful stream processors into stateless, autoscalable services. The result: faster recovery, simpler ops, and the freedom to scale compute and state independently.

Key provocations from the talk:
- "Kafka-only" is a legacy constraint. Decouple state from compute and your architecture (and on-call) breathes again.
- Availability beats rebuilds. Stop replaying giant changelogs to warm up local stores.
- Zombies are real. Fence them with epochs + lightweight transactions (sketched below), not wishful thinking.
- Quorum isn't a "nice to have." Consistency levels matter (and defaults can bite).
- Bigger nodes ≠ waste. Right-sizing ScyllaDB nodes can reduce scaling churn and boost throughput.

If you're running Kafka Streams at any meaningful scale, this is a mindset shift: state belongs in a distributed database purpose-built for low latency, not glued to your JVM.

🎥 Dive into the architecture deep-dive, lessons learned, and practical tips here: https://lnkd.in/gvfRvdv5

#Kafka #KafkaStreams #ScyllaDB #StreamingData #EventDriven #RocksDB #NoSQL #CassandraCompatible #CloudNative #DataEngineering #LowLatency #Throughput
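To make the zombie-fencing bullet concrete: below is a minimal Go sketch of epoch-based fencing with a Scylla/Cassandra lightweight transaction, via the gocql driver. The table schema, names, and addresses are hypothetical (and rows are assumed to be initialized when a processor registers); the idea is that a write only applies while the writer's epoch is still the newest one.

```go
// Minimal sketch of epoch-based zombie fencing with a ScyllaDB LWT (gocql).
// Hypothetical schema, assumed pre-initialized per key:
//   CREATE TABLE kv.state (key text PRIMARY KEY, value blob, epoch bigint);
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

// putIfCurrent writes value only if this writer's epoch is still current.
// A fenced ("zombie") processor holding a stale epoch gets applied == false
// instead of silently clobbering newer state.
func putIfCurrent(s *gocql.Session, key string, value []byte, epoch int64) (bool, error) {
	var prevEpoch int64
	applied, err := s.Query(
		`UPDATE kv.state SET value = ?, epoch = ? WHERE key = ? IF epoch <= ?`,
		value, epoch, key, epoch,
	).SerialConsistency(gocql.Serial).ScanCAS(&prevEpoch)
	if err != nil {
		return false, err
	}
	if !applied {
		return false, fmt.Errorf("fenced: row is at epoch %d, ours is %d", prevEpoch, epoch)
	}
	return true, nil
}

func main() {
	cluster := gocql.NewCluster("scylla-1.example:9042") // hypothetical address
	cluster.Keyspace = "kv"
	cluster.Consistency = gocql.Quorum // "quorum isn't a nice to have"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	if _, err := putIfCurrent(session, "order-42", []byte(`{"total":99}`), 7); err != nil {
		log.Println(err) // stale epoch: stop processing and re-join with a new epoch
	}
}
```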
Sharing my Kafka journey 🚀

As I explored Kafka's ecosystem, I was impressed by how it transforms log analysis 📋, powers data-driven recommendations 💡, enhances real-time monitoring and alerting 🚨, and streamlines change data capture and migration 🔄. Integrated with Flink, Elastic, and Kibana, Kafka helps your microservices and databases deliver high-quality, actionable insights to analytics and machine learning tasks 🤖.

Highly recommend Kafka for building scalable, resilient, real-time enterprise solutions that drive innovation and reliability 💯. If you want to unlock robust event streaming, seamless migrations, or sophisticated monitoring, give Kafka a shot!

#Kafka #Flink #BigData #DataEngineering #Microservices #SystemDesign #RealTimeAnalytics
🚀 From Phantom Messages to Durable Event Processing

I recently worked with a client whose aging monolith faced a frustrating problem: messages were being broadcast before they were safely persisted in SQL.
👉 On a system crash, those "shadow" messages vanished.
👉 Users saw phantom notifications that disappeared without a trace.
👉 Reliability and trust were at risk.

🔄 The Transition
As the #softwarearchitect I re-architected the flow around durable event streaming with Kafka:
- Kafka as the backbone → every message is durably logged, ordered, and replayable.
- SQL + sharding → guarantees persistence and scalability of the system of record.
- Parallel broadcasting → services like logging, finance, and analytics consume the same event stream in real time.
- Redis caching → accelerates reads and reduces load on the database.
- Cellular architecture → each zone runs as an independent copy, isolating failures and improving resilience.

✅ The Result
No more phantom messages. Durable, ordered, replayable events. A foundation that scales with business growth. Clear separation of concerns: persistence, caching, and broadcasting all aligned.

✅ Lesson learned: durability isn't just a technical detail; it's the difference between a system users trust and one they don't. By moving from a monolith with fragile messaging to a streaming-first architecture, we built reliability into the core of the platform.

#DotNet #SoftwareArchitecture #EventDrivenArchitecture #Kafka #EventStreaming #Microservices #CloudArchitecture #DistributedSystems #Architecture #CloudNative #DevOps #Observability
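The core of the fix is ordering: commit to the system of record first, broadcast second. The client's stack is .NET; as a language-neutral illustration, here is a minimal Go sketch of that sequence with database/sql and the segmentio/kafka-go writer. The table, topic, and connection strings are made up, and full acks keep the Kafka side as durable as the SQL side.

```go
// Sketch of persist-first, broadcast-second (the fix for phantom messages).
// Table, topic, and DSNs are hypothetical.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq"
	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	db, err := sql.Open("postgres", "postgres://app@db.example/orders?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	w := &kafka.Writer{
		Addr:         kafka.TCP("kafka-1.example:9092"),
		Topic:        "notifications",
		RequiredAcks: kafka.RequireAll, // wait for full replication before reporting success
	}
	defer w.Close()

	payload := []byte(`{"user":42,"text":"order shipped"}`)

	// 1) Persist to the system of record and commit.
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO messages (payload) VALUES ($1)`, payload); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}

	// 2) Only now broadcast. A crash between 1 and 2 leaves a persisted but
	// unsent row that a retry sweep can re-publish; the reverse order is
	// exactly what produced phantom notifications.
	if err := w.WriteMessages(ctx, kafka.Message{Value: payload}); err != nil {
		log.Println("publish failed, left for retry sweep:", err)
	}
}
```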
Powering Real-Time Data Pipelines: How Apache Kafka Keeps Your Data Flowing

At KLogic, we simplify how real-time data moves, transforms, and delivers insights across systems. Here's a quick breakdown of how Kafka powers modern data pipelines, from event ingestion to consumption.

🔸 Data Ingestion
This is where it all begins. Producers publish events to Kafka topics, batching messages efficiently for high throughput. With replication and fault tolerance, data remains durable and available, even during failures.

🔸 Stream Processing
Kafka brokers handle partitions and offsets, ensuring scalable, parallel processing. Stream processors (like Kafka Streams or Flink) transform data in motion, aggregating, filtering, or enriching it in real time.

🔸 Data Consumption
Consumers subscribe to topics, pulling data as needed. With load balancing and consumer groups, Kafka ensures seamless scalability and ordered delivery within each partition, driving real-time insights and system integration.

Why it matters: Kafka isn't just about message streaming; it's about building resilient, event-driven architectures that keep your data flowing instantly and reliably.

Learn More: https://klogic.io/

#ApacheKafka #DataEngineering #StreamingData #EventDrivenArchitecture #RealTimeAnalytics #KLogic
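To make the ingestion and consumption ends concrete, here is a bare-bones Go sketch using segmentio/kafka-go: a producer keying events so each key lands on one partition (preserving per-key order), and a consumer-group reader pulling them back. Broker address, topic, and group id are illustrative.

```go
// Bare-bones Kafka produce/consume pair in Go (segmentio/kafka-go).
// Broker, topic, and group id are illustrative.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Ingestion: keyed messages hash to a partition, preserving per-key order.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "clicks",
		Balancer: &kafka.Hash{},
	}
	err := w.WriteMessages(ctx,
		kafka.Message{Key: []byte("user-1"), Value: []byte(`{"page":"/home"}`)},
		kafka.Message{Key: []byte("user-1"), Value: []byte(`{"page":"/cart"}`)},
	)
	if err != nil {
		log.Fatal(err)
	}
	w.Close()

	// Consumption: readers sharing a GroupID split partitions among
	// themselves, which is how Kafka load-balances a consumer group.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "analytics",
		Topic:   "clicks",
	})
	defer r.Close()

	for i := 0; i < 2; i++ {
		m, err := r.ReadMessage(ctx) // commits offsets for the group as it reads
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("partition %d offset %d: %s\n", m.Partition, m.Offset, m.Value)
	}
}
```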
Powering Real-Time Data Pipelines with Apache Kafka Ever wondered how data flows seamlessly across systems, from the moment it’s created to when insights hit your dashboard? At KLogic, we break down how Apache Kafka keeps data moving in real time, ensuring reliability, scalability, and fault tolerance at every step. Check out how ingestion, stream processing, and consumption come together to build the backbone of modern, event-driven architectures. #ApacheKafka #DataEngineering #RealTimeData #EventDrivenArchitecture #StreamingData #BigData #KLogic
It is Monday and we continue with Data Streaming! Last week we talked about #KafkaConnect, but #Pulsar has its own, arguably simpler solution for data engineers: #PulsarIO. 🚀

Apache Pulsar IO streamlines data integration using its built-in connector framework. Not only does it have a rich library of connectors (including CDC) and simple deployment, but it also provides crucial processing guarantees.

#ApachePulsar #PulsarIO #DataIntegration #StreamingData #RealTimeData #DataEngineering #EventDrivenArchitecture #MessageQueue #CDC

Full-size image is available here: https://lnkd.in/eaubqMxV

Part 1: Use-cases - https://lnkd.in/eKGk2CFc
Part 2: Protocols - https://lnkd.in/eqFU3hVA
Part 3: Kafka broker decision tree - https://lnkd.in/eJpPGCVd
Part 4: Message & Headers - https://lnkd.in/eSWDsAqB
Part 5: Message Formats - https://lnkd.in/eRYUFt3D
Part 6: Schema Registry - https://lnkd.in/efz2tpeX
Part 7: KRaft - https://lnkd.in/ePzJ4EUm
Part 8: Zookeeper @ Pulsar - https://lnkd.in/ewFDJrTV
Part 9: Kafka Connect - https://lnkd.in/egwxjqr4
Today, I revisited the 2011 paper, "Kafka: a Distributed Messaging System for Log Processing," by Jay Kreps, Neha Narkhede, and Jun Rao, which laid the foundation for Apache Kafka.

What stood out to me is how LinkedIn's engineering team challenged the assumptions of traditional messaging systems:

* Logs as the core abstraction: Kafka treats all messages as an immutable, append-only log. This enables constant-time disk writes and sequential reads, unlike queue-based systems that struggled with scaling consumers.
* Consumer-controlled state: instead of brokers tracking which messages each consumer has read, Kafka shifts this responsibility to the consumers using offsets (see the sketch below). This drastically reduces broker complexity and allows multiple independent consumers to process the same stream efficiently.
* Partitioning & replication: Kafka introduced partitioned topics for horizontal scalability and replication for durability, pioneering the concept of a distributed commit log.
* Throughput over delivery guarantees: Kafka prioritized high throughput and durability over complex delivery semantics. That trade-off made it a perfect backbone for both real-time stream processing and offline analytics pipelines.

Think of Kafka as the central nervous system of modern data infrastructure; every event, click, and transaction flows through it in real time. It's fascinating that a solution to LinkedIn's log aggregation challenge evolved into the event streaming platform that powers most data-driven architectures today.

Highly recommend reading this paper if you're passionate about distributed systems, data pipelines, or system design trade-offs; it's a masterclass in engineering simplicity at scale.

#Kafka #DistributedSystems #Streaming #BigData #SystemDesign #EventDrivenArchitecture
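Consumer-controlled offsets are easy to see in code. In this minimal Go sketch (segmentio/kafka-go, reading a single partition with no consumer group), the reader itself decides where to start in the log, which is exactly the responsibility the paper moved out of the broker. Broker address and topic are placeholders.

```go
// Consumer-controlled state in practice: without a consumer group, the
// reader owns its own position in the partition's append-only log.
// Broker and topic are placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:   []string{"localhost:9092"},
		Topic:     "pageviews",
		Partition: 0, // direct partition read, no group coordination
	})
	defer r.Close()

	// The broker keeps no per-consumer cursor; we simply seek. Re-running
	// this from offset 0 replays the stream, which is what makes multiple
	// independent consumers (and reprocessing) cheap.
	if err := r.SetOffset(0); err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 10; i++ {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		// Persisting m.Offset somewhere of our choosing *is* the consumer's
		// state; Kafka's own __consumer_offsets topic is just one such place.
		fmt.Printf("offset %d: %s\n", m.Offset, m.Value)
	}
}
```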
Keeping data in sync across services is always harder than it looks. Change Data Capture (CDC) with tools like Debezium helps by streaming database changes into Kafka. But the tricky part is what comes next: how do you consume those events without losing consistency or overwhelming your system?

I've found Go really useful here. With goroutines and solid Kafka libraries, you can:
- Push through a heavy stream of CDC events
- Handle retries without duplicating work
- Keep downstream services or caches up to date, even when the database is busy

A simple example: capture all orders table updates with Debezium -> push them into Kafka -> process them with a Go consumer to keep Redis or another service in sync (see the sketch below). No polling, no messy cron jobs, and everything stays reactive.

Curious how others handle this: do you prefer at-least-once delivery with idempotent consumers, or have you gone down the path of exactly-once?

#golang #kafka #debezium #cdc #eventdriven #microservices #backend #softwareengineering
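In that spirit, here is a minimal at-least-once Go consumer with an idempotency guard: segmentio/kafka-go reading the Debezium topic, and go-redis SETNX to skip events that were already applied. The topic follows Debezium's usual server.schema.table naming convention, but every name, address, and the payload handling here is illustrative.

```go
// At-least-once CDC consumption with an idempotent guard.
// All names, addresses, and the payload shape are illustrative.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "orders-cache-sync",
		Topic:   "dbserver1.inventory.orders", // Debezium server.schema.table
	})
	defer r.Close()

	for {
		m, err := r.FetchMessage(ctx) // fetch without committing yet
		if err != nil {
			log.Fatal(err)
		}

		// Idempotency guard: partition+offset uniquely identifies the event,
		// so a redelivery after a crash is detected and skipped.
		seenKey := fmt.Sprintf("cdc:seen:%d:%d", m.Partition, m.Offset)
		fresh, err := rdb.SetNX(ctx, seenKey, 1, 24*time.Hour).Result()
		if err != nil {
			log.Println("redis unavailable, retrying:", err)
			continue // don't commit; the message will be redelivered
		}
		if fresh {
			// Apply the change, e.g. refresh the cached order row.
			rdb.Set(ctx, "order:"+string(m.Key), m.Value, 0)
		}

		// Commit only after the effect is durable: at-least-once delivery plus
		// an idempotent handler behaves as effectively-once.
		if err := r.CommitMessages(ctx, m); err != nil {
			log.Println("commit failed:", err)
		}
	}
}
```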
🚨 System Design Interview Question (Amazon)

Q: Design a distributed logging system that ingests onboarding events at scale, validates and transforms them asynchronously, provides real-time aggregation, stores queryable metadata, archives raw data, and exposes APIs for querying.

A (sneak peek):
- Kafka for durable ingestion
- Spark Streaming for validation + aggregation
- Cassandra for fast metadata queries
- S3 for long-term archival
- Monitoring with Prometheus + Grafana

Full breakdown 👉 https://lnkd.in/eKjNCQSU