Reference Architecture for Stream Processing in Hadoop

With the advent of Hadoop 2, the breadth of use cases that can leverage the Hadoop ecosystem has increased. Running analytics on a real-time stream is a common use case we have been hearing from customers, and variations of this reference architecture can be used to implement it. The sources of such a real-time stream can include IoT sensor events, log files, social media events, clickstream events, and so on.

In this picture, S1, S2, …, S6 are sensors emitting events. Consolidators are components that aggregate these events; they can act as Apache Flume sources, as Apache Kafka producers, or as event sources that feed Storm spouts.
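
As a minimal sketch of the Kafka-producer role, assuming a broker at localhost:9092 and a hypothetical sensor-events topic, a consolidator could publish events roughly like this:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker list and topic name below are illustrative assumptions.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");  // wait for full acknowledgement for durability
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by sensor id so events from one sensor land in one partition and stay ordered.
            producer.send(new ProducerRecord<>("sensor-events", "S1",
                    "{\"sensorId\":\"S1\",\"temp\":21.4,\"ts\":1421000000}"));
        }
    }
}
```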

Depending on the use case, a subset of this reference architecture can be used. Here are some considerations:

  • Unless you need complex parallel processing, Flume interceptors can handle the modification, dropping, or other processing of events in flight (a minimal interceptor sketch follows this list).
  • If a Flume source already exists for your data, it may be a better entry point than writing a custom Kafka producer.
  • Flume sources support both pull and push models.
  • Kafka producers use a push model.
  • Use Kafka for high availability (HA) of events during processing. Kafka can serve as both a Flume source and a Flume sink, and it relies on ZooKeeper for its HA semantics.
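
To illustrate the interceptor point above, here is a minimal sketch of a custom Flume interceptor that drops events below an assumed temperature threshold; the class name, header name, and threshold are illustrative, not part of the original architecture:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/** Drops events whose (assumed) "temp" header falls below a threshold; passes the rest through. */
public class TemperatureFilterInterceptor implements Interceptor {

    private static final double THRESHOLD = 0.0;  // illustrative threshold

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String temp = event.getHeaders().get("temp");
        if (temp == null) {
            return event;  // pass through events without the header
        }
        return Double.parseDouble(temp) >= THRESHOLD ? event : null;  // returning null drops the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<>(events.size());
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    @Override
    public void close() { }

    /** Flume instantiates interceptors through a Builder referenced from the agent configuration. */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TemperatureFilterInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}
```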

Cloudera recently implemented Flume-Kafka integration, so Flume now has built-in Kafka sources and Kafka sinks. Cloudera also implemented a Kafka channel for Flume, which improves the high availability of the channel.

Flume events can be persisted through built-in sinks such as HDFS, Avro, Thrift, HBase, Solr, and Elasticsearch. Custom sinks can be written to persist events to other stores, such as NoSQL or relational databases (RDBs). Hortonworks also provides built-in connectors for Storm bolts that stream data to HDFS or HBase (a sketch of the HDFS bolt follows).
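
As a rough sketch of the HDFS connector, a storm-hdfs HdfsBolt can be configured as below; the NameNode URL, output path, and rotation settings are assumptions, and exact package names can vary by Storm release:

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsSinkBoltFactory {

    public static HdfsBolt build() {
        // Write pipe-delimited tuples, sync to HDFS every 1000 tuples, and roll files at 5 MB.
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);
        FileNameFormat fileNameFormat = new DefaultFileNameFormat()
                .withPath("/streaming/sensor-events/");        // illustrative output path

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode.example.com:8020") // assumed NameNode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}
```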

If applications need exactly-once semantics for Storm tuples, they can leverage the Trident framework (sketched below).
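
A minimal Trident sketch, counting events per sensor with exactly-once state updates, might look like the following; the in-memory spout and field names are illustrative (a production topology would use a transactional Kafka spout), and the package names shown are for Storm 1.x:

```java
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ExactlyOnceSensorCounts {

    public static TridentTopology build() {
        // Illustrative in-memory spout emitting sensor ids in small batches.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sensorId"), 3,
                new Values("S1"), new Values("S2"), new Values("S1"));
        spout.setCycle(false);

        TridentTopology topology = new TridentTopology();
        // persistentAggregate combined with a transactional spout is what yields
        // exactly-once semantics for the state updates.
        TridentState counts = topology.newStream("sensor-stream", spout)
                .groupBy(new Fields("sensorId"))
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(),
                                     new Fields("count"));
        return topology;
    }
}
```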

Web applications can provide real-time visualization of these events from the NoSQL and relational stores. Business analysts can use their BI tools to analyze the data with Hive or Impala on top of HDFS/HBase.
