Reference Architecture for Stream Processing in Hadoop

With the advent of Hadoop 2, the breadth of use cases that can leverage the Hadoop ecosystem has increased. Running analytics on a real-time stream is a common use case we have been hearing from customers, and variations of this reference architecture can be used to implement it. The sources of such a real-time stream can include IoT sensor events, log files, social media events, clickstream events, and so on.

In this picture, S1, S2, …, S6 are sensors emitting events. Consolidators are components that aggregate these events; they can act as Apache Flume sources, as Apache Kafka producers, or as event sources that feed Storm spouts.
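
As a minimal sketch of the Kafka-producer role, assuming a broker at localhost:9092 and a hypothetical sensor-events topic, a consolidator could publish events roughly like this:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker list and topic name below are illustrative assumptions.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");  // wait for full acknowledgement for durability
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by sensor id so events from one sensor land in one partition and stay ordered.
            producer.send(new ProducerRecord<>("sensor-events", "S1",
                    "{\"sensorId\":\"S1\",\"temp\":21.4,\"ts\":1421000000}"));
        }
    }
}
```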

Depending on the use case, a subset of this reference architecture can be used. Here are some considerations:

  • Unless you need complex parallel processing, Flume interceptors can handle the modification, dropping, or other processing of events in flight (a minimal interceptor sketch follows this list).
  • If a Flume source already exists for your data, it may be a better entry point than writing a custom Kafka producer.
  • Flume sources support both pull and push models.
  • Kafka producers use a push model.
  • Use Kafka for high availability (HA) of events during processing. Kafka can serve as both a Flume source and a Flume sink, and it relies on ZooKeeper for its HA semantics.
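
To illustrate the interceptor point above, here is a minimal sketch of a custom Flume interceptor that drops events below an assumed temperature threshold; the class name, header name, and threshold are illustrative, not part of the original architecture:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/** Drops events whose (assumed) "temp" header falls below a threshold; passes the rest through. */
public class TemperatureFilterInterceptor implements Interceptor {

    private static final double THRESHOLD = 0.0;  // illustrative threshold

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String temp = event.getHeaders().get("temp");
        if (temp == null) {
            return event;  // pass through events without the header
        }
        return Double.parseDouble(temp) >= THRESHOLD ? event : null;  // returning null drops the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<>(events.size());
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    @Override
    public void close() { }

    /** Flume instantiates interceptors through a Builder referenced from the agent configuration. */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TemperatureFilterInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}
```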

Cloudera recently implemented Flume-Kafka integration, so Flume now has built-in Kafka sources and Kafka sinks. Cloudera also implemented a Kafka channel for Flume, which improves the high availability of the channel.

Flume events can be persisted through built-in sinks such as HDFS, Avro, Thrift, HBase, Solr, and Elasticsearch. Custom sinks can be written to persist events to other stores, such as NoSQL or relational databases (RDBs). Hortonworks also provides built-in connectors for Storm bolts that stream data to HDFS or HBase (a sketch of the HDFS bolt follows).
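
As a rough sketch of the HDFS connector, a storm-hdfs HdfsBolt can be configured as below; the NameNode URL, output path, and rotation settings are assumptions, and exact package names can vary by Storm release:

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsSinkBoltFactory {

    public static HdfsBolt build() {
        // Write pipe-delimited tuples, sync to HDFS every 1000 tuples, and roll files at 5 MB.
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);
        FileNameFormat fileNameFormat = new DefaultFileNameFormat()
                .withPath("/streaming/sensor-events/");        // illustrative output path

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode.example.com:8020") // assumed NameNode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}
```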

If applications need exactly-once semantics for Storm tuples, they can leverage the Trident framework (sketched below).
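
A minimal Trident sketch, counting events per sensor with exactly-once state updates, might look like the following; the in-memory spout and field names are illustrative (a production topology would use a transactional Kafka spout), and the package names shown are for Storm 1.x:

```java
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ExactlyOnceSensorCounts {

    public static TridentTopology build() {
        // Illustrative in-memory spout emitting sensor ids in small batches.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sensorId"), 3,
                new Values("S1"), new Values("S2"), new Values("S1"));
        spout.setCycle(false);

        TridentTopology topology = new TridentTopology();
        // persistentAggregate combined with a transactional spout is what yields
        // exactly-once semantics for the state updates.
        TridentState counts = topology.newStream("sensor-stream", spout)
                .groupBy(new Fields("sensorId"))
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(),
                                     new Fields("count"));
        return topology;
    }
}
```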

Web applications can provide real-time visualization of these events from the NoSQL and relational stores. Business analysts can use their BI tools to analyze the data with Hive or Impala on top of HDFS/HBase.
