6 Important Concepts in Stream Processing
Enterprises today store and analyze every bit of data they can find: mouse clicks, page hovers, data changes, intermediate results of computation. All that data is collected and sent over large data pipes--termed streams--to computing hubs, arriving in high volume and great variety. The data is of immense value both when analyzed in toto and when analyzed in real time. The former is big data analytics; the latter, stream processing.
To understand stream processing, let us look at a potential application. Consider the problem of traffic congestion detection. Let us say we are collecting data from smart vehicles and smartphones: GPS coordinates along with metadata such as device identifier, timestamp, and so on. The incoming data is of huge volume, arriving from millions of end points. To find congestion we need to do the following:
- collate GPS coordinates from the same end point
- compare with earlier coordinates from the same vehicle
- identify stationary vehicles
- detect 'clumps' where many vehicles are stationary or slowly moving
- overlay with maps to eliminate parking lots
- further process to ensure detected congestion actually lies on roads
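The first few steps above can be sketched in code. This is a minimal, illustrative Python sketch, not a production algorithm: the function names, thresholds, and the crude planar distance approximation are all assumptions made for the example.

```python
import math

# Hypothetical thresholds; real values depend on GPS accuracy and road data.
STATIONARY_METERS = 20   # movement below this counts as "stationary"
CLUMP_SIZE = 3           # this many stationary vehicles nearby suggests congestion

def meters_between(a, b):
    """Rough planar distance in meters between two (lat, lon) points."""
    dlat = (a[0] - b[0]) * 111_000
    dlon = (a[1] - b[1]) * 111_000 * math.cos(math.radians(a[0]))
    return math.hypot(dlat, dlon)

def find_stationary(events):
    """Collate readings per device; keep devices whose latest reading
    barely moved from the previous one."""
    last_seen = {}
    stationary = {}
    for device_id, lat, lon in events:
        if device_id in last_seen:
            if meters_between(last_seen[device_id], (lat, lon)) < STATIONARY_METERS:
                stationary[device_id] = (lat, lon)
            else:
                stationary.pop(device_id, None)
        last_seen[device_id] = (lat, lon)
    return stationary

def detect_clumps(stationary, radius_m=100):
    """Group stationary vehicles that sit within radius_m of each other."""
    clumps = []
    for pos in stationary.values():
        nearby = {d for d, p in stationary.items()
                  if meters_between(pos, p) <= radius_m}
        if len(nearby) >= CLUMP_SIZE and nearby not in clumps:
            clumps.append(nearby)
    return clumps
```

Overlaying the resulting clumps with map data (to exclude parking lots and off-road locations) would be the remaining steps.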
Over the last few decades, quite a few applications of a similar nature--body sensor data processing, market analysis, rendering of graphical models, telephone record analysis--have proven very insightful. Stream processing has now taken the next step, applying this concept across the enterprise to gain insights. A great deal of engineering and mathematical work on stream processing is underway today. This blog attempts to capture some of the concepts to keep in mind before enterprises architect new stream processing applications. These concepts are completely different from traditional application architecture techniques.
1. Processing Data
It is important to process messages as they arrive, keep processing latency low, and keep the backlog of messages to a minimum. Otherwise, the incessant arrival of new messages makes catching up extremely difficult, and delays can quickly spiral out of control.
- Avoid expensive external calls and dependencies: SOAP web service calls, database lookups, or any other I/O operations from within a stream processing application.
- Avoid reprocessing historic events. If the result for the current message in the stream depends on earlier messages, do not reprocess those earlier messages. It is wiser to compute and cache intermediate results, so that new values can simply be combined with the cached results.
- Pull data immediately as it becomes available. An application that periodically polls for data is highly wasteful: data could become stale, message carriers might get flooded, and many other problems could follow.
- Avoid complex, time-consuming calculations as part of message processing. "In general, algorithms operating on streams will be restricted to fairly simple calculations because of time and space constraints." (O'Callaghan et al.)
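The point about caching intermediate results can be illustrated with a running average: keeping a count and a sum means each new message combines with cached state in O(1), rather than replaying the whole stream. This is a minimal sketch; the class name is an assumption for the example.

```python
class RunningAverage:
    """Caches intermediate results (count and sum) so each new reading
    is combined with cached state instead of reprocessing earlier events."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new stream value into the cached aggregates."""
        self.count += 1
        self.total += value
        return self.total / self.count
```

The same pattern extends to any decomposable aggregate (min, max, variance via running moments), which is exactly the kind of "fairly simple calculation" streams favor.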
2. Partitioning Data
Partition the incoming stream, or divide it into sub-streams, separating like and combinable data.
Group 'like' data--for instance, data from the same geographic region in a traffic application, or a particular cell/screen subsection in a graphics application--and process each group separately from 'unlike' data.
Group combinable data. Streams need data aggregation: for instance, 'rolling up' data for analysis, comparison, and so on. Pre-aggregate data from separate sources to avoid real-time aggregation across large data sets. Hence it is important to separate streams into independent sub-streams of data. Splitting a stream into combinable sub-streams effectively allows parallel execution on the data.
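A simple way to see partitioning is a key-based split: route every event to a sub-stream determined by a key function (region, device, screen cell), so each sub-stream can be processed independently and in parallel. This is a minimal sketch; the function and key names are assumptions for the example.

```python
from collections import defaultdict

def partition(events, key_fn):
    """Split one stream into independent sub-streams keyed by key_fn,
    so 'like' data is grouped together and each group can be
    processed in parallel, separately from 'unlike' data."""
    substreams = defaultdict(list)
    for event in events:
        substreams[key_fn(event)].append(event)
    return substreams
```

In a real deployment the same idea appears as hash-partitioning across workers: `hash(key) % num_workers` guarantees that all events for one key always land on the same processing node.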
3. Handling Data
3.1 Mercilessly discard unnecessary data.
Data filtering is an important task in handling streams: data is oftentimes bad. Data from analog systems, such as sensors, can be atrociously dirty. Such data needs correction, or outright discarding if it falls outside allowable limits.
Design algorithms to judiciously disregard, correct, or consume data from the stream. If it is safer to simply ignore a data point and wait for the next event, rather than spend effort correcting it, do so.
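The discard-and-wait policy can be expressed as a small filter stage: readings outside the allowable limits are simply dropped, on the assumption that the next valid event will arrive shortly. A minimal sketch, with the function name and limits chosen for the example:

```python
def clean(readings, low, high):
    """Yield only readings within allowable limits; out-of-range data
    is judiciously discarded rather than corrected, on the assumption
    that the next valid event will supersede it."""
    for value in readings:
        if low <= value <= high:
            yield value
```

Because it is a generator, the filter composes naturally into a pipeline and never buffers the stream.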
3.2 Also, be ready to accept seemingly unnecessary data.
Though messages are expected to arrive chronologically in streams, it is possible to receive out-of-sequence events. Equip algorithms and data stores to handle out-of-sequence data. For instance, if you are using sliding-window calculations, keep the data open for editing for a short while after the processing window has closed. This allows recovery and update of data that is effectively 'out-of-stream'.
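The idea of keeping a window editable after it closes is often called a grace period. Below is a minimal Python sketch of a tumbling window with such a grace period; the class name, time units, and API are assumptions for illustration, not any particular framework's interface.

```python
class GracefulWindow:
    """Tumbling window that stays open for editing for `grace` time units
    after the window closes, so slightly out-of-sequence events
    are still counted instead of being lost."""

    def __init__(self, size, grace):
        self.size = size
        self.grace = grace
        self.buckets = {}  # window start time -> list of values

    def add(self, event_time, value, now):
        """Place value into its window; accept it if the window is
        still open or within the grace period, else drop it."""
        start = (event_time // self.size) * self.size
        window_end = start + self.size
        if now <= window_end + self.grace:  # window still editable
            self.buckets.setdefault(start, []).append(value)
            return True
        return False  # too late even for the grace period: dropped
```

The trade-off is between completeness and latency: a longer grace period recovers more out-of-stream data but delays the point at which a window's result can be finalized.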
4. Choosing Data Stores
The λ architecture suggests that most modern-day enterprises have three types of data and processing requirements. Transactional systems continue to have solid ACID requirements (though gradually relaxing toward eventual consistency), which call for robust RDBMS-type transaction stores. Big data analytics needs to store each and every type of data, which calls for HDFS-type stores. The third type is stream processing, which takes a cursory look at each and every piece of data to gain real-time insights.
Stream processing applications require sub-millisecond response times for data reads and writes. Choose in-memory databases, clustered data grids, etc., and cushion them with a good cache to handle 'Internet-speed' data. It is important to choose the data store wisely: for many use-cases, the largest impact on performance will come from this part of stream processing.
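The "cushion with a good cache" advice amounts to a read-through cache in front of the slower backing store, so hot keys are served from memory. A minimal sketch, assuming a caller-supplied `load_fn` that reads from the real store; the class name and TTL policy are illustrative choices, not a specific product's API.

```python
import time

class TTLCache:
    """Minimal read-through cache that cushions a slower backing store;
    repeat reads within the TTL are served from the in-memory dict."""

    def __init__(self, load_fn, ttl_seconds=1.0):
        self.load_fn = load_fn          # reads from the real data store
        self.ttl = ttl_seconds
        self.entries = {}               # key -> (value, expiry time)

    def get(self, key):
        value, expiry = self.entries.get(key, (None, 0.0))
        if time.monotonic() < expiry:
            return value                # fast path: in-memory hit
        value = self.load_fn(key)       # slow path: fall through to the store
        self.entries[key] = (value, time.monotonic() + self.ttl)
        return value
```

Even a short TTL can absorb most of the read traffic for hot keys, which is often where the sub-millisecond budget is won or lost.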
5. Choosing Underlying Frameworks
There is a multitude of products and frameworks available for stream processing. Most of these frameworks offer raw event-processing functionality (Storm, Samza, Borealis, etc.), constructs based on an SPL--a stream processing language--(IBM Streams, TIBCO StreamBase, etc.), or both (Storm+Trident).
Thoroughly evaluate products before choosing a framework. Before choosing, ask questions such as:
- "what is my throughput requirement, in terms of events per second?"
- "are my use-cases mostly orchestration of processing elements?"
- "do I need to build processing units, and wire them differently based on use-case?"
- "are all my use-cases (current and future) covered by one single SPL?"
6. Testing and Debugging
Debugging stream processing applications that process large amounts of data at high rates in a distributed environment is extremely hard. Typically, debugging needs to address multiple facets: application semantics, code, deployment details, performance metrics, etc. (Gedik et al.)
Debugging stream processing requires building or buying some key features into the application framework. The framework should have the ability to:
- run the solution as-is on a test bench, without rebuilding the application
- suspend processing units when events arrive
- capture, inspect and change the events, also inject new events
- trace the flow of an event in a separate log/visual cue
- detect anomalies: stragglers, faulty processing nodes, poison events, etc.
Most SPL frameworks, such as IBM Streams and TIBCO StreamBase, inherently contain this functionality. Non-SPL frameworks need a different approach, such as terser unit tests, controlled runs on embedded instances, etc. Evaluate this tradeoff early, pick wisely, and design (as far as possible) agnostically.
Notes and References
Big data reflects organizations' dream of storing every type of data they encounter; stream processing is the wish to process all that data and infer as much information as possible in real time or near-real time. This makes stream processing applications different from traditional OLTP or event-driven SOA applications, and the concepts for architecting them are different too. The six concepts above are some of the preliminary ones engineers and architects should remember before building new stream processing systems.
The design of the Borealis and Aurora frameworks provides a good introduction to how stream processing frameworks are conceived and built. There are some historically significant papers, such as those on scalable stream processing and the eight rules of stream processing, that provide good information on designing stream processing applications. A few papers, such as this survey of stream processing frameworks, contain good comparisons of the various frameworks.
Another aspect is enrichment using additional data sources, which was mentioned above as an activity best avoided. However, in some circumstances all the data needed to make a decision may not be available within the event itself. This leads to a trade-off between the richness of the decision-making process and throughput.
Some more questions to ask before choosing a framework:
- Do I need reliable state management?
- Do I need windowing capabilities?
- With at-least-once delivery semantics, are my data store updates idempotent?
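The idempotency question matters because, under at-least-once delivery, the same event may be applied twice. One common remedy is to track processed event ids so a redelivered event is a no-op. A minimal sketch, with hypothetical names, illustrating the idea rather than any particular store's API:

```python
class IdempotentStore:
    """Under at-least-once delivery an event may arrive twice;
    remembering processed event ids keeps updates idempotent."""

    def __init__(self):
        self.counts = {}   # key -> running total
        self.seen = set()  # event ids already applied

    def apply(self, event_id, key, amount):
        """Apply the update once; silently ignore duplicate deliveries."""
        if event_id in self.seen:
            return False               # duplicate: no double counting
        self.seen.add(event_id)
        self.counts[key] = self.counts.get(key, 0) + amount
        return True
```

In practice the seen-id set itself needs bounding (e.g. per-window or with a TTL), which is part of the "reliable state management" question above.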