6 Important Concepts in Stream Processing
Enterprises today store and analyze every bit of data they can find: mouse clicks, page hovers, data changes, intermediate results of computation. All that data is collected and sent over large data pipes--termed streams--to computing hubs, arriving in high volume and great variety. The data is of immense value both when analyzed in toto and when analyzed in real time. The former is big data analytics; the latter, stream processing.
To understand stream processing, let us look at a potential application. Consider the problem of traffic congestion detection. Let us say we are collecting data from smart vehicles and smartphones: GPS coordinates along with metadata such as device identifier, timestamp, and so on. The incoming data is of huge volume, arriving from millions of end points. To find congestion we need to do the following:
- collate GPS coordinates from the same end point
- compare with earlier coordinates from the same vehicle
- identify stationary vehicles
- detect 'clumps' where many vehicles are stationary or slowly moving
- overlay with maps to eliminate parking lots
- further process to ensure detected congestion actually lies on roads
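The first few steps above can be sketched in code. This is a minimal, illustrative Python sketch, not a production algorithm: the function names, thresholds, and the crude planar distance approximation are all assumptions made for the example.

```python
import math

# Hypothetical thresholds; real values depend on GPS accuracy and road data.
STATIONARY_METERS = 20   # movement below this counts as "stationary"
CLUMP_SIZE = 3           # this many stationary vehicles nearby suggests congestion

def meters_between(a, b):
    """Rough planar distance in meters between two (lat, lon) points."""
    dlat = (a[0] - b[0]) * 111_000
    dlon = (a[1] - b[1]) * 111_000 * math.cos(math.radians(a[0]))
    return math.hypot(dlat, dlon)

def find_stationary(events):
    """Collate readings per device; keep devices whose latest reading
    barely moved from the previous one."""
    last_seen = {}
    stationary = {}
    for device_id, lat, lon in events:
        if device_id in last_seen:
            if meters_between(last_seen[device_id], (lat, lon)) < STATIONARY_METERS:
                stationary[device_id] = (lat, lon)
            else:
                stationary.pop(device_id, None)
        last_seen[device_id] = (lat, lon)
    return stationary

def detect_clumps(stationary, radius_m=100):
    """Group stationary vehicles that sit within radius_m of each other."""
    clumps = []
    for pos in stationary.values():
        nearby = {d for d, p in stationary.items()
                  if meters_between(pos, p) <= radius_m}
        if len(nearby) >= CLUMP_SIZE and nearby not in clumps:
            clumps.append(nearby)
    return clumps
```

Overlaying the resulting clumps with map data (to exclude parking lots and off-road locations) would be the remaining steps.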
Over the last few decades, quite a few applications of a similar nature--body sensor data processing, market analysis, rendering of graphical models, telephone record analysis--have proven very insightful. Stream processing has now taken the next step, applying this concept across the enterprise to gain insights. A great deal of engineering and mathematical work on stream processing is underway today. This blog attempts to capture some of the concepts to keep in mind before enterprises architect new stream processing applications. These concepts are completely different from traditional application architecture techniques.
1. Processing Data
It is important to process messages as they arrive, keep processing latency low, and keep the backlog of messages to a minimum. Otherwise, the incessant arrival of new messages makes catching up extremely difficult, and delays can quickly spiral out of control.
- Avoid expensive external calls and dependencies: SOAP web service calls, database lookups, or any other I/O operations from within a stream processing application.
- Avoid reprocessing historic events. If the result for the current message in the stream depends on earlier messages, do not reprocess those earlier messages. It is wiser to compute and cache intermediate results, so that new values can simply be combined with the cached results.
- Pull data immediately as it becomes available. An application that periodically polls for data is highly wasteful: data could become stale, message carriers might get flooded, and many other problems could follow.
- Avoid complex, time-consuming calculations as part of message processing. "In general, algorithms operating on streams will be restricted to fairly simple calculations because of time and space constraints." (O'Callaghan et al.)
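The point about caching intermediate results can be illustrated with a running average: keeping a count and a sum means each new message combines with cached state in O(1), rather than replaying the whole stream. This is a minimal sketch; the class name is an assumption for the example.

```python
class RunningAverage:
    """Caches intermediate results (count and sum) so each new reading
    is combined with cached state instead of reprocessing earlier events."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new stream value into the cached aggregates."""
        self.count += 1
        self.total += value
        return self.total / self.count
```

The same pattern extends to any decomposable aggregate (min, max, variance via running moments), which is exactly the kind of "fairly simple calculation" streams favor.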
2. Partitioning Data
Partition the incoming stream, or divide it into sub-streams, separating like and combinable data.
Group 'like' data--for instance, data from the same geographic region in a traffic application, or a particular cell/screen subsection in a graphics application--and process each group separately from 'unlike' data.
Group combinable data. Streams need data aggregation: for instance, 'rolling up' data for analysis, comparison, and so on. Pre-aggregate data from separate sources to avoid real-time aggregation across large data sets. Hence it is important to separate streams into independent sub-streams of data. Splitting a stream into combinable sub-streams effectively allows parallel execution on the data.
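A simple way to see partitioning is a key-based split: route every event to a sub-stream determined by a key function (region, device, screen cell), so each sub-stream can be processed independently and in parallel. This is a minimal sketch; the function and key names are assumptions for the example.

```python
from collections import defaultdict

def partition(events, key_fn):
    """Split one stream into independent sub-streams keyed by key_fn,
    so 'like' data is grouped together and each group can be
    processed in parallel, separately from 'unlike' data."""
    substreams = defaultdict(list)
    for event in events:
        substreams[key_fn(event)].append(event)
    return substreams
```

In a real deployment the same idea appears as hash-partitioning across workers: `hash(key) % num_workers` guarantees that all events for one key always land on the same processing node.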
3. Handling Data
3.1 Mercilessly discard unnecessary data.
Data filtering is an important task in handling streams: data is oftentimes bad. Data from analog systems, such as sensors, can be atrociously dirty. Such data needs correction, or outright discarding if it falls outside allowable limits.
Design algorithms to judiciously disregard, correct, or consume data from the stream. If it is safer to simply ignore a data point and wait for the next event, rather than spend effort correcting it, do so.
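The discard-and-wait policy can be expressed as a small filter stage: readings outside the allowable limits are simply dropped, on the assumption that the next valid event will arrive shortly. A minimal sketch, with the function name and limits chosen for the example:

```python
def clean(readings, low, high):
    """Yield only readings within allowable limits; out-of-range data
    is judiciously discarded rather than corrected, on the assumption
    that the next valid event will supersede it."""
    for value in readings:
        if low <= value <= high:
            yield value
```

Because it is a generator, the filter composes naturally into a pipeline and never buffers the stream.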
3.2 Also, be ready to accept seemingly unnecessary data.
Though messages are expected to arrive chronologically in streams, it is possible to receive out-of-sequence events. Equip algorithms and data stores to handle out-of-sequence data. For instance, if you are using sliding-window calculations, keep the data open for editing for a short while after the processing window has closed. This allows recovery and update of data that is effectively 'out-of-stream'.
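The idea of keeping a window editable after it closes is often called a grace period. Below is a minimal Python sketch of a tumbling window with such a grace period; the class name, time units, and API are assumptions for illustration, not any particular framework's interface.

```python
class GracefulWindow:
    """Tumbling window that stays open for editing for `grace` time units
    after the window closes, so slightly out-of-sequence events
    are still counted instead of being lost."""

    def __init__(self, size, grace):
        self.size = size
        self.grace = grace
        self.buckets = {}  # window start time -> list of values

    def add(self, event_time, value, now):
        """Place value into its window; accept it if the window is
        still open or within the grace period, else drop it."""
        start = (event_time // self.size) * self.size
        window_end = start + self.size
        if now <= window_end + self.grace:  # window still editable
            self.buckets.setdefault(start, []).append(value)
            return True
        return False  # too late even for the grace period: dropped
```

The trade-off is between completeness and latency: a longer grace period recovers more out-of-stream data but delays the point at which a window's result can be finalized.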
4. Choosing Data Stores
The λ architecture suggests that most modern-day enterprises have three types of data and processing requirements. Transactional systems continue to have solid ACID requirements (though gradually relaxing toward eventual consistency), which call for robust RDBMS-type transaction stores. Big data analytics needs to store each and every type of data, which calls for HDFS-type stores. The third type is stream processing, which takes a cursory look at each and every piece of data to gain real-time insights.
Stream processing applications require sub-millisecond response times for data reads and writes. Choose in-memory databases, clustered data grids, etc., and cushion them with a good cache to handle 'Internet-speed' data. It is important to choose the data store wisely: for many use-cases, the largest impact on performance will come from this part of stream processing.
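The "cushion with a good cache" advice amounts to a read-through cache in front of the slower backing store, so hot keys are served from memory. A minimal sketch, assuming a caller-supplied `load_fn` that reads from the real store; the class name and TTL policy are illustrative choices, not a specific product's API.

```python
import time

class TTLCache:
    """Minimal read-through cache that cushions a slower backing store;
    repeat reads within the TTL are served from the in-memory dict."""

    def __init__(self, load_fn, ttl_seconds=1.0):
        self.load_fn = load_fn          # reads from the real data store
        self.ttl = ttl_seconds
        self.entries = {}               # key -> (value, expiry time)

    def get(self, key):
        value, expiry = self.entries.get(key, (None, 0.0))
        if time.monotonic() < expiry:
            return value                # fast path: in-memory hit
        value = self.load_fn(key)       # slow path: fall through to the store
        self.entries[key] = (value, time.monotonic() + self.ttl)
        return value
```

Even a short TTL can absorb most of the read traffic for hot keys, which is often where the sub-millisecond budget is won or lost.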
5. Choosing Underlying Frameworks
There is a multitude of products and frameworks available for stream processing. Most of these frameworks offer raw event-processing functionality (Storm, Samza, Borealis, etc.), constructs based on an SPL--a stream processing language--(IBM Streams, TIBCO StreamBase, etc.), or both (Storm+Trident).
Thoroughly evaluate products before choosing a framework. Before choosing, ask questions such as:
- "what is my throughput requirement, in terms of events per second?"
- "are my use-cases mostly orchestration of processing elements?"
- "do I need to build processing units, and wire them differently based on use-case?"
- "are all my use-cases (current and future) covered by one single SPL?"
6. Testing and Debugging
Debugging stream processing applications that process large amounts of data at high rates in a distributed environment is extremely hard. Typically, debugging needs to address multiple facets: application semantics, code, deployment details, performance metrics, etc. (Gedik et al.)
Debugging stream processing requires building or buying some key features into the application framework. The framework should have the ability to:
- run the solution as-is on a test bench, without rebuilding the application
- suspend processing units when events arrive
- capture, inspect and change the events, also inject new events
- trace the flow of an event in a separate log/visual cue
- detect anomalies: stragglers, faulty processing nodes, poison events, etc.
Most SPL frameworks, such as IBM Streams and TIBCO StreamBase, inherently contain this functionality. Non-SPL frameworks need a different approach, such as terser unit tests, controlled runs on embedded instances, etc. Evaluate this tradeoff early, pick wisely, and design (as far as possible) agnostically.
Notes and References
Big data reflects organizations' dream of storing every type of data they encounter; stream processing is the wish to process all that data and infer as much information as possible in real time or near-real time. This makes stream processing applications different from traditional OLTP or event-driven SOA applications, and the concepts for architecting them are different too. The six concepts above are some of the preliminary ones engineers and architects should remember before building new stream processing systems.
The design of the Borealis and Aurora frameworks provides a good introduction to how stream processing frameworks are conceived and built. There are some historically significant papers, such as those on scalable stream processing and the eight rules of stream processing, that provide good information on designing stream processing applications. A few papers, such as this survey of stream processing frameworks, contain good comparisons of the various frameworks.
Another aspect is enrichment using additional data sources, which was mentioned above as an activity best avoided. However, in some circumstances all the data needed to make a decision may not be available within the event itself. This leads to a trade-off between the richness of the decision-making process and throughput.
Some more questions to ask before choosing a framework:
- Do I need reliable state management?
- Do I need windowing capabilities?
- With at-least-once delivery semantics, are my data store updates idempotent?
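The idempotency question matters because, under at-least-once delivery, the same event may be applied twice. One common remedy is to track processed event ids so a redelivered event is a no-op. A minimal sketch, with hypothetical names, illustrating the idea rather than any particular store's API:

```python
class IdempotentStore:
    """Under at-least-once delivery an event may arrive twice;
    remembering processed event ids keeps updates idempotent."""

    def __init__(self):
        self.counts = {}   # key -> running total
        self.seen = set()  # event ids already applied

    def apply(self, event_id, key, amount):
        """Apply the update once; silently ignore duplicate deliveries."""
        if event_id in self.seen:
            return False               # duplicate: no double counting
        self.seen.add(event_id)
        self.counts[key] = self.counts.get(key, 0) + amount
        return True
```

In practice the seen-id set itself needs bounding (e.g. per-window or with a TTL), which is part of the "reliable state management" question above.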