Stream processing for connected assets: Learnings and challenges

With rapid technological advances, a large variety of assets are getting connected and their data is being made available for near-real-time decision making. Processing these data streams with standard, off-the-shelf frameworks is therefore essential for an IoT solution. Several stream processing frameworks have emerged in recent years, and Apache Flink is a particularly promising one.

In this post, we lay out the unique characteristics of IoT data (as compared to, say, financial, e-commerce, or DevOps data streams), our usage of Apache Flink for processing such streams, the challenges we encountered, and potential solutions. We conclude with a few suggestions for review by the Flink community.

Data & processing characteristics

1.      Data from connected assets can arrive at different frequencies (in event time, e.g., every second, every minute, every 5 minutes). To detect or predict extremely transient conditions, we have seen use cases, especially in the industrial space, that need data at very high resolution, even down to the second level

2.      Data from connected assets can be asynchronous/non-periodic, meaning data points are not uniformly spaced in event time. This is common where there is high sensitivity to bandwidth consumption, e.g., in the marine world

3.      Data may not arrive at the time it was generated: it can arrive late due to network latency and connectivity issues, or due to regulatory restrictions. For the same connectivity reasons, time synchronization with an external NTP server cannot be assumed.

4.      Some analytical models work on a combination of “raw data” (data from the assets) and “derived data” (output of another analytical model), leading to situations where not all data needed by a model is delivered at the “same time”

5.      An analytical model is usually specific to a product type (or, in some cases, to a product instance) and needs only a subset of the raw data stream; the models themselves are tumbling-window, sliding-window, or batch based
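To make characteristics 2 and 3 concrete, here is a minimal sketch (illustrative values only, not from any real asset) of asynchronous readings whose event-time spacing is non-uniform and whose arrival is delayed in transit:

```python
from datetime import datetime, timedelta

# Hypothetical asynchronous sensor readings: (event_time, value).
# Spacing is non-uniform, and records arrive later than they were generated.
readings = [
    (datetime(2024, 1, 1, 0, 0, 0), 10.0),
    (datetime(2024, 1, 1, 0, 0, 7), 10.4),   # 7 s gap
    (datetime(2024, 1, 1, 0, 1, 2), 11.1),   # 55 s gap
]

def lateness_seconds(event_time, arrival_time):
    """Lag between when a reading was generated and when it reached the platform."""
    return (arrival_time - event_time).total_seconds()

# Simulate a 90-second transport delay (e.g., store-and-forward over a constrained link).
arrivals = [(t + timedelta(seconds=90), v) for t, v in readings]
lags = [lateness_seconds(e[0], a[0]) for e, a in zip(readings, arrivals)]
print(lags)  # every record is 90 s late in this simulation
```

Any event-time processing over such a stream has to account for both the irregular spacing and the generation-to-arrival lag.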

Key Apache Flink features utilized

1.      Event time, windowing, and watermarking. With watermark support, Flink provides a flexible mechanism to trade off the latency and completeness of results.

2.      Flink's support for RocksDB and object stores such as AWS S3 as state backends proved helpful for such short-term, frequent caching and persistence needs. This is very useful for analytical models where there is a vast difference between the rate of input data and the output generated, and huge amounts of state (in GBs) must be persisted frequently (on the order of minutes).

3.      Processing-time semantics, which perform computations as triggered by the wall-clock time of the processing machine.

4.      Python functions via PyFlink, which come in handy for simple calculations on streaming data

5.      Elastic scaling as more assets start communicating, as already-connected assets report at a higher frequency, or as more analytical models are deployed
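The latency/completeness trade-off behind feature 1 can be sketched in plain Python. This mimics the logic of Flink's bounded-out-of-orderness strategy (the class name and millisecond units here are illustrative, not Flink's API):

```python
class BoundedOutOfOrdernessWatermark:
    """The watermark trails the maximum event time seen so far by a fixed
    delay: a larger delay admits more late data (completeness) at the cost
    of holding windows open longer (latency)."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms):
        # Track the highest event time observed, even if events are out of order.
        self.max_ts = max(self.max_ts, event_ts_ms)

    def current_watermark(self):
        # Windows ending at or before this timestamp may be fired.
        return self.max_ts - self.max_delay_ms

wm = BoundedOutOfOrdernessWatermark(max_delay_ms=5_000)
for ts in [1_000, 4_000, 2_500, 9_000]:   # out-of-order event times
    wm.on_event(ts)
print(wm.current_watermark())  # 9_000 - 5_000 = 4_000
```

In Flink itself this corresponds to a watermark strategy with a configured bounded delay; the sketch only shows why the delay parameter is the latency/completeness knob.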

 
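For feature 4, the functions we register as PyFlink scalar UDFs are ordinary Python. A minimal sketch of the kind of per-record calculation involved (the function name and threshold are illustrative; the PyFlink registration is shown only as a comment, since it needs a running table environment):

```python
# Ordinary Python function of the kind registered as a PyFlink scalar UDF
# for simple calculations on streaming data.
def overtemperature_flag(temp_c, limit_c=95.0):
    """Return 1 if the reading breaches the limit, else 0."""
    return 1 if temp_c > limit_c else 0

# In a real job this would be wrapped for PyFlink, e.g.:
#   from pyflink.table.udf import udf
#   flag = udf(overtemperature_flag, result_type=DataTypes.INT())
# (sketched here as a comment; requires a Flink table environment)

print(overtemperature_flag(101.3))  # 1
```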

Challenges faced when using Apache Flink

1.      Operator instance level watermarks

Most analytical models are designed for a given product type, since assets of that type share the expected insights and the same set of sensor data. To get asset-level insights from such ‘generic’ jobs, asset-specific input data streams are fed into the model and flow through the system as parallel streams. In such scenarios, Flink expects:

o  All input streams for a given model to be time aligned

o  Any delay across the parallel streams to be the same and to occur at the same time

Essentially, watermarks in Flink are set at the operator-instance level and not at a “key” level (the key being the asset in this case). Hence, employing a single Flink job to derive insights for multiple assets is a challenge.

A simple but ineffective alternative is to run a separate job for each asset, but this quickly becomes an operational nightmare. Our proposed solution is to use custom windowing with local machine time (system time) for window eviction, instead of the native Flink watermarking concept.
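The idea behind this workaround can be sketched as follows. Events are buffered per asset key, and each asset's window is evicted when a wall-clock interval elapses, so one slow asset cannot hold back the others the way a shared operator-level watermark would. The class and names are illustrative, not a Flink API (the clock is injectable so the sketch is testable):

```python
import time
from collections import defaultdict

class PerAssetProcessingTimeWindows:
    """Buffer events per asset and evict each asset's window on a
    wall-clock interval, independent of other assets' streams."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.buffers = defaultdict(list)
        self.opened_at = {}

    def on_event(self, asset_id, value):
        now = self.clock()
        self.opened_at.setdefault(asset_id, now)
        self.buffers[asset_id].append(value)
        # Fire this asset's window once its own wall-clock span has elapsed.
        if now - self.opened_at[asset_id] >= self.window_seconds:
            window = self.buffers.pop(asset_id)
            del self.opened_at[asset_id]
            return window          # emit for downstream model execution
        return None

# Demo with a fake clock: asset "a" fires even though asset "b" lags behind.
ticks = iter([0, 1, 2, 61])
win = PerAssetProcessingTimeWindows(60, clock=lambda: next(ticks))
results = [win.on_event("a", 1), win.on_event("a", 2),
           win.on_event("b", 9), win.on_event("a", 3)]
print(results[-1])  # [1, 2, 3]
```

In an actual Flink job the equivalent would be a keyed process function with processing-time timers; the sketch only shows why per-key wall-clock eviction sidesteps operator-level watermarks.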

2.      Memory challenges for processing large overlapping data sets (sliding window)

Consider a scenario where an analytical model requires multiple sensor data points (e.g., 100 sensors) and each sensor streams at a 1-second resolution. The model is sliding-window based and requires the last 60 minutes of data to generate output every minute. That is 360K points (100 sensors * 60 * 60 seconds) held for each execution, every minute. If these models are relevant for multiple assets, the memory requirement grows linearly with the number of assets.

Our proposed solution to avoid such high memory consumption is to use Flink state together with a custom windowing implementation.
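One common shape for such a custom windowing implementation is pre-aggregated panes: instead of retaining every raw point for the 60-minute window, keep one small aggregate per minute and combine the last 60 on each slide. This is a sketch of the general technique (for an average, with illustrative numbers), not necessarily our exact implementation:

```python
from collections import deque

class PaneBasedSlidingAverage:
    """Keep one (sum, count) pane per minute instead of raw points.
    Per-sensor state drops from ~3600 raw points to 60 small tuples."""

    def __init__(self, panes=60):
        self.panes = deque(maxlen=panes)   # oldest pane evicted automatically
        self.current_sum = 0.0
        self.current_count = 0

    def add(self, value):
        # Fold each raw reading into the open pane; raw value is not retained.
        self.current_sum += value
        self.current_count += 1

    def close_pane(self):
        """Call once per minute: seal the pane and emit the window average."""
        self.panes.append((self.current_sum, self.current_count))
        self.current_sum, self.current_count = 0.0, 0
        total = sum(s for s, _ in self.panes)
        count = sum(c for _, c in self.panes)
        return total / count if count else None

agg = PaneBasedSlidingAverage(panes=60)
for minute in range(3):
    for _ in range(60):            # one reading per second
        agg.add(float(minute))
    avg = agg.close_pane()
print(avg)  # mean over minutes 0, 1, 2 -> 1.0
```

The same pattern works for any aggregate that can be merged from partial results (sum, count, min, max); aggregates that need the raw distribution would still require keeping more state.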

3.      Load sharing in a Flink cluster

In situations where various analytical models must be supported and each analytics job has different resource requirements (parallelism, memory, etc.), deploying all of them in a single cluster is not straightforward and can lead to sub-optimal resource utilization. This can be alleviated to a degree by “grouping” analytical models with similar behaviour into separate clusters.

Suggestion on potential enhancements

1.      Support for key-based watermarking is highly desirable for many IoT use cases.

2.      Support for resource management across different workloads; the Flink community is already working to address this requirement.

3.      Native Flink watermarking and windowing are useful for processing a large spike of data (spread across event time) in a very short span of processing time, e.g., when assets come online after being offline for a period. While this works for most spike-ingestion cases, timer de-duplication can lead to loss of data in some of them. Handling this natively would be preferable to every user building a custom solution.

Authors - Sushrutha Bankapura | Santosh Agarwal
