Introducing SideOutputs in Apache Flink
Apache Flink is by far one of the best open-source stateful stream processing frameworks available. Just as Hadoop is an open-source implementation of the MapReduce programming model [1], Flink is an open-source implementation of the Dataflow model [2][3].
Unlike the ParDo "core primitive" described in the Dataflow paper, Flink holds a strongly typed, single-input (with a few exceptions, such as joins), single-output constraint. The type constraint is enforced both in the API and in the runtime context, where every operator is required to specify its input type(s) and a single output type.
Why do side outputs matter? The Flink API already offers splitting an output into different streams using string tags. The split/select pattern seems sufficient for stateless operators whose output is derived solely from a limited set of input events from upstream. A good example is calculating GPS pins at both the city and the country level at the same time. A GPS event could be assigned both a "Seattle" tag and a "U.S" tag; a downstream city-level operator selects events tagged "Seattle", while a country-level operator selects events tagged "U.S", and each calculates its own results.
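As a minimal sketch of that split/select pattern (using Flink's 1.x `split`/`select` API; the `GpsEvent` type and the tag values are hypothetical placeholders):

```java
import java.util.Arrays;

import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;

public class SplitSelectExample {

    // Hypothetical event type: a GPS pin labeled at two granularities.
    public static class GpsEvent {
        public final String city;     // e.g. "Seattle"
        public final String country;  // e.g. "U.S"

        public GpsEvent(String city, String country) {
            this.city = city;
            this.country = country;
        }
    }

    // Assign BOTH tags to each event, so both downstream operators see it.
    public static final OutputSelector<GpsEvent> BY_CITY_AND_COUNTRY =
        new OutputSelector<GpsEvent>() {
            @Override
            public Iterable<String> select(GpsEvent event) {
                return Arrays.asList(event.city, event.country);
            }
        };

    public static void wire(DataStream<GpsEvent> gpsStream) {
        SplitStream<GpsEvent> tagged = gpsStream.split(BY_CITY_AND_COUNTRY);

        // Each downstream operator selects the granularity it cares about.
        DataStream<GpsEvent> cityLevel = tagged.select("Seattle"); // city-level stats
        DataStream<GpsEvent> countryLevel = tagged.select("U.S");  // country-level stats
    }
}
```

Note the event itself is duplicated logically, not physically: select only filters by the tags the selector attached.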
Side outputs are designed to fit natively into stateful operators, which is where Flink shines compared to other frameworks like Apache Samza [4]. Thanks to the control messages introduced in the ABS paper [5], state in Flink supports exactly-once semantics without leveraging external resources in the execution runtime.
In my opinion, side outputs essentially enable Flink users to expose exactly-once context from within a stateful operator, in parallel with the incremental event processor (delta processor). "Context" is a broad term whose meaning depends on the user's specific use case. It could be:
- a debugging stack trace, if an exception was thrown within the stateful operator [6];
- the related raw events behind a derived statistic event emitted downstream (e.g. the list of transactions, kept for audit purposes, behind a charge proposed to a customer's billing account);
- late-arriving events after a bounded window has expired and been recycled (e.g. a user's cloud-service usage logs arriving after the normal calculation cycle, say 72 hours). See [7] for details.
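The use cases above map onto the side-output API that shipped in Flink 1.3: declare an `OutputTag`, emit to it from inside a `ProcessFunction` via the context, and pull the side stream out afterwards with `getSideOutput`. A minimal sketch, where the input records, the parse-failure condition, and the tag name are all placeholder assumptions:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputExample {

    // Anonymous subclass so the tag retains its generic type information.
    static final OutputTag<String> ERRORS = new OutputTag<String>("errors") {};

    // Pure helper (placeholder logic): parse, or signal failure with null.
    static Long tryParse(String s) {
        try {
            return Long.parseLong(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void wire(DataStream<String> input) {
        SingleOutputStreamOperator<Long> main = input.process(
            new ProcessFunction<String, Long>() {
                @Override
                public void processElement(String value, Context ctx, Collector<Long> out)
                        throws Exception {
                    Long parsed = tryParse(value);
                    if (parsed != null) {
                        out.collect(parsed);                         // main output
                    } else {
                        ctx.output(ERRORS, "bad record: " + value);  // side output
                    }
                }
            });

        // The side stream is retrieved from the operator, not the environment,
        // and can be sinked for debugging/audit independently of the main path.
        DataStream<String> errors = main.getSideOutput(ERRORS);
    }
}
```

For the late-data case in [7], windowed streams additionally offer `sideOutputLateData(tag)`, with the same `getSideOutput(tag)` call on the window result.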
SideOutputs [6] and SideInputs [8] (in progress) are essential parts of achieving the N-inputs/N-outputs ParDo core primitive described in the Dataflow model [2]. There was sizable refactoring to enable the N-in/N-out model, and some API trade-offs were made to honor binary backward compatibility. Some really interesting questions (e.g. stream starvation) have also emerged and are yet to be solved completely. There may come a time when Flink should rearchitect the API, as well as certain parts of the runtime above the network layer, drop backward compatibility, and make the API look pretty :) For now, there is a lot of excitement around StreamSQL: it empowers data analysts to write SQL that taps the power of an exactly-once stateful streaming framework and brings value to products. Personally, I am very excited to see the progress being made on that front. [9]
Feel free to contact me if you have any questions or feature requests regarding side outputs. #FlinkForward
References:
[1] MapReduce: Simplified Data Processing on Large Clusters
[2] The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
[3] MillWheel: Fault-Tolerant Stream Processing at Internet Scale
[4] Apache Samza
[5] Lightweight Asynchronous Snapshots for Distributed Dataflows
[6] Side Outputs in the Flink 1.3 Documentation
[7] Get Late Data as a Side Output
[8] FLIP-17: Side Inputs for the DataStream API
[9] Uber AthenaX