15 Minutes to Read "Apache Flink: Stream and Batch Processing in a Single Engine"
Definitions
Lambda Architecture = combine batch and stream processing to implement multiple paths of computation.
- A streaming fast path for timely approximate results.
- A batch offline path for late accurate results.
Kappa Architecture = a simplification of the Lambda Architecture with the batch processing system removed.
- To replace batch processing, data is simply fed through the streaming system quickly.
- Perform both real-time and batch processing with a single technology stack.
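The Kappa idea can be sketched in a few lines: a single streaming code path serves both live processing and reprocessing, where "batch" is just a fast replay of the retained log. A minimal illustration (toy logic, not real streaming code):

```python
# Kappa sketch: one streaming job, two inputs. Batch = replaying the log
# quickly through the SAME code path; no second batch implementation.

def streaming_job(events):
    """Single implementation of the business logic: per-key running counts."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
    return counts

live_stream = ["a", "b", "a"]                # events arriving now
historical_log = ["a", "b", "a", "c", "a"]   # full retained log, replayed quickly

# Reprocessing reuses the exact same job, avoiding the Lambda Architecture's
# "implement business logic twice" problem.
full_counts = streaming_job(historical_log)
```

The contrast with Lambda is that the offline path would be a second, separately maintained implementation of the same counting logic.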
Highlights
Apache Flink is an open-source system for processing streaming and batch data.
- Philosophy: many classes of data processing applications can be executed as pipelined fault-tolerant dataflows.
- Diverse use cases can be unified under a single execution model.
- Acknowledge the need for dedicated batch processing (dealing with static data sets).
- (1) An efficient batch processor built on top of a streaming runtime.
- (2) A specialized API, specialized data structures/algorithms with additional optimizations, efficient fault tolerance, and dedicated (staged) scheduling strategies.
System Architecture - Flink's runtime and APIs.
- Flink runtime = a DAG of stateful operators connected with data streams.
- DataStream and DataSet API.
- Flink cluster.
- Client = take program code → transform to dataflow graph → submit to Job Manager.
- Job Manager = coordinate the distributed execution of the dataflow.
- Task Manager = perform the actual data processing.
All Flink programs are eventually compiled down to a common representation: the streaming dataflow graph, which is executed by Flink's runtime engine.
- Flink offers reliable execution with strict exactly-once-processing consistency guarantees and deals with failures via checkpointing and partial re-execution.
- Incremental processing and iterations are crucial for many applications. Iteration steps = special operators that themselves contain an execution graph, enabling iterations on a DAG-based runtime and scheduler.
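The common representation can be sketched as a toy model: a DAG of stateful operators connected by data streams (illustrative names only, not Flink's API):

```python
# Toy dataflow graph: operators are nodes, streams are edges. Each operator
# may hold state and emits records to its downstream operators.

class Operator:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn              # record-at-a-time processing function
        self.downstream = []

    def connect(self, op):        # a data stream edge in the DAG
        self.downstream.append(op)

    def process(self, record):
        out = self.fn(record)
        if out is not None:       # forward results along the stream edges
            for op in self.downstream:
                op.process(out)

# Build a tiny graph: map -> sink.
sink_values = []
src = Operator("map", lambda r: r * 2)
sink = Operator("sink", lambda r: sink_values.append(r))
src.connect(sink)

for r in [1, 2, 3]:
    src.process(r)
```

In the real system the graph is partitioned across Task Managers and streams carry control events as well as records, but the node/edge structure is the same.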
Details
Problems of batching continuous data into static data sets or the Lambda Architecture.
- High latency imposed by batch.
- High complexity.
- Connect and orchestrate several systems.
- Implement business logic twice.
- Arbitrary inaccuracy.
- The time dimension is not handled explicitly by the application code.
- Uncontrolled randomness in algorithm/function implementations and in how issues are handled.
Data exchange through intermediate data streams.
- Pipelined data exchange = exchange data between concurrently running producers and consumers.
- Blocking data exchange = buffer all data from producers before making it available for consumption.
- Balance latency and throughput by configuring the buffer size.
- Streams communicate different types of control events, e.g. checkpoint barriers, watermarks, iteration barriers.
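The latency/throughput balance can be sketched as a buffer that ships either when full (throughput-friendly) or on a timeout (latency-friendly). The buffer size below is illustrative, not Flink's default:

```python
# Pipelined exchange sketch: the producer fills a buffer that is shipped to
# the consumer when full, or flushed early (as a timeout would do).

class PipelinedChannel:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffer = []
        self.shipped = []          # batches received by the consumer

    def send(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.buffer_size:
            self.flush()           # full buffer shipped: favors throughput

    def flush(self):               # called on a timeout: favors latency
        if self.buffer:
            self.shipped.append(list(self.buffer))
            self.buffer.clear()

ch = PipelinedChannel(buffer_size=3)
for r in range(7):
    ch.send(r)
ch.flush()   # timeout-style flush of the partial last buffer
```

A blocking exchange would instead buffer everything from the producer before shipping anything; a buffer size of 1 degenerates to record-at-a-time shipping (lowest latency, lowest throughput).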
Flink checkpointing mechanism.
- Distributed consistent snapshots to achieve exactly-once-processing guarantees.
- Asynchronous Barrier Snapshotting (ABS) = take a consistent snapshot across all parallel operators without halting the execution of the topology. Snapshot of all operators should refer to the same logical time in the computation.
- Similar to the Chandy-Lamport algorithm, but ABS does not checkpoint in-flight records; it relies on an alignment phase so that all their effects are already applied to the operator states.
- Failure recovery.
- Revert all operator states to the last successful snapshot.
- Restart input stream from the latest barrier for which there is a snapshot.
- Partial recovery of a failed subtask = replay the unprocessed records buffered at its immediate upstream subtasks.
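Barrier alignment can be shown with a minimal single-threaded sketch of an operator with two input channels (illustrative, not Flink's implementation): once a barrier arrives on one channel, records from that channel are buffered until barriers have arrived on all channels, at which point the state is snapshotted.

```python
# ABS alignment sketch for a two-input operator whose state is a running sum.
BARRIER = "BARRIER"

class AlignedOperator:
    def __init__(self, num_inputs):
        self.state = 0           # operator state: a running sum
        self.blocked = set()     # channels already aligned on the current barrier
        self.pending = []        # records buffered from blocked channels
        self.num_inputs = num_inputs
        self.snapshots = []      # checkpointed states

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:   # barriers on all inputs
                self.snapshots.append(self.state)      # consistent snapshot
                self.blocked.clear()                   # (barrier forwarded here)
                pending, self.pending = self.pending, []
                for r in pending:                      # replay buffered records
                    self.state += r
        elif channel in self.blocked:
            self.pending.append(item)  # post-barrier record: buffer, don't apply
        else:
            self.state += item

op = AlignedOperator(num_inputs=2)
for ch, item in [(0, 1), (1, 2), (0, BARRIER), (0, 5), (1, 3), (1, BARRIER)]:
    op.receive(ch, item)
# The snapshot excludes the post-barrier record (5) on channel 0, which is
# applied to the state only after alignment completes.
```

This is why no in-flight records need to be checkpointed: the snapshot cleanly separates pre-barrier from post-barrier effects.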
Stream analytics on top of Dataflows.
- Event time (+ watermark), processing time and ingestion time.
- Stateful stream processing = stream windows are stateful operators that assign records to continuously updated buckets kept in memory as part of the operator state.
- Stream windows = continuously evolving logical views for incremental computations over unbounded streams.
- Incorporate window within a stateful operator with three core functions: window assigner, trigger, evictor.
- Asynchronous stream iterations.
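The three window functions can be sketched for a tumbling event-time window with a watermark-driven trigger (window size, names, and the per-record watermark are illustrative):

```python
# Window sketch: assigner puts records into buckets (operator state),
# trigger fires a bucket when the watermark passes its end, evictor may
# drop records before the window function is evaluated.

WINDOW_SIZE = 10

def assigner(event_time):
    """Assign each record to the tumbling window [start, start + WINDOW_SIZE)."""
    return (event_time // WINDOW_SIZE) * WINDOW_SIZE

def trigger(window_start, watermark):
    """Fire the window once the watermark passes its end."""
    return watermark >= window_start + WINDOW_SIZE

def evictor(records):
    """Optionally drop records before evaluation; here keep everything."""
    return records

buckets = {}   # operator state: window start -> buffered records
results = {}   # fired windows: window start -> aggregate

def on_record(event_time, value, watermark):
    w = assigner(event_time)
    buckets.setdefault(w, []).append(value)
    for start in list(buckets):
        if trigger(start, watermark):
            results[start] = sum(evictor(buckets.pop(start)))

on_record(3, 1, watermark=5)
on_record(7, 2, watermark=8)
on_record(12, 4, watermark=13)   # watermark passes 10: window [0, 10) fires
```

Swapping the assigner (e.g. overlapping sliding windows) or the trigger (e.g. count-based) yields the other window types without changing the surrounding operator.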
Batch analytics on top of Dataflows.
- Batch computations are executed by the same runtime as streaming computations.
- Periodic snapshotting is turned off when its overhead is high.
- Blocking operators can spill to disk if inputs exceed their memory bounds.
- Dedicated DataSet API provides familiar abstractions.
- Query optimization layer transforms the DataSet program into an efficient executable.
- Allow for any type of structured iteration logic and introduce novel optimization techniques such as delta iterations.
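A delta iteration in this sense can be sketched with the classic connected-components example, where each step works only on a "workset" of changed elements instead of recomputing the full solution (illustrative Python, not the DataSet API):

```python
# Delta iteration sketch: propagate minimum component IDs through a graph.
# The solution set (vertex -> component ID) is updated in place; only
# vertices whose value changed re-enter the workset for the next step.

def connected_components(vertices, edges):
    component = {v: v for v in vertices}   # solution set: vertex -> min ID seen
    workset = set(vertices)                # vertices changed in the last step
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    while workset:                         # terminate when the workset is empty
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if component[v] < component[n]:
                    component[n] = component[v]
                    next_workset.add(n)    # only changed vertices carry over
        workset = next_workset
    return component

comp = connected_components([1, 2, 3, 4], [(1, 2), (2, 3)])
```

A plain bulk iteration would revisit every vertex in every step; the shrinking workset is what makes the later steps cheap once most of the solution has converged.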
References
- Paper: Apache Flink: Stream and Batch Processing in a Single Engine (2015)
- https://milinda.pathirage.org/kappa-architecture.com/
- What Is the Kappa Architecture?