15 Minutes to Read "Apache Flink: Stream and Batch Processing in a Single Engine"
Definitions
Lambda Architecture = combine batch and stream processing to implement multiple paths of computation.
- A streaming fast path for timely approximate results.
- A batch offline path for late accurate results.
Kappa Architecture = a simplification of the Lambda Architecture with the batch processing system removed.
- To replace batch processing, data is simply fed through the streaming system quickly.
- Perform both real-time and batch processing with a single technology stack.
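The Kappa idea can be sketched in a few lines: a single streaming code path serves both live processing and reprocessing, where "batch" is just a fast replay of the retained log. A minimal illustration (toy logic, not real streaming code):

```python
# Kappa sketch: one streaming job, two inputs. Batch = replaying the log
# quickly through the SAME code path; no second batch implementation.

def streaming_job(events):
    """Single implementation of the business logic: per-key running counts."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
    return counts

live_stream = ["a", "b", "a"]                # events arriving now
historical_log = ["a", "b", "a", "c", "a"]   # full retained log, replayed quickly

# Reprocessing reuses the exact same job, avoiding the Lambda Architecture's
# "implement business logic twice" problem.
full_counts = streaming_job(historical_log)
```

The contrast with Lambda is that the offline path would be a second, separately maintained implementation of the same counting logic.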
Highlights
Apache Flink is an open-source system for processing streaming and batch data.
- Philosophy: many classes of data processing applications can be executed as pipelined fault-tolerant dataflows.
- Diverse use cases can be unified under a single execution model.
- Acknowledge the need for dedicated batch processing (dealing with static data sets).
- (1) An efficient batch processor built on top of a streaming runtime.
- (2) A specialized API, specialized data structures/algorithms with additional optimizations, efficient fault tolerance, and dedicated (staged) scheduling strategies.
System Architecture - Flink's runtime and APIs.
- Flink runtime = a DAG of stateful operators connected with data streams.
- DataStream and DataSet API.
- Flink cluster.
- Client = take program code → transform to dataflow graph → submit to Job Manager.
- Job Manager = coordinate the distributed execution of the dataflow.
- Task Manager = perform the actual data processing.
All Flink programs are eventually compiled down to a common representation: the streaming dataflow graph, which is executed by Flink's runtime engine.
- Flink offers reliable execution with strict exactly-once-processing consistency guarantees and deals with failures via checkpointing and partial re-execution.
- Incremental processing and iterations are crucial for many applications. Iteration steps = special operators that themselves contain an execution graph, enabling iterations on a DAG-based runtime and scheduler.
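The common representation can be sketched as a toy model: a DAG of stateful operators connected by data streams (illustrative names only, not Flink's API):

```python
# Toy dataflow graph: operators are nodes, streams are edges. Each operator
# may hold state and emits records to its downstream operators.

class Operator:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn              # record-at-a-time processing function
        self.downstream = []

    def connect(self, op):        # a data stream edge in the DAG
        self.downstream.append(op)

    def process(self, record):
        out = self.fn(record)
        if out is not None:       # forward results along the stream edges
            for op in self.downstream:
                op.process(out)

# Build a tiny graph: map -> sink.
sink_values = []
src = Operator("map", lambda r: r * 2)
sink = Operator("sink", lambda r: sink_values.append(r))
src.connect(sink)

for r in [1, 2, 3]:
    src.process(r)
```

In the real system the graph is partitioned across Task Managers and streams carry control events as well as records, but the node/edge structure is the same.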
Details
Problems of batching continuous data into static data sets or the Lambda Architecture.
- High latency imposed by batch.
- High complexity.
- Connect and orchestrate several systems.
- Implement business logic twice.
- Arbitrary inaccuracy.
- The time dimension is not handled explicitly by the application code.
- Uncontrolled randomness in algorithm/function implementations and in how issues are handled.
Data exchange through intermediate data streams.
- Pipelined data exchange = exchange data between concurrently running producers and consumers.
- Blocking data exchange = buffer all data from producers before making it available for consumption.
- Balance latency and throughput by configuring the buffer size.
- Streams communicate different types of control events, e.g. checkpoint barriers, watermarks, iteration barriers.
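The latency/throughput balance can be sketched as a buffer that ships either when full (throughput-friendly) or on a timeout (latency-friendly). The buffer size below is illustrative, not Flink's default:

```python
# Pipelined exchange sketch: the producer fills a buffer that is shipped to
# the consumer when full, or flushed early (as a timeout would do).

class PipelinedChannel:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffer = []
        self.shipped = []          # batches received by the consumer

    def send(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.buffer_size:
            self.flush()           # full buffer shipped: favors throughput

    def flush(self):               # called on a timeout: favors latency
        if self.buffer:
            self.shipped.append(list(self.buffer))
            self.buffer.clear()

ch = PipelinedChannel(buffer_size=3)
for r in range(7):
    ch.send(r)
ch.flush()   # timeout-style flush of the partial last buffer
```

A blocking exchange would instead buffer everything from the producer before shipping anything; a buffer size of 1 degenerates to record-at-a-time shipping (lowest latency, lowest throughput).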
Flink checkpointing mechanism.
- Distributed consistent snapshots to achieve exactly-once-processing guarantees.
- Asynchronous Barrier Snapshotting (ABS) = take a consistent snapshot across all parallel operators without halting the execution of the topology. Snapshot of all operators should refer to the same logical time in the computation.
- Similar to the Chandy-Lamport algorithm, but ABS does not checkpoint in-flight records; it relies on an alignment phase so that all their effects are already applied to the operator states.
- Failure recovery.
- Revert all operator states to the last successful snapshot.
- Restart input stream from the latest barrier for which there is a snapshot.
- Partial recovery of a failed subtask = replay the unprocessed records buffered at its immediate upstream subtasks.
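Barrier alignment can be shown with a minimal single-threaded sketch of an operator with two input channels (illustrative, not Flink's implementation): once a barrier arrives on one channel, records from that channel are buffered until barriers have arrived on all channels, at which point the state is snapshotted.

```python
# ABS alignment sketch for a two-input operator whose state is a running sum.
BARRIER = "BARRIER"

class AlignedOperator:
    def __init__(self, num_inputs):
        self.state = 0           # operator state: a running sum
        self.blocked = set()     # channels already aligned on the current barrier
        self.pending = []        # records buffered from blocked channels
        self.num_inputs = num_inputs
        self.snapshots = []      # checkpointed states

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:   # barriers on all inputs
                self.snapshots.append(self.state)      # consistent snapshot
                self.blocked.clear()                   # (barrier forwarded here)
                pending, self.pending = self.pending, []
                for r in pending:                      # replay buffered records
                    self.state += r
        elif channel in self.blocked:
            self.pending.append(item)  # post-barrier record: buffer, don't apply
        else:
            self.state += item

op = AlignedOperator(num_inputs=2)
for ch, item in [(0, 1), (1, 2), (0, BARRIER), (0, 5), (1, 3), (1, BARRIER)]:
    op.receive(ch, item)
# The snapshot excludes the post-barrier record (5) on channel 0, which is
# applied to the state only after alignment completes.
```

This is why no in-flight records need to be checkpointed: the snapshot cleanly separates pre-barrier from post-barrier effects.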
Stream analytics on top of Dataflows.
- Event time (+ watermark), processing time and ingestion time.
- Stateful stream processing = stream windows are stateful operators that assign records to continuously updated buckets kept in memory as part of the operator state.
- Stream windows = continuously evolving logical views for incremental computations over unbounded streams.
- Incorporate window within a stateful operator with three core functions: window assigner, trigger, evictor.
- Asynchronous stream iterations.
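The three window functions can be sketched for a tumbling event-time window with a watermark-driven trigger (window size, names, and the per-record watermark are illustrative):

```python
# Window sketch: assigner puts records into buckets (operator state),
# trigger fires a bucket when the watermark passes its end, evictor may
# drop records before the window function is evaluated.

WINDOW_SIZE = 10

def assigner(event_time):
    """Assign each record to the tumbling window [start, start + WINDOW_SIZE)."""
    return (event_time // WINDOW_SIZE) * WINDOW_SIZE

def trigger(window_start, watermark):
    """Fire the window once the watermark passes its end."""
    return watermark >= window_start + WINDOW_SIZE

def evictor(records):
    """Optionally drop records before evaluation; here keep everything."""
    return records

buckets = {}   # operator state: window start -> buffered records
results = {}   # fired windows: window start -> aggregate

def on_record(event_time, value, watermark):
    w = assigner(event_time)
    buckets.setdefault(w, []).append(value)
    for start in list(buckets):
        if trigger(start, watermark):
            results[start] = sum(evictor(buckets.pop(start)))

on_record(3, 1, watermark=5)
on_record(7, 2, watermark=8)
on_record(12, 4, watermark=13)   # watermark passes 10: window [0, 10) fires
```

Swapping the assigner (e.g. overlapping sliding windows) or the trigger (e.g. count-based) yields the other window types without changing the surrounding operator.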
Batch analytics on top of Dataflows.
- Batch computations are executed by the same runtime as streaming computations.
- Periodic snapshotting is turned off when its overhead is high.
- Blocking operators can spill to disk if inputs exceed their memory bounds.
- Dedicated DataSet API provides familiar abstractions.
- Query optimization layer transforms the DataSet program into an efficient executable.
- Allow for any type of structured iteration logic and introduce novel optimization techniques such as delta iterations.
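A delta iteration in this sense can be sketched with the classic connected-components example, where each step works only on a "workset" of changed elements instead of recomputing the full solution (illustrative Python, not the DataSet API):

```python
# Delta iteration sketch: propagate minimum component IDs through a graph.
# The solution set (vertex -> component ID) is updated in place; only
# vertices whose value changed re-enter the workset for the next step.

def connected_components(vertices, edges):
    component = {v: v for v in vertices}   # solution set: vertex -> min ID seen
    workset = set(vertices)                # vertices changed in the last step
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    while workset:                         # terminate when the workset is empty
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if component[v] < component[n]:
                    component[n] = component[v]
                    next_workset.add(n)    # only changed vertices carry over
        workset = next_workset
    return component

comp = connected_components([1, 2, 3, 4], [(1, 2), (2, 3)])
```

A plain bulk iteration would revisit every vertex in every step; the shrinking workset is what makes the later steps cheap once most of the solution has converged.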
References
- Paper: Apache Flink: Stream and Batch Processing in a Single Engine (2015)
- https://milinda.pathirage.org/kappa-architecture.com/
- What Is the Kappa Architecture?