Flink's Dataflow Model


The good news is that while APIs evolve, the Dataflow Model remains the mathematical and architectural foundation of Flink. These concepts stay relevant because they define how any modern, distributed stream processor handles "unbounded" data.


The Dataflow Graph

At its simplest, a Flink program is a Directed Acyclic Graph (DAG).

  • Data Streams: The edges in the graph. These represent the continuous flow of records.

  • Operators: The nodes in the graph. They perform computations (like Map, Filter, or Window).

  • Logical vs. Physical Graphs: You write a logical graph (your code), which Flink optimizes into a physical graph that can be distributed across multiple worker nodes.

Together, these form the logical flow of data from a source, through various transformations, to a destination (a sink).

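As a quick illustration, here is a minimal sketch of such a graph in the Java DataStream API (assuming a reasonably recent Flink version; the `.returns(...)` call is a type hint that compensates for Java's lambda type erasure):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataflowGraphSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "bb", "ccc")   // source node
           .map(String::length)              // operator node: transformation
           .returns(Types.INT)               // type hint for the erased method reference
           .filter(len -> len > 1)           // operator node: transformation
           .print();                         // sink node

        // Flink compiles this logical graph into a physical, distributed one
        env.execute("Logical dataflow graph");
    }
}
```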

Parallelism and Data Partitioning

Since Flink is a distributed system, it splits streams into stream partitions and operators into operator subtasks.

  • One-to-one (Forwarding): Data stays in the same partition (e.g., a Map following a Source).

  • Redistributing: Data is reshuffled based on a key or a random distribution (e.g., keyBy). This is where "shuffles" happen, which are expensive because they involve network communication.

In a running job, a Map keeps data local (one-to-one), while a keyBy forces a redistributing shuffle across the network.

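Here is a sketch showing both channel types in one job. The data is hypothetical, and `setParallelism(4)` is only there to make the subtask split visible:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PartitioningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // four parallel subtasks per operator

        env.fromElements(
                Tuple2.of("user-1", 1), Tuple2.of("user-2", 1), Tuple2.of("user-1", 1))
           // one-to-one: each map subtask reads from its local upstream partition
           .map(t -> Tuple2.of(t.f0, t.f1 * 2))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           // redistributing: records are hashed on the key and shipped over the
           // network so all events for one key land on the same subtask
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("Forward vs. redistribute");
    }
}
```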

Windows and Triggers

Since a stream never ends, you can't "sort" or "aggregate" the whole thing. You must slice it into Windows.

  • Tumbling Windows: Fixed-size, non-overlapping (e.g., every 5 minutes).

  • Sliding Windows: Overlapping windows (e.g., every 5 minutes, starting every 1 minute).

  • Session Windows: Grouped by periods of activity followed by a gap of silence.

Each of these strategies slices the time-based stream differently.

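All three assigners are part of the DataStream API. Below is a minimal sketch over a hypothetical stream of per-user click counts, using the processing-time variants for brevity (the `Time` helper is the classic API; newer releases accept `java.time.Duration`):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<Tuple2<String, Integer>, String> clicks =
                env.fromElements(Tuple2.of("user-1", 1), Tuple2.of("user-2", 1))
                   .keyBy(t -> t.f0);

        // Tumbling: fixed-size, non-overlapping 5-minute slices
        clicks.window(TumblingProcessingTimeWindows.of(Time.minutes(5))).sum(1).print();

        // Sliding: 5-minute windows, a new one starting every minute (they overlap)
        clicks.window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1))).sum(1).print();

        // Session: a per-key window that closes after 30 seconds of silence
        clicks.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30))).sum(1).print();

        env.execute("Window assigners");
    }
}
```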

Time Semantics

This is arguably the most important terminology in the book:

  • Event Time: When the event actually happened (based on a timestamp inside the data). This is the most accurate but hardest to handle.

  • Processing Time: The time on the clock of the machine doing the work. Easy to use, but results are inconsistent if the system slows down.

  • Watermarks: The "clock" for Event Time. A watermark is a special marker that tells the system, "I'm reasonably sure no more data with a timestamp older than XX is coming."

The Watermark "clock" follows the stream of events, telling operators when it is safe to act despite lateness.


(Note: In a real stream, a Watermark of 10:00 tells the operator: "Assume no more data prior to 10:00 will arrive.")
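Here is a sketch of how that contract is declared in code. The `Click` type and its `timestampMillis` field are hypothetical; `forBoundedOutOfOrderness` is Flink's built-in generator that keeps the watermark a fixed delay behind the largest timestamp seen so far:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    // Hypothetical event type carrying its own event-time timestamp
    public static class Click {
        public String userId;
        public long timestampMillis;
        public Click() {}
        public Click(String userId, long timestampMillis) {
            this.userId = userId;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Click> clicks = env.fromElements(
                new Click("user-1", 1_000L), new Click("user-2", 3_000L));

        // The watermark trails the largest timestamp seen by 10 seconds, i.e.
        // "assume nothing older than (max seen - 10s) is still coming."
        DataStream<Click> withEventTime = clicks.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Click>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((click, recordTs) -> click.timestampMillis));

        withEventTime.print();
        env.execute("Event time and watermarks");
    }
}
```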

State and Checkpoints

Unlike "stateless" functions, Flink remembers things.

  • Keyed State: Information stored relative to a specific key (e.g., a running total for a specific UserID).

  • Checkpoints: Periodic, internal "snapshots" of the state. If a server dies, Flink rolls back to the last checkpoint and resumes.

  • Savepoints: Manually triggered snapshots. These are "stop-motion" frames that allow you to update your code or migrate clusters without losing your progress.

To achieve this fault tolerance, Flink takes a "snapshot" of the current operator state and stores it in durable storage (like S3 or HDFS).

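A sketch of keyed state plus checkpointing: the hypothetical `RunningTotal` function keeps one counter per key, and `enableCheckpointing` asks Flink to snapshot that state periodically (the `open(Configuration)` signature is the long-standing one; very recent releases prefer an `OpenContext` variant):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class KeyedStateSketch {

    // Keeps a running total per key; Flink scopes the ValueState to the current key
    public static class RunningTotal
            extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Integer>> {

        private transient ValueState<Integer> total;

        @Override
        public void open(Configuration parameters) {
            total = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("total", Types.INT));
        }

        @Override
        public void processElement(Tuple2<String, Integer> in, Context ctx,
                                   Collector<Tuple2<String, Integer>> out) throws Exception {
            int current = total.value() == null ? 0 : total.value();
            current += in.f1;
            total.update(current); // survives failures via checkpoints
            out.collect(Tuple2.of(in.f0, current));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Snapshot all operator state every 30 seconds for fault tolerance
        env.enableCheckpointing(30_000);

        env.fromElements(Tuple2.of("user-1", 5), Tuple2.of("user-1", 3))
           .keyBy(t -> t.f0)
           .process(new RunningTotal())
           .print();

        env.execute("Keyed state with checkpoints");
    }
}
```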

Key Terminology Quick-Ref

  • Backpressure: When a downstream operator is slower than an upstream one, causing the flow to throttle.

  • Lateness: Records that arrive after the Watermark has already passed their timestamp.

  • Operator State: State that is bound to a parallel operator instance rather than a specific key.

