More about Watermarks


Watermarks

While the concept of a watermark is simple ("a timestamp marking progress"), implementing them in a distributed system is hard. This chapter answers: How do we calculate this number when we have 1,000 servers processing data at different speeds?

Watermark Creation (Source Watermarks)

Watermarks are not "global" initially; they are created at the Ingestion Point (the Source).

  • Per-Partition Tracking: If you are reading from Kafka (which has partitions), the system tracks a watermark for each partition individually.

  • The Monotonicity Rule: Watermarks must strictly increase. They can pause, but they can never go backward.

  • The Input: A source (like a Kafka consumer) looks at the timestamps of the messages it is reading and uses a heuristic (e.g., Current Time - 10 seconds) to publish a watermark for that partition.

Watermark Propagation (The Flow)

Once created at the source, how does the watermark move through the graph of transformations (Map → GroupByKey → Filter)?

There is a strict mathematical rule for propagation:

OutputWatermark=min(InputWatermark,OldestActiveWork)OutputWatermark = \min(InputWatermark, OldestActiveWork)

This creates two distinct definitions you must know:

  1. Input Watermark: The "completeness" of the data entering this stage.

    • Formula: min(Output Watermarks of all upstream stages)\min(\text{Output Watermarks of all upstream stages}).

    • Meaning: If you have 10 sources, and 9 are at 12:05 but one is stuck at 12:00, your Input Watermark is 12:00. The slow consumer holds back the entire system.

  2. Output Watermark: The "completeness" of the data leaving this stage.

    • Formula: min(Input Watermark,Timestamp of oldest active element in buffer)\min(\text{Input Watermark}, \text{Timestamp of oldest active element in buffer}).

    • Meaning: Even if the input is at 12:05, if this specific worker is currently processing a file from 11:55, it cannot declare "11:55 is done." The Output Watermark holds at 11:55 until that work is finished.

The "Idle Source" Problem

This is a classic edge case discussed in the chapter.

  • Scenario: You have two input streams. Stream A is active (sending data at 12:05). Stream B goes silent (no data since 11:00).

  • The Problem: Because of the min() rule, the Global Watermark is stuck at 11:00 (Stream B). Stream A's windows cannot close. The system halts.

  • The Solution: Idle Timeouts. Sources must declare themselves "Idle" after a period of inactivity. The system then ignores the idle source when calculating the min(), allowing the watermark to advance based on Stream A.

Percentile Watermarks (Heuristics)

How do we guess the watermark for dynamic sources?

The book discusses calculating watermarks based on distribution curves.

  • The Logic: "Based on history, 90% of data arrives within 5 seconds, and 99% arrives within 1 minute."

  • Adaptive Watermarks: Modern systems (like Google Cloud Dataflow) don't use fixed constants (e.g., "minus 10 seconds"). They monitor the lag distribution in real-time. If the network gets slow, the system automatically slows down the watermark (increasing latency to preserve correctness). If the network is fast, it speeds up.

System Latency vs. Data Latency

The chapter draws a crucial distinction between these two metrics, which are often confused:

  • Data Latency (The Skew): The time between Event Time (when it happened) and Processing Time (now).

    • Cause: Creating the data, uploading it, network path.

  • System Latency (Processing Lag): The time between Processing Time (arrival) and Output.

    • Cause: Your code is slow, or the CPU is maxed out.

  • The Watermark's Role: Watermarks allow you to reason about Data Latency independently of System Latency.


Summary Diagram Logic

If you were to draw a diagram for this chapter, it would be the "Watermark Flow":

  1. Source: Multiple independent partitions, each with a different timestamp.

  2. Aggregation Node: A box that takes those inputs.

  3. The Gatekeeper: Shows the min() function picking the oldest time.

  4. Result: The global watermark that determines when windows fire.


Last updated