Watermarks


In the book Streaming Systems, these are the three foundational pillars of the Dataflow Model. Each tackles a distinct problem in the "What, Where, When, How" framework, and they are complex enough to deserve their own dedicated sections in your notes.

Here is the deep dive for the first concept: Watermarks.

Watermarks: Measuring Progress in Event Time

The Core Problem:

When processing infinite, unordered data using Event Time windows, we face a paradox:

  • We want to group data by when it happened (e.g., 12:00–12:05).

  • But data can arrive late (due to network lag, offline devices, etc.).

  • The Question: How long do we wait for the 12:00 data to finish arriving? 1 minute? 10 minutes? Forever?

If we wait too long, latency becomes unacceptable. If we don't wait long enough, our results are wrong.

Definition

A Watermark is a monotonically increasing timestamp XX that represents the system's "belief" about input completeness.

The Guarantee: "A watermark of time XX declares that we have observed all events with an event time less than XX, and we will never see any more."

In simpler terms: It is the system drawing a line in the sand and saying, "I am calling it. The 12:00 window is officially over. Any data from 12:00 that arrives after this point is officially Late."

Types of Watermarks

There are two ways to generate a watermark, depending on your knowledge of the input data.

A. Perfect Watermarks

  • Context: Used when you have perfect knowledge of the input (e.g., reading a static log file, or a Kafka topic with fixed partitions and monotonically increasing timestamps).

  • Behavior: The system knows exactly where the "end" of the data is.

  • Pros: 100% Completeness. No late data ever.

  • Cons: The watermark is blocked by the "slowest" element. If one log file is stuck, the entire system pauses. High Latency.

B. Heuristic Watermarks

  • Context: Used in distributed, dynamic inputs (e.g., mobile devices, Google Cloud Pub/Sub, Internet sockets) where you cannot know the global state of all data.

  • Behavior: The system estimates the watermark using algorithms (e.g., "Take the oldest timestamp seen minus a 10-second buffer," or "Track network partition stats").

  • Pros: The system keeps moving. It doesn't get blocked by one slow device. Low Latency.

  • Cons: The Watermark can be wrong. It might declare 12:00 "done," only for a 12:00 event to arrive 1 second later. This creates Late Data.

The Watermark Trade-off

A heuristic watermark is a knob you turn to balance Latency vs. Correctness.

  1. Too Fast (Aggressive Watermark):

    • Behavior: The watermark advances closely behind real-time.

    • Result: Low Latency (windows close quickly), but High Error (lots of late data is dropped or requires correction).

  2. Too Slow (Conservative Watermark):

    • Behavior: The watermark lags far behind real-time (waiting for stragglers).

    • Result: High Accuracy (almost no late data), but High Latency (you wait minutes/hours for the window to close).

Summary

  • Role: Answers the question "When are we done?"

  • Mechanism: A timestamp tracking input completeness.

  • Reality: Usually heuristic (an estimate).

  • Consequence: Because heuristics can be wrong, Watermarks imply the existence of Late Data. We need a way to handle that.

Watermarks tell us when the window should close. But what if we want to see results before it closes? Or update them after it closes (if the watermark was wrong)? This requires decoupling the reporting time from the watermark.

That is the job of Triggers, which we will discuss on the next page.


Last updated