# More about Watermarks

***

## Watermarks

While the concept of a watermark is simple ("a timestamp marking progress"), implementing them in a distributed system is hard. This chapter answers: *How do we calculate this number when we have 1,000 servers processing data at different speeds?*

#### Watermark Creation (Source Watermarks)

Watermarks are not "global" initially; they are created at the Ingestion Point (the Source).

* Per-Partition Tracking: If you are reading from Kafka (which has partitions), the system tracks a watermark for *each partition individually*.
* The Monotonicity Rule: Watermarks must strictly increase. They can pause, but they can never go backward.
* The Input: A source (like a Kafka consumer) looks at the timestamps of the messages it is reading and uses a heuristic (e.g., `Current Time - 10 seconds`) to publish a watermark for that partition.

#### Watermark Propagation (The Flow)

Once created at the source, how does the watermark move through the graph of transformations (Map → GroupByKey → Filter)?

There is a strict mathematical rule for propagation:

$$
OutputWatermark = \min(InputWatermark, OldestActiveWork)
$$

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FxNmNPNMV2fe5mH7opRUL%2FGemini_Generated_Image_yfazu0yfazu0yfaz.png?alt=media&#x26;token=ab32a917-457f-4485-be52-bc9aa7bdd00c" alt=""><figcaption></figcaption></figure>

This creates two distinct definitions you must know:

1. Input Watermark: The "completeness" of the data *entering* this stage.
   * *Formula:* $$\min(\text{Output Watermarks of all upstream stages})$$.
   * *Meaning:* If you have 10 sources, and 9 are at 12:05 but one is stuck at 12:00, your Input Watermark is 12:00. The slow consumer holds back the entire system.
2. Output Watermark: The "completeness" of the data *leaving* this stage.
   * *Formula:* $$\min(\text{Input Watermark}, \text{Timestamp of oldest active element in buffer})$$.
   * *Meaning:* Even if the input is at 12:05, if this specific worker is currently processing a file from 11:55, it cannot declare "11:55 is done." The Output Watermark holds at 11:55 until that work is finished.

#### The "Idle Source" Problem

This is a classic edge case discussed in the chapter.

* Scenario: You have two input streams. Stream A is active (sending data at 12:05). Stream B goes silent (no data since 11:00).
* The Problem: Because of the `min()` rule, the Global Watermark is stuck at 11:00 (Stream B). Stream A's windows cannot close. The system halts.
* The Solution: Idle Timeouts. Sources must declare themselves "Idle" after a period of inactivity. The system then ignores the idle source when calculating the `min()`, allowing the watermark to advance based on Stream A.

#### Percentile Watermarks (Heuristics)

How do we guess the watermark for dynamic sources?

The book discusses calculating watermarks based on distribution curves.

* The Logic: "Based on history, 90% of data arrives within 5 seconds, and 99% arrives within 1 minute."
* Adaptive Watermarks: Modern systems (like Google Cloud Dataflow) don't use fixed constants (e.g., "minus 10 seconds"). They monitor the lag distribution in real-time. If the network gets slow, the system automatically slows down the watermark (increasing latency to preserve correctness). If the network is fast, it speeds up.

#### System Latency vs. Data Latency

The chapter draws a crucial distinction between these two metrics, which are often confused:

* Data Latency (The Skew): The time between *Event Time* (when it happened) and *Processing Time* (now).
  * *Cause:* Creating the data, uploading it, network path.
* System Latency (Processing Lag): The time between *Processing Time* (arrival) and *Output*.
  * *Cause:* Your code is slow, or the CPU is maxed out.
* The Watermark's Role: Watermarks allow you to reason about Data Latency *independently* of System Latency.

***

#### Summary Diagram Logic

If you were to draw a diagram for this chapter, it would be the "Watermark Flow":

1. Source: Multiple independent partitions, each with a different timestamp.
2. Aggregation Node: A box that takes those inputs.
3. The Gatekeeper: Shows the `min()` function picking the oldest time.
4. Result: The global watermark that determines when windows fire.

***
