Triggers
Triggers: Decoupling Output from Completeness
The Core Problem: Watermarks provide a useful reference point for "Completeness" (when the window should be done). However, relying solely on the Watermark to emit results is problematic:
Latency: If the Watermark is slow (conservative), you might wait an hour to see a result that was mostly ready in 5 minutes.
Correctness: If the Watermark is fast (aggressive) and wrong, it closes the window. If late data arrives afterwards, a Watermark-only system would drop it.
The Solution: Triggers Triggers are a mechanism to declare when a pane of the window should be materialized (output), independent of the Watermark status.
The Concept: While the Watermark says "I think we are done," the Trigger says "I want to see the result NOW."

The Three Types of Triggers
By combining Watermarks and Triggers, we can output results at three distinct phases of the window's lifecycle.
A. Early Triggers (Speculative Results)
Definition: Firing before the Watermark passes the end of the window.
Logic: "Update this window every minute," or "Update every time we receive 1,000 records."
Goal: Low Latency.
Use Case: Real-time dashboards. You want to see the "Click Count" climbing in real-time (10... 50... 100...) rather than waiting for the final number at the end of the hour.
Note: These results are speculative. They will likely change as more data arrives.
B. On-Time Triggers (Completeness Results)
Definition: Firing exactly when the Watermark passes the end of the window.
Logic: "The Watermark for 12:05 has arrived. Close the 12:00-12:05 bucket and emit."
Goal: Completeness.
Use Case: This is the equivalent of a classic Batch job. It represents the system's "best guess" at the final, correct answer.
C. Late Triggers (Correction Results)
Definition: Firing after the Watermark has passed, whenever late data arrives.
Logic: "I know we closed the 12:00 window 10 minutes ago, but a new event just arrived. Emit an updated result."
Goal: Correctness.
Use Case: Ensuring data accuracy despite network lag. Without late triggers, any data arriving after the Watermark (the "Heuristic Error") would be silently dropped.
Composition (Mixing Them)
The power of the Dataflow model is that you don't choose one; you use them together.
Example Strategy: "Give me estimates every minute (Early), give me a final tally when you think we're done (On-Time), and send me a correction if anything changes afterwards (Late)."
The "Fire and Purge" vs. "Fire and Accumulate"
When a trigger fires, what happens to the data in the window?
Fire and Purge: Emit the result and delete the window content. (Useful if you just want a stream of deltas).
Fire and Accumulate: Emit the result but keep the window open to add future data to it. (Required if you want "Running Totals").
This leads us directly to the final piece of the puzzle. If I emit a result at 12:01 (Early: 10), and another at 12:05 (On-Time: 25), and another at 12:10 (Late: 26)... how does the downstream database understand this sequence?
Does 25 replace 10? Or should I add 25 to 10? This is the domain of Accumulation, which we will discuss next.
Last updated