Accumulation


Accumulation: Managing Result Refinement

The Core Problem:

Because we are using Triggers, a single window (e.g., 12:00–12:05) might produce multiple outputs over time.

  • Output 1 (Early): Count = 10

  • Output 2 (On-Time): Count = 25

  • Output 3 (Late): Count = 28

The Question: How should the downstream system (the database or dashboard consuming this stream) interpret these numbers?

  • Should it add 2525 to 1010? (Total = 35? Wrong)

  • Should it replace 1010 with 2525? (Total = 25? Correct)

The Solution: Accumulation Modes

Accumulation defines the "semantics of refinement." It tells the system how the current result relates to the previous results for the same window.

The Three Accumulation Modes

A. Discarding (Deltas)

  • Logic: Every time the trigger fires, the system calculates the result for the new data only, emits it, and then throws away the state.

  • Message: "Here is a chunk of new data. You figure out the total."

  • Sequence: 10 ... 15 (meaning +15 new ones) ... 3 (meaning +3 late ones).

  • Use Case: When the downstream system is an aggregator itself (e.g., sending data to a metric store that sums everything up).

B. Accumulating (Snapshots)

  • Logic: The system retains the state. Every time the trigger fires, it calculates the current total of everything seen so far.

  • Message: "The current total is now X." (Overwrite the old value).

  • Sequence: 10 ... 25 (10+15) ... 28 (25+3).

  • Use Case: Overwrite-capable storage like Key/Value stores (BigTable, DynamoDB, Redis). You just PUT the new value, overwriting the old 10 with 25.

C. Accumulating & Retracting (The "Undo" Button)

  • Logic: The system provides the new total, but also provides a "retraction" for the previous total.

  • Message: "I previously told you the total was 10. I was wrong. Subtract 10, and add 25."

  • Sequence:

    1. Emit 10

    2. Emit Retract -10, Emit 25

    3. Emit Retract -25, Emit 28

  • Use Case: Complex downstream systems that perform grouping or summing (like SQL). If you just sent 25, a downstream SUM operation might add 10 + 25 = 35 (double counting). The retraction ensures the math stays clean: 10 - 10 + 25 = 25.


Comparison Table: processing the sequence [1, 3, 2]

Imagine a window receives three events with values 1, 3, and 2. A trigger fires after every single event.

Event Arrives

Discarding Mode Output

Accumulating Mode Output

Acc. & Retracting Mode Output

Input: 1

1

1

1

Input: 3

3

4 (1+3)

-1, 4

Input: 2

2

6 (4+2)

-4, 6

Final View

Consumer must sum: 1+3+2=61+3+2 = 6

Consumer sees latest: 66

Consumer sums all: 1+(1+4)+(4+6)=61 + (-1 + 4) + (-4 + 6) = 6


Summary of the "What, Where, When, How" Framework

You have now documented the complete architectural solution for Unbounded Data:

  1. What: Transformations (The Math).

  2. Where: Event Time Windowing (The Buckets).

  3. When: Watermarks (Completeness) + Triggers (Latency/Correctness).

  4. How: Accumulation (Refinement Semantics).

This separates the Logical definition of your data pipeline from the Physical execution, solving the "Batch Coupling" problem we identified at the start.


Last updated