Exactly-Once Processing

This chapter tackles one of the most contested topics in streaming. For a long time, many practitioners held that true "Exactly-Once" was impossible in distributed systems, often citing the Two Generals' Problem. The book argues that it is not only possible but necessary for correctness.



The Core Problem:

In a distributed system, machines fail. When a machine fails, you must retry the work.

  • At-Least-Once: If you retry, you might process the same record twice. (Result: Over-counting revenue).

  • At-Most-Once: If you never retry, a record lost during a failure stays lost. (Result: Missing revenue).

  • Exactly-Once: You process the data, handle failures, and the final result is as if no failure ever occurred.
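The three guarantees can be compared with a small simulation. This is a minimal sketch, not any engine's implementation: `run_pipeline`, `lost_sends`, and `lost_acks` are hypothetical names, and each failure is assumed to affect only the first attempt for a record.

```python
def run_pipeline(payments, lost_sends, lost_acks, retry, dedup):
    """Sum payments while simulating network failures.

    lost_sends: record IDs whose first send never reaches the receiver.
    lost_acks:  record IDs whose first ack never reaches the sender.
    retry:      the sender resends until it sees an ack.
    dedup:      the receiver ignores record IDs it has already applied.
    """
    seen_ids = set()
    revenue = 0
    for record_id, amount in enumerate(payments):
        attempt = 0
        while True:
            attempt += 1
            first_try = attempt == 1
            arrived = not (first_try and record_id in lost_sends)
            if arrived and not (dedup and record_id in seen_ids):
                revenue += amount          # the effect on state
                seen_ids.add(record_id)
            acked = arrived and not (first_try and record_id in lost_acks)
            if acked or not retry:
                break
    return revenue


payments = [10, 20, 30]                    # true revenue is 60
failures = dict(lost_sends={0}, lost_acks={2})

print(run_pipeline(payments, retry=False, dedup=False, **failures))  # at-most-once: 50
print(run_pipeline(payments, retry=True, dedup=False, **failures))   # at-least-once: 90
print(run_pipeline(payments, retry=True, dedup=True, **failures))    # exactly-once: 60
```

Without retries the lost record is never counted; with retries alone the record whose ack was lost is counted twice; retries plus deduplication give the same total as a run with no failures.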

The Reality: "Effectively-Once"

The book clarifies a critical misconception: "Exactly-Once" does not mean the code runs exactly once. The code might run 5 times due to retries!

  • The Definition: Exactly-Once means the effect on the persistent state is applied exactly once.

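  • The Mechanism: It relies on Idempotency (applying the same update a second time changes nothing) and Atomic Commits (a group of updates becomes durable together or not at all).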
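To make the idempotency half concrete, here is a minimal sketch contrasting an update that is unsafe to retry with one that is safe. The function names and the `order-7` ID are illustrative assumptions, not taken from the book.

```python
def apply_increment(state, amount):
    """Not idempotent: every retry adds the amount again."""
    state["revenue"] += amount


def apply_upsert(state, order_id, amount):
    """Idempotent: writing the same order again overwrites it with the same value."""
    state["orders"][order_id] = amount


def revenue(state):
    return sum(state["orders"].values())


naive = {"revenue": 0}
keyed = {"orders": {}}

# The same record is processed 5 times because of retries.
for _ in range(5):
    apply_increment(naive, 40)
    apply_upsert(keyed, "order-7", 40)

print(naive["revenue"])  # 200
print(revenue(keyed))    # 40
```

The code ran five times in both cases; only the keyed version leaves the state as if it had run once.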

The Mechanics of Correctness

How do systems like Google Cloud Dataflow or Apache Flink achieve this? They don't rely on magic; they rely on heavy bookkeeping.

A. Unique Identifiers (The Fingerprint)

Every single record entering the system is assigned a deterministic, unique ID.

  • If a record is re-sent due to a network glitch, it carries the same ID.
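One way to get an ID that survives a re-send is to derive it from where the record came from rather than generating it at send time. The sketch below assumes a log-style source addressed by topic, partition, and offset; `deterministic_id` is a hypothetical helper, and real engines may assign IDs differently.

```python
import hashlib
import uuid


def deterministic_id(topic, partition, offset):
    """Derive the ID from the record's source coordinates, so a re-send gets the same ID."""
    key = f"{topic}/{partition}/{offset}".encode()
    return hashlib.sha256(key).hexdigest()[:16]


def random_id():
    """A fresh ID per send: a retry looks like a brand-new record."""
    return uuid.uuid4().hex


first_send = deterministic_id("orders", 0, 42)
retry_send = deterministic_id("orders", 0, 42)
print(first_send == retry_send)    # True
print(random_id() == random_id())  # False
```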

B. The Shuffle (The Hard Part)

The most dangerous moment is the Shuffle: the step that regroups data by key, which means sending records over the network from one worker to another.

  • Scenario: Worker A sends data to Worker B. Worker B processes it but crashes before saying "Ack."

  • Retry: Worker A sends it again. Worker B restarts and receives it again.

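  • Solution: Worker B maintains a deduplication log of all IDs it has successfully processed, typically a durable Key-Value store. A Bloom Filter can sit in front of that store as a fast path: it can prove that an ID is new, but it can also report false positives, so it cannot serve as the log by itself. Before processing anything, Worker B checks: "Have I seen ID #123 before?"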

    • If Yes: Ignore it (it's a retry).

    • If No: Process it.
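The receiver-side check can be sketched as follows. This is an illustration under stated assumptions: a plain dict stands in for a durable key-value store, the Bloom filter is a toy version used only to skip store lookups for IDs that are definitely new, and `BloomFilter` and `DedupReceiver` are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may say "maybe seen" for a new ID, never "new" for a seen one."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DedupReceiver:
    def __init__(self):
        self.seen_ids = {}        # stand-in for a durable key-value store
        self.bloom = BloomFilter()
        self.count = 0
        self.store_lookups = 0    # how often the slow path was taken

    def receive(self, record_id, value):
        """Apply the record unless its ID was already processed. Returns True if applied."""
        if self.bloom.might_contain(record_id):
            self.store_lookups += 1
            if record_id in self.seen_ids:   # authoritative check
                return False                 # a retry: ignore it
        self.count += value
        self.seen_ids[record_id] = True
        self.bloom.add(record_id)
        return True


worker_b = DedupReceiver()
print(worker_b.receive("id-123", 1))  # True: first delivery is processed
print(worker_b.receive("id-123", 1))  # False: the retry is ignored
print(worker_b.count)                 # 1
```

In a real system the state update and the write to the ID store would need to be committed together; otherwise a crash between the two reintroduces the duplicate.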

C. Atomic State Updates

Processing a record usually involves two things:

  1. Updating the result (e.g., Count = Count + 1).

  2. Updating the position in the stream (e.g., Offset = 100).

To guarantee correctness, these two updates must happen Atomically (all or nothing). Modern streaming engines commit them as one unit. In Flink, for example, operator state (which may be held locally in RocksDB) and the source offsets are captured in the same checkpoint, so if the machine crashes, both the Count and the Offset roll back together.
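The all-or-nothing requirement can be demonstrated with any transactional store. The sketch below uses Python's built-in `sqlite3` purely as a stand-in; it is not how Flink or Dataflow store state, and the table layout and the `crash_midway` flag are assumptions made for the example.

```python
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE state (name TEXT PRIMARY KEY, value INTEGER)")
    conn.executemany("INSERT INTO state VALUES (?, ?)", [("count", 0), ("offset", 0)])
    conn.commit()
    return conn


def process(conn, offset, crash_midway=False):
    """Update the count and the stream offset in a single transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE state SET value = value + 1 WHERE name = 'count'")
            if crash_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute("UPDATE state SET value = ? WHERE name = 'offset'", (offset,))
    except RuntimeError:
        pass  # the worker "restarts"; nothing from the failed attempt was saved


def snapshot(conn):
    return dict(conn.execute("SELECT name, value FROM state"))


conn = open_store()
process(conn, offset=1)
process(conn, offset=2, crash_midway=True)  # count update is rolled back
print(snapshot(conn))                       # {'count': 1, 'offset': 1}
process(conn, offset=2)                     # the retry
print(snapshot(conn))                       # {'count': 2, 'offset': 2}
```

Because the failed attempt left neither update behind, the retry starts from a consistent pair and the record at offset 2 is counted once.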

Comparison of Guarantees

| Guarantee     | Behavior on Failure                          | Result                           |
|---------------|----------------------------------------------|----------------------------------|
| At-Most-Once  | "Fire and Forget." Drop data if busy/broken. | Under-counting. (Data Loss)      |
| At-Least-Once | "Retry until Ack." Duplicates are possible.  | Over-counting. (Double counting) |
| Exactly-Once  | "Retry, but filter duplicates."              | Correctness.                     |

Visualization: The Barrier / Checkpoint

The most common way to visualize this is through "Checkpoints" (used heavily in Flink) or "Atomic Batches" (Spark Streaming).

How to read this visualization:

  • You see a stream of data.

  • Periodically, a vertical bar (The Barrier/Checkpoint) flows with the data.

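  • When the Barrier reaches an operator (e.g., a "Sum" function), the operator writes a snapshot of its current state to durable storage and passes the Barrier on downstream.

  • If the system crashes, every operator restores its state from the last completed Checkpoint, and the source rewinds to the position recorded in that Checkpoint. Events after the Barrier are processed again, but their earlier effects were discarded along with the rolled-back state, so nothing is counted twice.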
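The rewind-and-replay behaviour can be sketched for a single operator. This is a deliberately simplified model: real barrier-based checkpoints coordinate many operators across machines, whereas `SumJob` (a hypothetical class) keeps one running total and one source offset and snapshots them together.

```python
class SumJob:
    """A single "Sum" operator reading from a replayable source."""

    def __init__(self, source):
        self.source = source
        self.offset = 0   # position in the stream
        self.total = 0    # operator state
        self.checkpoint = {"offset": 0, "total": 0}

    def take_checkpoint(self):
        # State and stream position are captured together.
        self.checkpoint = {"offset": self.offset, "total": self.total}

    def recover(self):
        # Roll back both values to the last checkpoint; work done after it is discarded.
        self.offset = self.checkpoint["offset"]
        self.total = self.checkpoint["total"]

    def run(self, checkpoint_every, crash_at=None):
        crashed = False
        while self.offset < len(self.source):
            if crash_at is not None and self.offset == crash_at and not crashed:
                crashed = True
                self.recover()
                continue
            self.total += self.source[self.offset]
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.take_checkpoint()
        return self.total


events = [1, 2, 3, 4, 5, 6]
print(SumJob(events).run(checkpoint_every=2))              # 21
print(SumJob(events).run(checkpoint_every=2, crash_at=3))  # 21
```

In the crashing run the third event is processed twice, but its first contribution is thrown away when the total rolls back to the checkpointed value, so the final sum matches the failure-free run.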

Side Effects (The Limit)

The book adds a crucial warning: The system can only guarantee Exactly-Once inside its own walled garden.

  • Internal: Calculating a sum? Yes, Exactly-Once.

  • External: Sending an email? No.

    • If the system sends an email and then crashes before saving the state, it will restart and send the email again.

    • You cannot "un-send" an email. Side effects are inherently "At-Least-Once."
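When the external system cooperates, the duplicate can be absorbed on its side. The sketch below assumes a hypothetical `EmailService` that accepts an idempotency key; it is not a real mail API, and the key format is an assumption. The pipeline derives the key from the record's ID so that a retry presents the same key.

```python
class EmailService:
    """Hypothetical external service that honours an optional idempotency key."""

    def __init__(self):
        self.sent = []
        self._keys = set()

    def send(self, to, body, idempotency_key=None):
        if idempotency_key is not None:
            if idempotency_key in self._keys:
                return False              # already handled: do not send again
            self._keys.add(idempotency_key)
        self.sent.append((to, body))
        return True


def notify(service, record_id, to, use_key):
    # The key is derived from the record ID, so every retry carries the same key.
    key = f"receipt-{record_id}" if use_key else None
    return service.send(to, "Your receipt", idempotency_key=key)


plain = EmailService()
notify(plain, 123, "a@example.com", use_key=False)
notify(plain, 123, "a@example.com", use_key=False)  # retry after a crash
print(len(plain.sent))  # 2

keyed = EmailService()
notify(keyed, 123, "a@example.com", use_key=True)
notify(keyed, 123, "a@example.com", use_key=True)   # retry after a crash
print(len(keyed.sent))  # 1
```

The pipeline still calls the service twice; the guarantee comes from the service recognising the repeated key, which is why it depends on the external API rather than on the streaming engine.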


Summary

  • Marketing Term: "Exactly-Once."

  • Engineering Reality: "Effectively-Once" (Retries + Deduplication).

  • Requirement: Unique IDs for every record + Atomic State commits.

  • Limitation: Does not apply to external side effects (like sending emails or calling external APIs) unless those APIs are also idempotent.

