Processing Strategies


The Four Approaches Framework

This framework categorizes streaming processing strategies into four groups, representing a hierarchy of increasing sophistication in handling time and correctness.

1. Time-Agnostic

Definition: Operations that depend purely on the attributes of the data, ignoring temporal relationships or ordering. The computation produces the same result regardless of when events arrive or when they occurred.

  • Mechanism: Process events immediately as they arrive. No reordering, no waiting.

  • Example: "Filter out all logs where TEMP > 100°C."

  • Pros: Extremely simple; high throughput; stateless (mostly).

  • Cons: Cannot answer "When?" questions or handle trends.
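
As a concrete illustration of the filter example above, here is a minimal sketch in plain Python; the event shape and the `temp_c` field are assumptions for illustration, not part of the original example:

```python
# Hypothetical time-agnostic filter: each event is handled independently as it
# arrives, with no buffering and no reference to when it occurred or arrived.
from typing import Iterable, Iterator

def filter_hot_readings(events: Iterable[dict]) -> Iterator[dict]:
    """Emit only readings whose 'temp_c' exceeds 100 °C."""
    for event in events:            # `events` may be an unbounded stream
        if event["temp_c"] > 100:   # pure per-record predicate: no state, no clock
            yield event

# Usage: works identically on a finite list or an infinite generator.
readings = [{"sensor": "a", "temp_c": 87}, {"sensor": "b", "temp_c": 113}]
print(list(filter_hot_readings(readings)))  # -> [{'sensor': 'b', 'temp_c': 113}]
```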

The Spectrum of Time-Agnosticism

  • Purely Time-Agnostic: Filtering, Mapping (Transformations), FlatMap.

    • Note: Events arriving with timestamps 12:01, 12:00, 12:02 are emitted in that same arrival order; no attempt is made to reorder them by timestamp.

  • Mostly Time-Agnostic (The Inner Join):

    • Concept: You only emit a result when both sides of the join match.

    • Why it's Agnostic: You don't need to wait for a specific time window to close. You just buffer indefinitely until the match arrives.

    • The Garbage Collection Caveat: While semantically agnostic, you can't buffer forever. In practice, you need a TTL (Time-To-Live) to clean up unmatched state, which technically re-introduces a time constraint (see the join sketch after this list).

  • The Outer Join Exception (Time-Aware):

    • Contrast: A LEFT OUTER JOIN must emit a result even if the right side never comes.

    • The Problem: How long do you wait? You implicitly need a timeout (a Window). Therefore, Outer Joins are always Time-Aware / Windowed.
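
A minimal sketch of the inner-join buffering described above, assuming a keyed two-input join held entirely in memory (the class name and the TTL handling are illustrative, not taken from any particular engine):

```python
# Hypothetical streaming inner join keyed by `key`, with a processing-time TTL
# to garbage-collect buffered state that never finds a match.
import time
from collections import defaultdict

class StreamingInnerJoin:
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.left = defaultdict(list)    # key -> [(arrival_time, event), ...]
        self.right = defaultdict(list)

    def on_left(self, key, event):
        yield from self._join(key, event, self.left, self.right)

    def on_right(self, key, event):
        yield from self._join(key, event, self.right, self.left)

    def _join(self, key, event, own_side, other_side):
        own_side[key].append((time.monotonic(), event))
        # Semantically time-agnostic: emit whenever a match exists, no window needed.
        for _, other in other_side.get(key, []):
            yield (key, event, other)

    def expire(self):
        """The GC caveat: drop buffered state older than the TTL.

        This re-introduces time, but only as a resource-management concern.
        """
        cutoff = time.monotonic() - self.ttl
        for side in (self.left, self.right):
            for key in list(side):
                side[key] = [(t, e) for t, e in side[key] if t >= cutoff]
                if not side[key]:
                    del side[key]

# Usage:
join = StreamingInnerJoin(ttl_seconds=600)
list(join.on_left("order-1", {"item": "book"}))        # [] -- no match yet, buffered
list(join.on_right("order-1", {"status": "shipped"}))  # one joined result
```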


2. Approximation

Definition: Algorithms that provide "good enough" answers using bounded resources (memory/CPU) on unbounded data. They trade Accuracy for Speed and Efficiency.

  • Philosophy: "If you squint, it looks correct."

  • The Time Element: Many of these algorithms have a built-in "decay" factor (fading out old data), which is usually based on Processing Time (arrival).

Common Algorithms:

| Algorithm | Problem Solved | Trade-off |
| --- | --- | --- |
| Approximate Top-N | "Who are the heaviest users?" | Uses fixed memory (e.g., store top 1000) but might miss a new, fast-rising user. |
| Streaming K-Means | "Cluster these user behaviors." | Updates cluster centers incrementally; old data decays over time. |
| HyperLogLog | "Count unique visitors." | Uses ~1.5 KB to count billions of items with ±2% error, vs. storing every ID (unbounded RAM). |
| Count-Min Sketch | "Frequency estimation." | Probabilistic data structure that acts like a hash table to count frequencies with fixed memory. |
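
The last row is simple enough to sketch directly. Below is a minimal, illustrative Count-Min Sketch in plain Python (the width/depth parameters and the hashing scheme are arbitrary choices for the example, not canonical values):

```python
# Hypothetical Count-Min Sketch: fixed memory regardless of how many distinct
# items stream through; estimates may over-count, but never under-count.
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One (roughly independent) hash per row, derived by salting with the row id.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Taking the minimum across rows bounds the over-count from hash collisions.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Usage: memory stays fixed no matter how many distinct users appear.
sketch = CountMinSketch()
for user in ["alice", "bob", "alice", "carol", "alice"]:
    sketch.add(user)
print(sketch.estimate("alice"))  # 3 (possibly higher, never lower)
```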

When to use:

  • ✅ Dashboards, Trending Topics, Anomaly Detection, DDoS protection.

  • ❌ Billing, Auditing, Financial Transactions (where 100 != 99.8).


3. Windowing by Processing Time

Definition: Grouping events based on the wall-clock time they arrived at the system, ignoring when they actually happened.

  • Example: "Count all events received by the server between 2:00 PM and 2:05 PM."

  • Mechanism: Buffer incoming data for 5 minutes; when the system clock hits 2:05, emit the result (sketched in code below).

  • Pros:

    • Completeness is trivial: You know exactly when the window ends (because you own the clock).

    • Simple: No buffering for late data.

  • Cons:

    • Incorrectness: If data generated at 1:55 arrives at 2:02 due to lag, it is counted in the 2:00-2:05 window rather than the 1:55-2:00 window it belongs to. Your analytics now reflect network speed, not user behavior.
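
The processing-time mechanism above can be sketched in a few lines of plain Python (the emission-on-arrival behaviour is a simplification; a real engine would also fire windows from a timer even when no new events arrive):

```python
# Hypothetical processing-time tumbling windows: events are bucketed by the
# wall clock at arrival; their own timestamps are ignored entirely.
import time

def processing_time_counts(events, window_seconds: float = 300.0):
    """Yield (window_start, count) each time the wall clock rolls into a new window."""
    current_window, count = None, 0
    for event in events:
        arrival = time.time()  # the clock *we* own, not the event's own timestamp
        window = int(arrival // window_seconds) * window_seconds
        if current_window is None:
            current_window = window
        if window != current_window:
            # Completeness is trivial: our own clock says the previous window is over.
            yield (current_window, count)
            current_window, count = window, 0
        count += 1  # an event generated at 1:55 but arriving now lands in *this* window
    if current_window is not None:
        yield (current_window, count)
```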


4. Windowing by Event Time

Definition: Grouping events based on the timestamp of when they actually occurred. This is the "Gold Standard" for correctness.

  • Example: "Count all events that happened between 2:00 PM and 2:05 PM."

  • Mechanism: Even if data arrives at 4:00 PM, the system places it back into the 2:00 PM window based on its event timestamp (sketched in code below).

  • Pros:

    • Correctness: Results reflect reality, regardless of network skew or outages.

  • Cons:

    • The Completeness Problem: Since you don't own the event clock (the user does), how do you know if you have seen all the events for 2:00 PM? A user might still be offline.
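
The event-time mechanism can be sketched the same way; the `event_time` field (a Unix timestamp assigned where the event occurred) is an assumption for the example:

```python
# Hypothetical event-time tumbling windows: events are bucketed by their own
# timestamps, however late they happen to arrive.
from collections import Counter

def event_time_counts(events, window_seconds: float = 300.0) -> Counter:
    """Count events per window keyed by each event's *own* timestamp."""
    counts = Counter()
    for event in events:
        window = int(event["event_time"] // window_seconds) * window_seconds
        counts[window] += 1  # a 2:03 event arriving at 4:00 still lands in 2:00-2:05
    return counts
```

Note that nothing in this sketch tells us when the count for the 2:00 PM window is final; that is exactly the completeness problem the conclusion below takes up.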


Comparison diagram


Conclusion: The Central Tension

This framework reveals the core challenge in stream processing: correctness versus completeness.

  • Processing-time windows give you completeness (you know when the window closes) but sacrifice correctness (events land in the wrong windows due to lag)

  • Event-time windows give you correctness (events are properly grouped by when they occurred) but create a completeness problem (when can you finalize results?)

The rest of any serious discussion of streaming systems focuses on resolving this tension: how do you achieve correct event-time processing while handling the inherent incompleteness of unbounded, unordered, variably-skewed data?

This is where concepts like watermarks (estimates of progress in event time), triggers (when to emit results), and accumulation modes (how to refine results) come into play.
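
Those terms can be made concrete with a deliberately toy sketch. The heuristic watermark below (the maximum event time seen so far minus a fixed lateness allowance) is an assumption for illustration; real engines derive and propagate watermarks in more sophisticated ways:

```python
# Hypothetical event-time windows closed by a heuristic watermark: emit a
# window's count once the watermark estimates that its data has arrived.
from collections import Counter

def counts_with_watermark(events, window_seconds=300.0, allowed_lateness=60.0):
    """Yield (window_start, count) once the watermark passes the window's end."""
    counts = Counter()
    emitted = set()
    max_event_time = float("-inf")
    for event in events:
        t = event["event_time"]
        window = int(t // window_seconds) * window_seconds
        if window not in emitted:
            counts[window] += 1          # data behind an already-emitted window is dropped
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - allowed_lateness   # estimate of event-time progress
        # Trigger: emit each window once the watermark says it is (probably) complete.
        for w in sorted(counts):
            if w not in emitted and w + window_seconds <= watermark:
                yield (w, counts[w])
                emitted.add(w)
```

This toy version simply drops anything that arrives after its window has been emitted; triggers and accumulation modes are the machinery real systems use to instead emit early, updated, or corrected results.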

