# Processing Strategies

***

### The Four Approaches Framework

This framework categorizes stream-processing strategies into four groups, forming a hierarchy of increasing sophistication in how they handle time and correctness.

**1. Time-Agnostic**

Definition: Operations that depend purely on the attributes of the data, ignoring temporal relationships or ordering. The computation produces the same result regardless of when events arrive or when they occurred.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2Fa2MWLszr58O1aCn11Egj%2FGemini_Generated_Image_s2x9jls2x9jls2x9.png?alt=media&#x26;token=f1760a39-48f4-44ab-b752-91606373e67a" alt=""><figcaption></figcaption></figure>

* Mechanism: Process events immediately as they arrive. No reordering, no waiting.
* Example: "Filter out all logs where `TEMP > 100°C`."
* Pros: Extremely simple; high throughput; stateless (mostly).
* Cons: Cannot answer "When?" questions or handle trends.
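
To make the purely time-agnostic case concrete, here is a minimal Python sketch of the temperature filter (the record layout and field names like `temp_c` are illustrative assumptions, not from any particular system):

```python
# Time-agnostic filter: each record is handled in isolation, so neither
# arrival order nor event timestamps affect the output.

def high_temp_alerts(log_stream):
    """Yield only the records whose TEMP reading exceeds 100 °C."""
    for record in log_stream:       # process each event as it arrives
        if record["temp_c"] > 100:  # decision uses attributes only, never time
            yield record

# Events arrive out of order; the output preserves arrival order untouched.
logs = [
    {"sensor": "a", "temp_c": 105, "event_time": "12:01"},
    {"sensor": "b", "temp_c": 40,  "event_time": "12:00"},
    {"sensor": "c", "temp_c": 120, "event_time": "12:02"},
]
print(list(high_temp_alerts(logs)))  # sensors "a" and "c", in arrival order
```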

**The Spectrum of Time-Agnosticism**

* Purely Time-Agnostic: Filtering, Mapping (Transformations), FlatMap.
  * *Note:* Events arriving `12:01`, `12:00`, `12:02` are output in that exact same order.
* Mostly Time-Agnostic (The Inner Join):
  * Concept: You only emit a result when *both* sides of the join match.
  * Why it's Agnostic: You don't need to wait for a specific time window to close. You just buffer indefinitely until the match arrives.
  * The Garbage Collection Caveat: While semantically agnostic, you can't buffer forever. In practice, you need a TTL (Time-To-Live) to clean up unmatched state, which technically re-introduces a time constraint (see the join sketch after this list).
* The Outer Join Exception (Time-Aware):
  * Contrast: A `LEFT OUTER JOIN` *must* emit a result even if the right side never comes.
  * The Problem: How long do you wait? You implicitly need a timeout (a Window). Therefore, Outer Joins are always Time-Aware / Windowed.
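
A minimal sketch of the buffered inner join described above, including the TTL-based garbage collection from the caveat (the buffering scheme, key layout, and one-hour TTL are all illustrative assumptions):

```python
import time

# "Mostly time-agnostic" streaming inner join: buffer each side until the
# matching key shows up on the other side, then emit immediately.

TTL_SECONDS = 3600  # expire unmatched rows after an hour of wall-clock time

left_buffer = {}   # key -> (row, arrival_time)
right_buffer = {}  # key -> (row, arrival_time)

def on_event(side, key, row, emit):
    """Handle one event from either input stream of the join."""
    mine, other = ((left_buffer, right_buffer) if side == "left"
                   else (right_buffer, left_buffer))
    if key in other:
        emit((key, row, other[key][0]))  # match found: emit immediately
    else:
        mine[key] = (row, time.monotonic())  # buffer until the match arrives

def garbage_collect():
    """Drop unmatched state older than the TTL so memory stays bounded."""
    now = time.monotonic()
    for buf in (left_buffer, right_buffer):
        for key in [k for k, (_, t) in buf.items() if now - t > TTL_SECONDS]:
            del buf[key]
```

The `garbage_collect` pass is exactly where time sneaks back in: semantically the join never needed a clock, but bounded state does.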

***

**2. Approximation**

Definition: Algorithms that provide "good enough" answers using bounded resources (memory/CPU) on unbounded data. They trade Accuracy for Speed and Efficiency.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FFAGHbvCurxiuMAZNXuFZ%2FGemini_Generated_Image_5cj8ss5cj8ss5cj8.png?alt=media&#x26;token=bd21fa6d-4b82-46d4-bd10-5d3dc27d3164" alt=""><figcaption></figcaption></figure>

* Philosophy: "If you squint, it looks correct."
* The Time Element: Many of these algorithms have a built-in "decay" factor (fading out old data), which is usually based on Processing Time (arrival).

Common Algorithms:

| **Algorithm**     | **Problem Solved**              | **Trade-off**                                                                                  |
| ----------------- | ------------------------------- | ---------------------------------------------------------------------------------------------- |
| Approximate Top-N | "Who are the heaviest users?"   | Uses fixed memory (e.g., track only the top 1,000) but might miss a fast-rising newcomer.      |
| Streaming K-Means | "Cluster these user behaviors." | Updates cluster centers incrementally; the influence of old data decays over time.             |
| HyperLogLog       | "Count unique visitors."        | Uses \~1.5 KB to count billions of items with ±2% error, vs. storing every ID (unbounded RAM). |
| Count-Min Sketch  | "How often did item X occur?"   | Fixed-memory frequency counts that may over-count (hash collisions) but never under-count.     |

When to use:

* ✅ Dashboards, Trending Topics, Anomaly Detection, DDoS protection.
* ❌ Billing, Auditing, Financial Transactions (where `100 != 99.8`).
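
To show how fixed memory buys approximate answers, here is a toy Count-Min Sketch along the lines of the last table row (the width, depth, and hashing scheme are illustrative; production implementations derive them from target error bounds):

```python
import hashlib

# Toy Count-Min Sketch: frequency counts in fixed memory. Estimates can
# over-count (hash collisions) but never under-count.

WIDTH, DEPTH = 2048, 4
table = [[0] * WIDTH for _ in range(DEPTH)]

def _buckets(item: str):
    """One counter position per row, derived from independent-ish hashes."""
    for row in range(DEPTH):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        yield row, int.from_bytes(digest[:8], "big") % WIDTH

def add(item: str) -> None:
    for row, col in _buckets(item):
        table[row][col] += 1

def estimate(item: str) -> int:
    # Taking the minimum across rows bounds the collision over-count.
    return min(table[row][col] for row, col in _buckets(item))

add("user-42"); add("user-42"); add("user-7")
print(estimate("user-42"))  # 2 (or slightly more under heavy collisions)
```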

***

**3. Windowing by Processing Time**

Definition: Grouping events based on the wall-clock time they arrived at the system, ignoring when they actually happened.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FTlmSmBw0zqZxjSSj2GJj%2FGemini_Generated_Image_6jw1is6jw1is6jw1.png?alt=media&#x26;token=c33fd068-4b00-469e-ae8c-6cb11fce775a" alt=""><figcaption></figcaption></figure>

* Example: "Count all events received by the server between 2:00 PM and 2:05 PM."
* Mechanism: Buffer data for 5 minutes; when the system clock hits 2:05, emit the result (see the sketch below).
* Pros:
  * Completeness is trivial: You know exactly when the window ends (because you own the clock).
  * Simple: No buffering for late data.
* Cons:
  * Incorrectness: If data generated at 1:55 arrives at 2:02 due to lag, it is counted in the 2:00 bucket. Your analytics now reflect network speed, not user behavior.
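
A minimal sketch of a processing-time tumbling window, assuming five-minute windows and illustrative function names:

```python
import time
from collections import defaultdict

# Processing-time tumbling window: events are bucketed by the wall clock
# at the moment they ARRIVE.

WINDOW_SECONDS = 300  # five-minute windows, e.g. 2:00-2:05
counts = defaultdict(int)

def on_arrival(event):
    # Any event timestamp embedded in the record is ignored entirely.
    window_start = int(time.time()) // WINDOW_SECONDS * WINDOW_SECONDS
    counts[window_start] += 1

def on_window_close(window_start):
    # Driven by a timer we own: once the clock passes window_start + 300,
    # this bucket can never change, so emit it and forget the state.
    print(window_start, counts.pop(window_start, 0))
```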

***

**4. Windowing by Event Time**

Definition: Grouping events based on the timestamp of when they actually occurred. This is the "Gold Standard" for correctness.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FrL1kJvJMkgRG52Liu0ti%2FGemini_Generated_Image_ipa31dipa31dipa3.png?alt=media&#x26;token=55e4c5d5-b896-4f96-b122-d7f8a5d184e6" alt=""><figcaption></figcaption></figure>

* Example: "Count all events that happened between 2:00 PM and 2:05 PM."
* Mechanism: Even if an event stamped 2:03 PM arrives at 4:00 PM, the system places it back into the 2:00 PM window (see the sketch below).
* Pros:
  * Correctness: Results reflect reality, regardless of network skew or outages.
* Cons:
  * The Completeness Problem: Since you don't own the event clock (the user does), how do you know if you have seen *all* the events for 2:00 PM? A user might still be offline.
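
The same counting logic, switched to event time (a deliberately bare sketch; `event_time` is assumed to be epoch seconds embedded in the record):

```python
from collections import defaultdict

# Event-time tumbling window: events are bucketed by WHEN THEY HAPPENED,
# so a record arriving hours late still lands in its original window.

WINDOW_SECONDS = 300
counts = defaultdict(int)

def on_arrival(event):
    # Bucket by the embedded event timestamp, not the arrival clock.
    window_start = event["event_time"] // WINDOW_SECONDS * WINDOW_SECONDS
    counts[window_start] += 1

# The completeness problem: nothing here tells us when counts[window] is
# final. An offline user could still deliver an event for any past window;
# this is the gap that watermarks and triggers exist to close.
```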

***

**Comparison Diagram**

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FcedFrsNJW5QX7HUClZnr%2FGemini_Generated_Image_8kp69i8kp69i8kp6.png?alt=media&#x26;token=4bffa46d-b1ca-4120-877e-f3b1b8c1813c" alt=""><figcaption></figcaption></figure>

***

#### Conclusion: The Central Tension

This framework reveals the core challenge in stream processing: **correctness versus completeness**.

* **Processing-time windows** give you completeness (you know exactly when the window closes) but sacrifice correctness (lagging events land in the wrong windows).
* **Event-time windows** give you correctness (events are grouped by when they actually occurred) but create a completeness problem (when can you safely finalize results?).

The rest of any good streaming-systems discussion focuses on resolving this tension: how do you achieve **correct event-time processing** while handling the **inherent incompleteness** of unbounded, unordered, variably skewed data?

This is where concepts like **watermarks** (estimates of progress in event time), **triggers** (when to emit results), and **accumulation modes** (how to refine results) come into play.
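
As a preview, the simplest heuristic watermark just trails the maximum event time seen so far by an allowed-lateness slack (a naive sketch under that assumption; real systems derive watermarks from their sources and can still be wrong):

```python
# Naive heuristic watermark: assume events arrive at most ALLOWED_LATENESS
# seconds out of order, and declare event time up to (max_seen - slack)
# "complete." Both the heuristic and the 60-second slack are assumptions.

ALLOWED_LATENESS = 60
watermark = 0

def observe(event_time: int) -> int:
    """Advance the watermark as events (possibly out of order) arrive."""
    global watermark
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    return watermark

def window_is_closable(window_end: int) -> bool:
    # A trigger could finalize an event-time window once the watermark
    # passes its end; anything later than the slack is then either dropped
    # or handled by a refinement (accumulation) pass.
    return watermark >= window_end
```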

***
