The Four Questions Framework for Stream Processing
This is a brilliant conceptual framework for structuring thinking about unbounded data processing.
Architectural solution for Unbounded Data:
What: Transformations (The Math).
Where: Event Time Windowing (The Buckets).
When: Watermarks (Completeness) + Triggers (Latency/Correctness).
How: Accumulation (Refinement Semantics).
This separates the logical definition of your data pipeline from its physical execution, solving the "batch coupling" problem.
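As a concrete sketch of how the four answers line up in code, here is a minimal example using Apache Beam's Python SDK, which implements this model directly. The `events` input and the specific durations are hypothetical choices for illustration:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterProcessingTime, AfterWatermark, AccumulationMode)

# One stage, annotated with the question each piece answers; `events` is a
# hypothetical PCollection of (key, int) pairs carrying event-time timestamps.
results = (
    events
    | beam.WindowInto(
        window.FixedWindows(60 * 60),                           # Where: hourly event-time buckets
        trigger=AfterWatermark(early=AfterProcessingTime(60)),  # When: watermark + early firings
        accumulation_mode=AccumulationMode.ACCUMULATING,        # How: refinement semantics
        allowed_lateness=0)
    | beam.CombinePerKey(sum))                                  # What: the math
```

Each of the sections below drills into one of these lines.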
What - The Computation Itself
This is the most familiar question—it's about what you're actually computing. Are you summing values? Counting events? Building a histogram? Training an ML model? This is the realm of transformations and operations that traditional batch processing has always answered well. It's the "business logic" of your pipeline—the actual work you're trying to accomplish.
Key insight: This question alone isn't enough for streaming systems. Batch processing can answer "what" perfectly fine, but streaming requires us to be more precise about the temporal dimensions.
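In code, the "what" is just an ordinary transformation; nothing about it is streaming-specific. A minimal sketch in Beam's Python SDK, where `scores` is a hypothetical collection of (user, points) pairs:

```python
import apache_beam as beam

# "What": the business logic itself -- summing integer scores per user.
# Identical whether the input is a bounded file or an unbounded stream.
totals = scores | beam.CombinePerKey(sum)
```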
Where - Event Time Positioning
This question asks: where in the timeline of real-world events should we calculate results? This is answered through event-time windowing.
The critical distinction here is between when events actually happened versus when your system processes them. You might be:
Using fixed windows (hourly summaries of website traffic)
Using sliding windows (moving averages over the last 5 minutes)
Using session windows (grouping a user's activity bursts)
Operating with no windowing at all (time-agnostic processing, like classic batch)
Handling complex patterns (like auctions that close after a time limit)
The book notes that even processing-time windowing fits here—if you treat ingress time (when data arrives) as the event time, you're just using a different temporal reference point.
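Each of these windowing strategies is a one-line choice in Beam's Python SDK; `events` is an assumed timestamped collection, and the durations (in seconds) are illustrative:

```python
import apache_beam as beam
from apache_beam.transforms import window

# Fixed windows: hourly summaries of website traffic.
hourly = events | "Fixed" >> beam.WindowInto(window.FixedWindows(60 * 60))

# Sliding windows: 5-minute aggregates, recomputed every minute.
sliding = events | "Sliding" >> beam.WindowInto(
    window.SlidingWindows(size=5 * 60, period=60))

# Session windows: activity bursts separated by >10 minutes of silence.
sessions = events | "Sessions" >> beam.WindowInto(
    window.Sessions(gap_size=10 * 60))

# No windowing at all: one global window, as in classic batch.
unwindowed = events | "Global" >> beam.WindowInto(window.GlobalWindows())
```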
When - Materialization Timing
Here's where it gets interesting: when in processing time should you actually emit results? Just because you're computing hourly windows doesn't mean you output results exactly on the hour.
This is controlled by triggers and watermarks:
Repeated updates: Emit speculative results early and often, refining as more data arrives (materialized view semantics)
Watermark-driven: Wait until you believe all data for a window has arrived, then emit once (mimicking batch semantics on a per-window basis)
Hybrid approaches: Combine both patterns—early speculative results plus a final "complete" result
This separation of where (in event time) from when (in processing time) is one of the fundamental insights of stream processing.
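A sketch of the hybrid pattern in Beam's Python SDK (hypothetical `events` input): `AfterProcessingTime(60)` fires a speculative pane roughly a minute after new data starts arriving in each pane, and the watermark firing provides the final per-window result.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterProcessingTime, AfterWatermark, AccumulationMode)

timed = events | beam.WindowInto(
    window.FixedWindows(60 * 60),         # where: event time
    trigger=AfterWatermark(               # when: processing time
        early=AfterProcessingTime(60)),   #   speculative early panes
    accumulation_mode=AccumulationMode.ACCUMULATING,
    allowed_lateness=0)                   # drop data arriving past the watermark
```

Dropping the `early=` firing gives pure watermark-driven, batch-like output; replacing the watermark trigger with a repeated processing-time trigger gives pure repeated-update semantics.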
How - Refinement Relationships
The final question: how do multiple results for the same window relate to each other? When you emit results multiple times, what's the relationship between them?
Three accumulation modes:
Discarding: Each result is independent. Think of them as deltas or incremental updates.
Accumulating: Each result is the complete, up-to-date value incorporating all previous data. Later results supersede earlier ones.
Accumulating & Retracting: You emit both the new complete value AND an explicit retraction of the previous value, allowing downstream systems to correct their state precisely.
Why this matters: The choice affects how downstream consumers must interpret and handle your results.
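To make the difference concrete, suppose one window's trigger fires twice: first after values 3 and 4 have arrived, then again once a 5 shows up before the window closes. A sketch of how the two modes Beam exposes are selected (hypothetical `scores` input; the comments trace what each mode would emit):

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AfterCount, AfterWatermark, AccumulationMode)

# If one window's panes see [3, 4] and then 5:
#   discarding:              pane 1 -> 7, pane 2 -> 5       (deltas; consumer sums)
#   accumulating:            pane 1 -> 7, pane 2 -> 12      (latest value wins)
#   accumulating+retracting: pane 1 -> 7, pane 2 -> 12, -7  (explicit correction)

deltas = scores | "Deltas" >> beam.WindowInto(
    window.FixedWindows(60 * 60),
    trigger=AfterWatermark(early=AfterCount(2)),  # early pane after 2 elements
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=0)

latest = scores | "Latest" >> beam.WindowInto(
    window.FixedWindows(60 * 60),
    trigger=AfterWatermark(early=AfterCount(2)),
    accumulation_mode=AccumulationMode.ACCUMULATING,
    allowed_lateness=0)

# Retractions are part of the model but not currently exposed as a public
# accumulation mode in Beam's SDKs, so no third variant appears here.
```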
The Power of This Framework
What makes these four questions powerful is that they're orthogonal—you can mix and match answers independently. You might compute sums (what) over hourly windows (where) with early speculative triggers every minute (when) using accumulating mode (how). Or the same computation with completely different temporal semantics.
This framework transforms stream processing from an ad-hoc collection of techniques into a structured design space where you can reason clearly about your requirements and make deliberate choices.
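The hourly, accumulating configuration from L37's example is exactly the pipeline sketched at the top of this page; here is the same computation with completely different temporal semantics (again a hedged sketch with a hypothetical `scores` input):

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

# Same per-key sum (what) as the hourly/accumulating pipeline above, but:
# session windows (where), a single on-complete firing (when), and
# discarding mode (how) -- only the temporal answers changed.
session_totals = (
    scores
    | beam.WindowInto(
        window.Sessions(gap_size=10 * 60),
        trigger=AfterWatermark(),
        accumulation_mode=AccumulationMode.DISCARDING,
        allowed_lateness=0)
    | beam.CombinePerKey(sum))
```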