Unbounded data: Batch


Data Processing Pattern Definitions

  • Unbounded data: Continuous, infinite streams of data (e.g., user clicks, sensor readings, log events) that have no defined end.

  • Bounded data: Finite datasets with a clear beginning and end (traditional batch processing).

  • Window: A finite slice of data along temporal boundaries. Windows allow us to compute aggregations over subsets of infinite streams.

  • Fixed windows (also called tumbling windows): Non-overlapping, consistent-duration time intervals. Each event belongs to exactly one window.

  • Sliding windows (also called hopping windows): Fixed-duration windows that overlap. Defined by a window size (duration) and a period (frequency). Because they overlap, a single event belongs to multiple windows (e.g., a "last 1 hour" window updated every 5 minutes).

  • Sessions: Dynamic windows that capture periods of activity separated by gaps of inactivity. Sessions end after a timeout period with no new events.


Batch Processing of Unbounded Data

When using batch systems to process streaming data, you typically accumulate data for some period, then run a batch job. Let me show you how this works for different windowing strategies:

  1. Fixed Windows in Batch

Fixed windows divide time into consistent, non-overlapping intervals. In the diagram, I'm using 2-minute windows:

  • Window 1 (12:00-12:02): Contains 3 events

  • Window 2 (12:02-12:04): Contains 2 events

  • Window 3 (12:04-12:06): Contains 3 events

  • Window 4 (12:06-12:08): Empty (no events)

  • Window 5 (12:08-12:10): Contains 2 events

Key properties:

  • Every window has the same duration

  • Windows don't overlap

  • Every event belongs to exactly one window

  • Simple to implement and reason about

Batch processing approach: Collect data for some period (say, 10 minutes), then run a batch job that assigns each event to its appropriate fixed window based on event time, computing aggregations for each window.
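To make that assignment step concrete, here is a minimal Python sketch (my own illustration, not tied to any particular engine). Event times are expressed as seconds past 12:00, and the 2-minute window size matches the diagram:

```python
# Minimal sketch: bucket events into fixed (tumbling) windows in a batch job.
from collections import defaultdict

def fixed_window_counts(events, window_size):
    """Assign each (event_time, value) pair to exactly one window,
    keyed by the window's start time."""
    windows = defaultdict(int)
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        windows[window_start] += value
    return dict(windows)

# Events at 12:01, 12:03, and 12:05, as seconds past 12:00:
events = [(60, 1), (180, 1), (300, 1)]
print(fixed_window_counts(events, window_size=120))
# {0: 1, 120: 1, 240: 1} -> windows starting at 12:00, 12:02, 12:04
```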

  2. Sliding Windows in Batch

Sliding windows (also called hopping windows) are fixed-duration windows that overlap. They are defined by two parameters: the Window Size (how long the window lasts) and the Window Period (how frequently a new window starts).

In the diagram (using a 4-minute Window Size and a 2-minute Period):

  • Window 1 (12:00–12:04): Captures all events in these 4 minutes.

  • Window 2 (12:02–12:06): Starts 2 minutes later. Notice it overlaps with Window 1. Events occurring between 12:02 and 12:04 belong to both Window 1 and Window 2.

  • Window 3 (12:04–12:08): Starts 2 minutes later. Overlaps with Window 2.

  • Result: Every single event belongs to multiple windows (specifically: Size / Period = 4 / 2 = 2 windows per event).

Key properties:

  • Fixed duration: Every window is the exact same length (unlike Sessions).

  • Overlapping: Windows share data. This is useful for "moving averages" (e.g., "Give me the average CPU load of the last 5 minutes, updated every 10 seconds").

  • Deterministic: You know exactly when windows start and end beforehand (unlike Sessions).

  • More expensive: Because every event must be processed/stored multiple times (once for every window it falls into).

Batch processing approach: The batch job iterates through the dataset. For every event found at time T, the system must mathematically calculate which open windows cover time T and duplicate the event into all of them. For example, an event at 12:03 is added to the 12:00–12:04 bucket AND the 12:02–12:06 bucket.
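The window-membership math is straightforward to sketch. The illustrative Python function below (names are my own) computes which sliding windows cover a given event time; with a 4-minute size and 2-minute period it reproduces the 12:03 example:

```python
def windows_for_event(event_time, size, period):
    """Return the start times of all sliding windows containing event_time.
    A window starting at s covers [s, s + size); starts are multiples of period."""
    starts = []
    s = (event_time // period) * period  # latest window start <= event_time
    while s > event_time - size:         # walk back while the window still covers it
        if s >= 0:
            starts.append(s)
        s -= period
    return sorted(starts)

# Event at 12:03 (180 s), 4-minute size, 2-minute period:
print(windows_for_event(180, size=240, period=120))
# [0, 120] -> the 12:00-12:04 bucket AND the 12:02-12:06 bucket
```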

  3. Sessions in Batch

Sessions are much more dynamic. They capture periods of activity separated by inactivity gaps. A session continues as long as events keep arriving within the timeout period.

In the diagram (using a 1-minute timeout):

  • Session A: Spans 80 seconds with 3 events clustered together

  • Gap: More than 1 minute of inactivity

  • Session B: A brief 30-second burst with 2 events

  • Gap: Inactivity period

  • Session C: 70 seconds of activity with 3 events

  • Long gap: Extended inactivity

  • Session D: Another brief session with 2 events

Key properties:

  • Variable duration based on activity patterns

  • Defined by gaps (timeout periods) rather than fixed boundaries

  • Excellent for modeling user behavior

  • More complex to compute

Batch processing approach: After collecting data, the batch job sorts events by time, then iterates through them identifying session boundaries whenever it encounters a gap exceeding the timeout threshold.
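Here is a minimal, illustrative Python sketch of that sort-then-scan approach (timestamps and the timeout are in seconds; the function name is my own):

```python
def sessionize(timestamps, timeout):
    """Group event timestamps into sessions: a gap larger than the
    timeout closes the current session and starts a new one."""
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] > timeout:
            sessions.append(current)   # gap exceeded -> close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# With a 60-second timeout, the 120-second gap splits these into two sessions:
print(sessionize([0, 40, 80, 200, 230], timeout=60))
# [[0, 40, 80], [200, 230]]
```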


The Batch Limitation

All three approaches share a fundamental constraint in batch mode: high latency. You must wait to accumulate sufficient data before running the batch job. If you run hourly batches, your results are at least an hour delayed. This is acceptable for analytics but problematic for real-time applications like fraud detection or monitoring. Batch systems naturally prioritize Completeness (waiting for all data) over Latency.

Concrete Example of Batch latency

Let's say you're tracking website clicks:

Batch approach with hourly windows:

  • Clicks happen: 2:00 PM - 3:00 PM

  • Wait for late data: Until 3:30 PM

  • Run batch job: 3:30 PM - 3:45 PM (15 minutes of processing)

  • Results available: 3:45 PM

Total latency: 45-105 minutes, depending on when the click occurred (a click at 2:00 PM waits 105 minutes for its result; a click at 3:00 PM waits 45)!

Streaming approach:

  • Clicks happen: 2:00 PM - 3:00 PM

  • Process continuously in real-time

  • Preliminary results: Available within seconds

  • Refined results: Update as late data arrives

Total latency: Seconds to minutes


Why Streaming Systems Exist

This latency problem is exactly why streaming systems were developed. They process events continuously as they arrive, providing:

  • Low-latency results (seconds instead of minutes/hours)

  • Continuous updates (results refine as new data arrives)

  • Ability to handle late data (without waiting for everything upfront)

The trade-off is complexity: streaming systems must handle incomplete data, out-of-order events, and provide mechanisms (like watermarks and triggers) to decide when to emit results.

Streaming systems are architecturally designed for unbounded data, rather than trying to force infinite streams into finite batches.


The Trade-off Triangle of data processing systems

Just like distributed databases have the CAP theorem, data processing systems have a "CAP theorem" of their own: the Trade-off Triangle. You generally can't optimize all three dimensions simultaneously:

  1. Latency: How fast do I get results?

  2. Completeness: How accurate is the result? (Did we count every record, even the late ones?)

  3. Cost: How much computing power/money does this consume?

Where Batch fits:

  • Batch prioritizes Completeness and Cost.

  • It waits for all data (high completeness) and processes it efficiently in bulk (low cost), but sacrifices Latency (you wait hours for the answer).

Where Traditional Streaming fits:

  • Streaming prioritizes Latency.

  • It gives answers immediately. To keep cost low, traditional streaming systems often sacrificed Completeness (giving approximate answers).

  • Modern Streaming Goal: To give you low Latency and high Completeness, but usually at a higher Cost.

    • Caveat: Modern streaming systems can actually achieve comparable or better cost efficiency than batch for continuous workloads, because they eliminate the overhead of repeated job startup/teardown and can optimize resource usage. The higher cost typically comes from infrastructure complexity and operational overhead rather than pure computational efficiency.


The Root Cause of Batch Inefficiency for Unbounded Data: The "Coupling" of Time

While the Latency trade-off is obvious, the deeper architectural flaw in Batch systems is the Coupling of Processing Time and Event Time.

In a traditional Batch job, the "Window" is often defined by the schedule of the job itself:

  • If you run a daily batch job, your "Window" is forced to be 24 hours.

  • If you want 30-minute windows, you are forced to run 48 tiny batch jobs a day.

This approach conflates Semantics (The definition of the window, e.g., "30 minutes") with Execution (The engine schedule, e.g., "Run a job"). Window semantics become dictated by operational scheduling rather than business logic.

The "Impedance Mismatch" This coupling becomes a nightmare when dealing with Sessions or Late Data:

  • A user session that crosses the midnight boundary gets split into two separate batch runs.

  • To fix this, the batch engineer has to write complex "spillover" logic to look back at yesterday's batch.

  • The logic for correctness becomes entangled with the logic for scheduling.


The Solution: Decoupling (The Dataflow Model)

To solve this, the Dataflow Model (introduced in the book Streaming Systems) proposes a new mental model. We must stop thinking about "Batch vs. Streaming" engines, and instead separate the logical definition of the computation from the physical execution of the engine.

We do this by asking four distinct questions that decouple these concerns.

1. WHAT results are calculated? (Transformations)

  • Concept: The mathematical logic of the pipeline.

  • Examples: Sums, Counts, Machine Learning inference, Parsing JSON.

  • Decoupling: This is purely about the data transformation. It is time-agnostic. SUM(clicks) is the same math whether you run it every second or every year.

2. WHERE in event time are results calculated? (Windowing)

  • Concept: How the infinite stream is chopped into finite temporal chunks.

  • Examples: Fixed Windows (Hourly), Sliding Windows (Moving Averages), Session Windows (Activity-based).

  • Decoupling: This defines the window based on Event Time (when the data happened). It is completely independent of when the system processes it.

3. WHEN in processing time are results materialized? (Triggers & Watermarks)

  • Concept: The control mechanism for emitting results.

  • The Problem: Since the stream never ends, when do we say "The window is done"?

  • The Tools:

    • Watermarks: A declaration that "I believe all input for this window has arrived." (Completeness).

    • Triggers: A command to "Output the result now" (Latency). You can trigger early (speculative results), on-time (watermark pass), or late (corrections).

4. HOW do refinements of results relate? (Accumulation)

  • Concept: When you emit multiple results for the same window (e.g., a preliminary count followed by a final count), how should the consumer handle the update?

  • Modes:

    • Discarding: "Forget the last value, here is the new independent value."

    • Accumulating: "Here is the total so far (including the previous value)."

    • Retracting: "Subtract the old wrong value X, and add the new correct value Y."
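These four questions map almost one-to-one onto Apache Beam's API, which implements the Dataflow Model. The sketch below is illustrative rather than production code: the bounded Create source stands in for a real unbounded one (e.g., Kafka or Pub/Sub), and the specific trigger, lateness, and key names are my own choices:

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    (
        p
        # Stand-in for a real unbounded source; elements are (key, count) pairs.
        | beam.Create([("user1", 1), ("user1", 1), ("user2", 1)])
        # WHERE: chop event time into fixed 60-second windows.
        | beam.WindowInto(
            window.FixedWindows(60),
            # WHEN: emit a speculative result every 30 s of processing time,
            # an on-time result when the watermark passes the window's end,
            # and a correction after each late element.
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(30),
                late=trigger.AfterCount(1)),
            allowed_lateness=600,
            # HOW: each new pane includes everything seen so far.
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        # WHAT: a per-key sum; the transformation itself is time-agnostic.
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```

To see the HOW question in action: if a window first emits a count of 3 and then 2 late events arrive, ACCUMULATING emits 5 on the next firing, while DISCARDING would emit only the new 2.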

