Unbounded data: Streaming



Batch processing handles unbounded data in an ad hoc way, by forcing infinite data into finite boxes. Streaming systems, by contrast, are architecturally built for unbounded data. The low latency they offer has a price, however: they must handle real-world data as it arrives, disorder and delays included.

The Two Characteristics of Real-World Data

When you move away from batch, you lose the ability to "sort" the file before processing. You must deal with data as it arrives, which is usually:

  1. Highly Unordered: Data rarely arrives in the order it was generated (because of differing network paths, pauses, and so on). To analyze it in the order the events actually occurred, you often need a "time-based shuffle" in the pipeline to reorganize it.

  2. Varying Event-Time Skew: The delay between when an event happens and when it arrives (the skew) is not constant. You cannot assume, for example, that all data for 12:00 will have arrived by 12:05.
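
To make the two characteristics concrete, here is a minimal sketch in Python. The `Event` type, its field names, and the sample timestamps are assumptions invented for illustration; they do not come from any particular streaming framework. The sketch measures the skew of each event and checks whether arrival order matches event-time order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical record type: both fields are seconds since an arbitrary epoch.
    event_time: int       # when the event actually happened
    processing_time: int  # when the pipeline observed it


def skews(events):
    """Return processing_time - event_time for each event, in the given order."""
    return [e.processing_time - e.event_time for e in events]


def is_event_time_ordered(events):
    """True if events, taken in arrival order, are also sorted by event time."""
    arrival_order = sorted(events, key=lambda e: e.processing_time)
    times = [e.event_time for e in arrival_order]
    return times == sorted(times)


# Made-up sample data: the third event is delayed far longer than the others.
EVENTS = [
    Event(event_time=0, processing_time=5),
    Event(event_time=70, processing_time=75),
    Event(event_time=50, processing_time=130),
    Event(event_time=125, processing_time=128),
]

print(skews(EVENTS))                  # skew differs from event to event
print(is_event_time_ordered(EVENTS))  # arrival order is not event-time order
```

In this sample the skews are 5, 5, 80, and 3 seconds, and the delayed event arrives after one that happened later, so the data is both variably skewed and unordered.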


Now that we know data is infinite and time is skewed, what are the actual ways we can process it?

Processing Strategies

Given these characteristics (infinite, unordered, skewed), there are four main approaches to processing this data. They range from simple (ignoring time) to complex (reasoning about skew).

  1. Time-Agnostic: Processing that doesn't care about time (e.g., Filtering, Mapping).

  2. Approximation: Algorithms that give rough answers with low latency (e.g., Top-N, Approximate Counts).

  3. Windowing by Processing Time: Chopping the stream based on when data arrives. Simple, but inaccurate if skew exists.

  4. Windowing by Event Time: Chopping the stream based on when data happened. More complex (the system needs a mechanism such as watermarks to judge when a window is complete), but accurate even when skew exists.
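
The difference between the last two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any real engine implements windowing: the record layout, the 60-second fixed window, and the helper names `window_start` and `count_by_window` are all hypothetical. It also replays a finite list, so it sidesteps the hard part of event-time windowing, which is deciding when a window is complete.

```python
from collections import Counter

WINDOW = 60  # assumed fixed (tumbling) window size, in seconds


def window_start(ts, size=WINDOW):
    """Map a timestamp to the start of the fixed window that contains it."""
    return ts - ts % size


def count_by_window(records, key):
    """Count records per window, using the timestamp stored under `key`."""
    counts = Counter()
    for rec in records:
        counts[window_start(rec[key])] += 1
    return dict(counts)


# Made-up sample data: the third record happened at t=55 but arrived at t=130.
RECORDS = [
    {"event_time": 10, "processing_time": 15},
    {"event_time": 50, "processing_time": 58},
    {"event_time": 55, "processing_time": 130},
    {"event_time": 70, "processing_time": 72},
]

by_processing = count_by_window(RECORDS, "processing_time")
by_event = count_by_window(RECORDS, "event_time")

print(by_processing)  # the late record is counted in the window starting at 120
print(by_event)       # the late record is counted in the window starting at 0
```

Windowing by processing time places the late record in the window where it arrived, so the first minute is undercounted. Windowing by event time places it where it happened, at the cost of having to wait for, or later correct for, late data.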


