Out-of-order events

Out-of-order data is one of the fundamental challenges in stream processing. This page explains what it is, why it happens, and why it matters so much.

What is Out-of-Order Data?

Out-of-order data occurs when events arrive at your processing system in a different sequence than the order in which they actually occurred (based on event time).

Simple Example

Events occur (event time):
Event A: 12:00:15
Event B: 12:00:30  
Event C: 12:00:45
Event D: 12:01:00

Events arrive at system (processing time):
12:05:00 → Event B (12:00:30)  ← Arrives first, but happened second
12:05:01 → Event D (12:01:00)  ← Arrives second
12:05:02 → Event A (12:00:15)  ← Arrives third, but happened FIRST!
12:05:03 → Event C (12:00:45)  ← Arrives last

This is out-of-order!

In order: event time matches arrival order.

Out of order: event time does NOT match arrival order.
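As a rough illustration, the check below walks the arrival sequence from the example and flags every event whose event time is earlier than the largest event time seen so far. The function name `find_out_of_order` and the tuple layout are assumptions made for this sketch, not part of any framework.

```python
from datetime import time

# (name, event_time) in arrival order, taken from the example above
arrivals = [
    ("B", time(12, 0, 30)),
    ("D", time(12, 1, 0)),
    ("A", time(12, 0, 15)),
    ("C", time(12, 0, 45)),
]

def find_out_of_order(events):
    """Return names of events that arrive after a later-timestamped event."""
    late = []
    max_seen = None
    for name, event_time in events:
        if max_seen is not None and event_time < max_seen:
            late.append(name)
        else:
            max_seen = event_time
    return late

print(find_out_of_order(arrivals))  # ['A', 'C']
```

Events A and C are flagged because Event D (12:01:00) had already arrived before them.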

Why Does Out-of-Order Data Happen?

1. Network Variability

Different network paths have different latencies:

# Two users click at the same time
User A in New York: 
  Click at 12:00:00 → 50ms network → Arrives 12:00:00.050

User B in Tokyo:
  Click at 12:00:00 → 250ms network → Arrives 12:00:00.250

# Now suppose User B clicks slightly BEFORE User A
User B again:
  Click at 11:59:59.900 → 250ms network → Arrives 12:00:00.150

# Arrival order: A (event time 12:00:00.000), B (event time 11:59:59.900)  ← Out of order!

2. Mobile/Offline Devices

Mobile devices buffer events when offline:

# User's phone goes offline
11:30 - User opens app (offline) → event buffered locally
11:35 - User clicks button (offline) → event buffered locally
11:40 - User submits form (offline) → event buffered locally

# Phone reconnects at 14:00
14:00 - All buffered events uploaded at once!
14:00:01 → Event from 11:30 arrives
14:00:01 → Event from 11:35 arrives  
14:00:01 → Event from 11:40 arrives

# These events are between 2 h 20 min and 2 h 30 min late!
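Lateness is simply arrival time minus event time. The sketch below computes it for the three buffered events, assuming all of them arrive at 14:00:01; the calendar date is arbitrary and the helper name `lateness` is hypothetical.

```python
from datetime import datetime

def lateness(event_time, arrival_time):
    """Delay between when an event happened and when it arrived."""
    return arrival_time - event_time

day = (2024, 1, 1)  # arbitrary date, only the clock times matter
arrival = datetime(*day, 14, 0, 1)
buffered = [
    datetime(*day, 11, 30),
    datetime(*day, 11, 35),
    datetime(*day, 11, 40),
]

delays = [lateness(t, arrival) for t in buffered]
for t, d in zip(buffered, delays):
    print(t.time(), "late by", d)
```

Each event has a different delay even though all three arrive in the same second, which is why a single fixed "expected delay" rarely describes mobile traffic well.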

3. Distributed Systems / Multiple Data Centers

Events from different sources merge:

# Source A (fast):
Event A1: 12:00:00 → arrives 12:00:01

# Source B (slow, experiencing backpressure):
Event B1: 12:00:00 → arrives 12:00:30  (30 seconds late)

# Even though B1 happened at same time as A1,
# it arrives much later → out of order relative to later events from A

4. Clock Skew

Different machines have slightly different clocks:

# Server 1 (clock 2 seconds fast):
Event from Server 1: timestamp = 12:00:02 (but really 12:00:00)

# Server 2 (clock accurate):
Event from Server 2: timestamp = 12:00:01

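# True order: Server 1's event (really 12:00:00) happened before Server 2's (12:00:01)
# If Server 1's event arrives first:
# We see: 12:00:02, then 12:00:01 (timestamps look out of order, although arrival matches the true order)
# If Server 2's event arrives first:
# We see: 12:00:01, then 12:00:02 (timestamps look in order, but the true order is reversed)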

5. Parallel Processing Pipelines

Different processing speeds:

# Fast lane (simple processing):
Event X: 12:00:00 → processed in 100ms → arrives 12:00:00.1

# Slow lane (complex processing with database lookups):
Event Y: 11:59:58 → processed in 5 seconds → arrives 12:00:03

# Y happened first but arrives last!

6. Retries and Replays

Failed deliveries retry later:

# First attempt:
12:00:00 - Event A sent → network failure → lost

# Events B, C, D succeed:
12:00:05 - Event B sent → arrives
12:00:10 - Event C sent → arrives
12:00:15 - Event D sent → arrives

# Retry of A:
12:00:20 - Event A retried → arrives (20 seconds late!)

# Arrival: B, C, D, A (but event times: A, B, C, D)

Why Out-of-Order Data Is Problematic

Problem 1: Wrong Window Assignment (with Processing-Time Windows)

# Using processing-time windows [12:00-12:10]

Event A occurs at 12:05, arrives at 12:11
→ Goes to window [12:10-12:20] ❌ (WRONG!)

Event B occurs at 12:08, arrives at 12:09  
→ Goes to window [12:00-12:10] ✓ (correct)

# Result: Event A is counted in the wrong time period
# Your report for 12:00-12:10 is missing Event A
# Your report for 12:10-12:20 incorrectly includes Event A
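The difference comes down to which timestamp is used to pick the window. A minimal sketch, assuming 10-minute tumbling windows aligned to the hour; `window_start` is a hypothetical helper, not a framework function.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)

def window_start(ts, size=WINDOW):
    """Floor a timestamp to the start of its tumbling window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // size) * size

# Event A from the example: occurs at 12:05, arrives at 12:11
event_time = datetime(2024, 1, 1, 12, 5)     # the date is arbitrary
arrival_time = datetime(2024, 1, 1, 12, 11)

print(window_start(arrival_time).time())  # 12:10:00 (processing-time window)
print(window_start(event_time).time())    # 12:00:00 (event-time window)
```

The same function gives the right or the wrong window depending only on which timestamp is passed in.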

Problem 2: Incorrect Aggregations

# Counting events per minute using processing-time

Reality (event times):
12:00 - 5 events
12:01 - 8 events  
12:02 - 3 events

What the system sees (processing-time windows):
12:00 window: 3 events (2 from 12:00 arrive late)
12:01 window: 6 events (2 late ones from 12:00, plus 4 from 12:01; the other 4 arrive late)
12:02 window: 7 events (4 late ones from 12:01, plus the correct 3)

# All windows have wrong counts!
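To make the miscount concrete, the sketch below groups the same 16 events by event minute and by arrival minute. The arrival pattern is hypothetical data chosen for illustration; it produces per-window totals of 3, 6, and 7 under processing time.

```python
from collections import Counter

# (event_minute, arrival_minute) pairs; hypothetical data for illustration
events = (
    [("12:00", "12:00")] * 3 + [("12:00", "12:01")] * 2
    + [("12:01", "12:01")] * 4 + [("12:01", "12:02")] * 4
    + [("12:02", "12:02")] * 3
)

by_event_time = Counter(event for event, _ in events)
by_arrival_time = Counter(arrival for _, arrival in events)

print(dict(by_event_time))    # {'12:00': 5, '12:01': 8, '12:02': 3}
print(dict(by_arrival_time))  # {'12:00': 3, '12:01': 6, '12:02': 7}
```

The totals agree (16 events either way), so the error does not show up in overall volume, only in the per-window breakdown.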

Problem 3: Ordering-Dependent Operations Break

# Example: Session detection based on arrival order

Actual sequence (event times):
11:00 - Login
11:05 - View page A
11:10 - Add to cart
11:15 - Checkout

Arrival sequence:
11:10 - Add to cart      ← Arrives first!
11:00 - Login           ← Arrives second
11:15 - Checkout        ← Arrives third  
11:05 - View page A     ← Arrives last

# If you process in arrival order:
# Session looks like: cart → login → checkout → page view
# Makes no sense! User can't add to cart before logging in
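If every event of the session has already arrived, sorting by event time restores the real sequence. A minimal sketch using the four events above; the tuple layout is an assumption.

```python
# (event_time, action) in arrival order, taken from the example above
arrivals = [
    ("11:10", "Add to cart"),
    ("11:00", "Login"),
    ("11:15", "Checkout"),
    ("11:05", "View page A"),
]

# Zero-padded HH:MM strings sort chronologically within a single day
session = sorted(arrivals, key=lambda e: e[0])
print(" → ".join(action for _, action in session))
# Login → View page A → Add to cart → Checkout
```

The catch is the "if" at the start: in a live stream you cannot sort until you know nothing earlier is still on its way, which is exactly the completeness problem described next.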

Problem 4: Impossible to Know "Completeness"

# Window [12:00-12:10], current time is 12:15

# You've seen events with timestamps:
# 12:00:15, 12:01:30, 12:03:45, 12:05:20, 12:08:10

# Questions:
# - Have I seen ALL events from 12:00-12:10?
# - Is there a mobile user offline with events from 12:02?
# - Did a network partition delay events from 12:07?

# Without watermarks: YOU DON'T KNOW! ⚠️
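One common heuristic is to define the watermark as the largest event time seen so far minus a fixed lag, and to treat a window as complete once the watermark passes its end. The sketch below assumes a 5-minute lag and times expressed as seconds since midnight; the class and helper names are hypothetical.

```python
def hms(h, m, s=0):
    """Clock time as seconds since midnight."""
    return h * 3600 + m * 60 + s

class LagWatermark:
    """Heuristic watermark: max event time seen minus a fixed lag (seconds)."""

    def __init__(self, lag):
        self.lag = lag
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.lag

    def is_complete(self, window_end):
        # "Complete" is only an estimate: events later than the lag can still arrive
        return self.current() >= window_end

wm = LagWatermark(lag=hms(0, 5))
for ts in [hms(12, 0, 15), hms(12, 1, 30), hms(12, 3, 45), hms(12, 5, 20), hms(12, 8, 10)]:
    wm.observe(ts)
print(wm.is_complete(hms(12, 10)))  # False: watermark is only at 12:03:10

wm.observe(hms(12, 15))
print(wm.is_complete(hms(12, 10)))  # True: watermark has reached 12:10:00
```

This does not answer the three questions with certainty. It replaces "I don't know" with "probably, given the lag I chose".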

How Streaming Systems Handle Out-of-Order Data

Approach 1: Ignore It (Processing-Time Windows)

# Just process based on arrival
# Fast and simple, but incorrect for event-time analysis

Pros: Simple

Cons: Wrong results for event-time analysis

Approach 2: Buffer and Sort (Limited Reordering)

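# Buffer events for N seconds, sort by event time, then process
import time

buffer = []            # filled by the ingestion side
buffer_duration = 5    # seconds

def handle(event):
    print(event)       # placeholder for real processing

def process():
    time.sleep(buffer_duration)
    # Sort the batch by event time, then clear it so each event is handled once
    sorted_events = sorted(buffer, key=lambda e: e.event_time)
    buffer.clear()
    for event in sorted_events:
        handle(event)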

Pros: Handles small amounts of disorder

Cons: Doesn't handle large delays, adds latency
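A streaming variant of the same idea keeps events in a min-heap and releases an event once it is at least `max_delay` older than the newest event time seen. This is a sketch under stated assumptions: `ReorderBuffer` is a hypothetical name, times are plain seconds, and an event that is later than `max_delay` is still released, just out of order.

```python
import heapq

class ReorderBuffer:
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.heap = []
        self.max_seen = float("-inf")
        self.counter = 0  # tie-breaker so equal timestamps keep arrival order

    def push(self, event_time, payload):
        """Add one event and return the events that are now safe to release."""
        heapq.heappush(self.heap, (event_time, self.counter, payload))
        self.counter += 1
        self.max_seen = max(self.max_seen, event_time)
        ready = []
        while self.heap and self.heap[0][0] <= self.max_seen - self.max_delay:
            t, _, p = heapq.heappop(self.heap)
            ready.append((t, p))
        return ready

    def flush(self):
        """Release everything that is still buffered, in event-time order."""
        ready = []
        while self.heap:
            t, _, p = heapq.heappop(self.heap)
            ready.append((t, p))
        return ready

buf = ReorderBuffer(max_delay=60)
released = []
for t, name in [(30, "B"), (60, "D"), (15, "A"), (45, "C"), (130, "E")]:
    released.extend(buf.push(t, name))
released.extend(buf.flush())
print([name for _, name in released])  # ['A', 'B', 'C', 'D', 'E']
```

The trade-off is the same as in the batch version: a larger `max_delay` tolerates more disorder but holds every event back for longer.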

Approach 3: Watermarks + Late Data Handling (Modern Approach)

# Use watermarks to estimate completeness
# Allow late data to trigger corrections
# (Pseudocode: the names below mix Flink- and Beam-style APIs and are illustrative only)

stream \
  .window(TumblingEventTimeWindows.of(10_minutes)) \
  .withWatermark(lag=5_minutes) \
  .trigger(
    AfterWatermark()
      .withLateFirings(AfterCount(1))
  ) \
  .accumulating()

# Timeline for window [12:00-12:10]:
# 12:15 - Watermark reaches 12:10 → emit "final" result
# 12:18 - Late event from 12:09 arrives → emit correction

Pros: Handles large delays; results are corrected as late data arrives

Cons: More complex
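The behaviour in the timeline can be imitated in a few lines of plain Python. This is a toy simulation, not framework code: it assumes the watermark advances with observed event times (max event time minus the lag) rather than with a wall clock, counts events per 10-minute window, and emits an accumulated count again for every late event. All names are hypothetical.

```python
from collections import defaultdict

WINDOW = 600   # 10-minute tumbling windows, in seconds
LAG = 300      # watermark trails the max event time by 5 minutes

def run(events):
    """events: event times (seconds) in arrival order.
    Returns a list of (window_start, count, kind) emissions."""
    counts = defaultdict(int)
    fired = set()
    emitted = []
    max_seen = float("-inf")
    for t in events:
        start = (t // WINDOW) * WINDOW
        counts[start] += 1
        if start in fired:
            # Window already fired: emit an updated (accumulated) result
            emitted.append((start, counts[start], "late"))
        max_seen = max(max_seen, t)
        watermark = max_seen - LAG
        for w in sorted(counts):
            if w not in fired and w + WINDOW <= watermark:
                fired.add(w)
                emitted.append((w, counts[w], "on-time"))
    return emitted

# Seconds relative to 12:00:00: three events in [12:00-12:10), one at 12:15,
# then a late event from 12:09
print(run([60, 300, 500, 900, 540]))
# [(0, 3, 'on-time'), (0, 4, 'late')]
```

The event at 12:15 pushes the watermark to 12:10 and fires the first window with a count of 3; the late 12:09 event then triggers a correction to 4.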

Measuring Out-of-Order-ness

You can quantify how out-of-order your data is:

# Event-time skew: How far back in time do events arrive?

class SkewMonitor:
    def __init__(self):
        self.max_event_time_seen = None
    
    def observe(self, event):
        if self.max_event_time_seen is None:
            self.max_event_time_seen = event.event_time
        else:
            skew = self.max_event_time_seen - event.event_time
            if skew > 0:
                print(f"Out of order by {skew} seconds")
            self.max_event_time_seen = max(
                self.max_event_time_seen, 
                event.event_time
            )

# Output might show (in-order events print nothing):
# Out of order by 2 seconds
# Out of order by 45 seconds  
# Out of order by 1800 seconds (30 minutes!)
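Once skews are recorded, summarising them as percentiles gives a rough basis for choosing a buffering delay or watermark lag. The sketch below uses a nearest-rank percentile over hypothetical skew values in seconds; it is one possible way to summarise them, not a prescribed method.

```python
def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for an integer p in 1..100."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceiling division
    return ordered[rank - 1]

# Hypothetical skews in seconds, as a monitor like the one above might record
skews = [0, 0, 0, 1, 2, 3, 5, 10, 45, 1800]

print(percentile(skews, 50))   # 2
print(percentile(skews, 90))   # 45
print(percentile(skews, 100))  # 1800
```

With this data a 45-second lag would cover 90% of events, while the 30-minute outlier would have to be handled as late data.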

Real-World Example: E-commerce Analytics

# Scenario: Calculate sales per hour

# Reality:
# 14:00-15:00 → $10,000 in sales (100 orders)

# Problem: Orders arrive out of order
# - 90 orders arrive within 5 minutes (on time)
# - 5 orders arrive 10 minutes late (mobile checkout completed offline)
# - 3 orders arrive 2 hours late (payment gateway retry)
# - 2 orders never arrive (lost in network)

# Processing-time approach:
# 14:00-15:00 window closes at 15:00 (wall-clock time)
# Count: at most 90 orders, $9,000 ❌ (missing at least $1,000)
# Late orders are counted in later hourly windows instead

# Event-time with watermarks:
# 14:00-15:00 window first fires at 15:05 (5 min watermark lag)
# Initial: 90 orders, $9,000
# 15:10: +5 orders, updated to $9,500
# 17:00: +3 orders, updated to $9,800 (requires allowed lateness of at least 2 hours)
# Final: 98 orders, $9,800 (still missing the 2 lost orders, but much closer to the truth)
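The running totals can be reproduced with simple arithmetic. The sketch assumes every order is worth $100, which is what $10,000 across 100 orders implies on average; the variable names are hypothetical.

```python
ORDER_VALUE = 100   # dollars; assumes each order is worth $100
TOTAL_ORDERS = 100  # orders actually placed between 14:00 and 15:00

# (label, orders added at that point), following the timeline above
updates = [("15:05 initial", 90), ("15:10 late firing", 5), ("17:00 late firing", 3)]

seen = 0
history = []
for label, added in updates:
    seen += added
    revenue = seen * ORDER_VALUE
    coverage = seen / TOTAL_ORDERS
    history.append((label, seen, revenue, coverage))
    print(f"{label}: {seen} orders, ${revenue:,} ({coverage:.0%} of actual)")
```

The two lost orders never show up in any update, so no amount of late-data handling closes the last 2%.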

Takeaways from this

Out-of-order data is inevitable in distributed systems. The question isn't "how do I prevent it?" (you can't), but rather:

Do I care about event time?
- Yes → Must handle out-of-order data (watermarks, buffering)
- No → Can use processing time (simpler, but incorrect for temporal analysis)

How much disorder do I expect?
- Milliseconds → Simple buffering might suffice
- Hours → Need watermarks with aggressive late-data handling

What's my tolerance for late data?
- Low → Tight watermarks and short allowed lateness: few corrections, but late events are dropped
- High → Loose watermarks and long allowed lateness: more corrections and more retained state

Out-of-order data is why modern streaming frameworks like Flink, Beam, and Kafka Streams have such sophisticated time-handling mechanisms. It is one of the core challenges that make stream processing hard.


Last updated 1 month ago
