# Out-of-order events

Out-of-order data is one of the fundamental challenges in stream processing. This page explains what it is, why it happens, and why it matters so much.

### What is Out-of-Order Data?

**Out-of-order data** occurs when events arrive at your processing system in a different sequence than the order in which they actually occurred (based on event time).

#### Simple Example

```
Events occur (event time):
Event A: 12:00:15
Event B: 12:00:30  
Event C: 12:00:45
Event D: 12:01:00

Events arrive at system (processing time):
12:05:00 → Event B (12:00:30)  ← Arrives first, but happened second
12:05:01 → Event D (12:01:00)  ← Arrives second
12:05:02 → Event A (12:00:15)  ← Arrives third, but happened FIRST!
12:05:03 → Event C (12:00:45)  ← Arrives last

This is out-of-order!
```

**In order**: Event time matches arrival order\
**Out of order**: Event time does NOT match arrival order
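
The definition above can be checked mechanically. The following is a minimal sketch, assuming each event is a `(name, event_time)` tuple listed in arrival order; `is_in_order` and `out_of_order_events` are illustrative helpers, not part of any framework.

```python
from datetime import time

# Events listed in ARRIVAL order, each tagged with its event time
# (values taken from the simple example above).
arrivals = [
    ("B", time(12, 0, 30)),
    ("D", time(12, 1, 0)),
    ("A", time(12, 0, 15)),
    ("C", time(12, 0, 45)),
]

def is_in_order(events):
    """True if event times never decrease along the arrival sequence."""
    times = [event_time for _, event_time in events]
    return all(earlier <= later for earlier, later in zip(times, times[1:]))

def out_of_order_events(events):
    """Names of events whose event time is older than one already seen."""
    late, max_seen = [], None
    for name, event_time in events:
        if max_seen is not None and event_time < max_seen:
            late.append(name)
        else:
            max_seen = event_time
    return late

print(is_in_order(arrivals))
print(out_of_order_events(arrivals))
```

Here A and C are flagged because D, with a newer event time, had already arrived before them.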

### Why Does Out-of-Order Data Happen?

#### 1. **Network Variability**

Different network paths have different latencies:

```python
# User A in New York, fast network path:
User A: click at 12:00:00.000 → 50ms network → arrives 12:00:00.050

# User B in Tokyo clicks slightly EARLIER, but over a slower path:
User B: click at 11:59:59.900 → 250ms network → arrives 12:00:00.150

# Arrival order: A (event time 12:00:00.000), then B (event time 11:59:59.900)  ← Out of order!
```

#### 2. **Mobile/Offline Devices**

Mobile devices buffer events when offline:

```python
# User's phone goes offline
11:30 - User opens app (offline) → event buffered locally
11:35 - User clicks button (offline) → event buffered locally
11:40 - User submits form (offline) → event buffered locally

# Phone reconnects at 14:00
14:00 - All buffered events uploaded at once!
14:00:01 → Event from 11:30 arrives
14:00:01 → Event from 11:35 arrives  
14:00:01 → Event from 11:40 arrives

# These arrive roughly 2h20m to 2h30m after they occurred!
```
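
Lateness is simply arrival time minus event time. A minimal sketch for the offline-phone timeline above, assuming an arbitrary calendar date and a hypothetical five-minute cutoff for calling an event "late":

```python
from datetime import datetime, timedelta

def lateness(event_time, arrival_time):
    """How long after it occurred an event reached the system."""
    return arrival_time - event_time

day = datetime(2025, 1, 1)  # arbitrary date; only the clock times matter
upload_time = day.replace(hour=14, minute=0, second=1)
buffered = [day.replace(hour=11, minute=m) for m in (30, 35, 40)]

delays = [lateness(t, upload_time) for t in buffered]

threshold = timedelta(minutes=5)  # hypothetical cutoff for "late"
late_events = [t for t, d in zip(buffered, delays) if d > threshold]

for t, d in zip(buffered, delays):
    print(t.time(), "arrived", d, "after it occurred")
```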

#### 3. **Distributed Systems / Multiple Data Centers**

Events from different sources merge:

```python
# Source A (fast):
Event A1: 12:00:00 → arrives 12:00:01

# Source B (slow, experiencing backpressure):
Event B1: 12:00:00 → arrives 12:00:30  (30 seconds late)

# Even though B1 happened at same time as A1,
# it arrives much later → out of order relative to later events from A
```

#### 4. **Clock Skew**

Different machines have slightly different clocks:

```python
# Server 1 (clock 2 seconds fast):
Event from Server 1: timestamp = 12:00:02 (but really 12:00:00)

# Server 2 (clock accurate):
Event from Server 2: timestamp = 12:00:01

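# True order: Server 1's event (really 12:00:00) happened BEFORE Server 2's (12:00:01),
# but the skewed timestamps claim the opposite.
# If Server 1's event arrives first (matching reality):
# We see: 12:00:02, then 12:00:01 (looks out of order!)
# If Server 2's event arrives first:
# We see: 12:00:01, then 12:00:02 (looks in order, but the true order was reversed)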
```

#### 5. **Parallel Processing Pipelines**

Different processing speeds:

```python
# Fast lane (simple processing):
Event X: 12:00:00 → processed in 100ms → arrives 12:00:00.1

# Slow lane (complex processing with database lookups):
Event Y: 11:59:58 → processed in 5 seconds → arrives 12:00:03

# Y happened first but arrives last!
```

#### 6. **Retries and Replays**

Failed deliveries retry later:

```python
# First attempt:
12:00:00 - Event A sent → network failure → lost

# Events B, C, D succeed:
12:00:05 - Event B sent → arrives
12:00:10 - Event C sent → arrives
12:00:15 - Event D sent → arrives

# Retry of A:
12:00:20 - Event A retried → arrives (20 seconds late!)

# Arrival: B, C, D, A (but event times: A, B, C, D)
```

### Visual Representation

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FvRKlsNi7uXcXakc5ghbr%2FBildschirmfoto%202025-11-05%20um%2017.55.20.png?alt=media&#x26;token=c730b11a-03d3-456a-8939-9616e996eab8" alt=""><figcaption></figcaption></figure>

***

### Why Out-of-Order Data Is Problematic

#### Problem 1: Wrong Window Assignment (with Processing-Time Windows)

```python
# Using processing-time windows [12:00-12:10]

Event A occurs at 12:05, arrives at 12:11
→ Goes to window [12:10-12:20] ❌ (WRONG!)

Event B occurs at 12:08, arrives at 12:09  
→ Goes to window [12:00-12:10] ✓ (correct)

# Result: Event A is counted in the wrong time period
# Your 10-minute report for 12:00-12:10 is missing Event A
# Your report for 12:10-12:20 incorrectly includes Event A
```
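
The difference comes down to which timestamp is fed into the window function. A minimal sketch, assuming 10-minute tumbling windows aligned to the Unix epoch and an arbitrary calendar date; `window_for` is an illustrative helper:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def window_for(timestamp, size=timedelta(minutes=10)):
    """Return the (start, end) tumbling window that contains timestamp."""
    start = timestamp - (timestamp - EPOCH) % size
    return start, start + size

day = datetime(2025, 1, 1)                      # arbitrary date
event_time = day.replace(hour=12, minute=5)     # when Event A occurred
arrival_time = day.replace(hour=12, minute=11)  # when Event A arrived

event_window = window_for(event_time)         # event-time assignment
processing_window = window_for(arrival_time)  # processing-time assignment

print("event-time window:     ", [t.strftime("%H:%M") for t in event_window])
print("processing-time window:", [t.strftime("%H:%M") for t in processing_window])
```

The same function gives the right window or the wrong one depending only on whether it receives the event time or the arrival time.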

#### Problem 2: Incorrect Aggregations

```python
# Counting events per minute using processing-time

Reality (event times):
12:00 - 5 events
12:01 - 8 events  
12:02 - 3 events

What the system sees (processing-time windows):
12:00 window: 3 events (2 of its 5 arrive late)
12:01 window: 6 events (2 late ones from 12:00 plus 4 of its own; 4 more arrive late)
12:02 window: 7 events (4 late ones from 12:01 plus its own 3)

# The total (16) is preserved, but every window has the wrong count!
```
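
The same effect can be reproduced on a toy dataset. This sketch uses hypothetical `(event_minute, arrival_minute)` pairs that are unrelated to the numbers above, and counts them once per timestamp:

```python
from collections import Counter

# Hypothetical (event_minute, arrival_minute) pairs; three events arrive a minute late.
events = [
    ("12:00", "12:00"), ("12:00", "12:00"), ("12:00", "12:01"),
    ("12:01", "12:01"), ("12:01", "12:02"), ("12:01", "12:02"),
    ("12:02", "12:02"),
]

by_event_time = Counter(event_minute for event_minute, _ in events)
by_processing_time = Counter(arrival_minute for _, arrival_minute in events)

for minute in sorted(by_event_time):
    print(minute, "true:", by_event_time[minute], "seen:", by_processing_time[minute])
```

Both counters sum to the same total; late events are not lost, only credited to the wrong minute.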

#### Problem 3: Ordering-Dependent Operations Break

```python
# Example: Session detection based on arrival order

Actual sequence (event times):
11:00 - Login
11:05 - View page A
11:10 - Add to cart
11:15 - Checkout

Arrival sequence:
11:10 - Add to cart      ← Arrives first!
11:00 - Login           ← Arrives second
11:15 - Checkout        ← Arrives third  
11:05 - View page A     ← Arrives last

# If you process in arrival order:
# Session looks like: cart → login → checkout → page view
# Makes no sense! User can't add to cart before logging in
```
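
Once all events of a session are available, sorting by event time restores the real sequence. A minimal sketch, assuming the timestamps are zero-padded `HH:MM` strings from a single day (so they sort chronologically as text):

```python
# Events in arrival order: (event_time, action)
arrival_sequence = [
    ("11:10", "Add to cart"),
    ("11:00", "Login"),
    ("11:15", "Checkout"),
    ("11:05", "View page A"),
]

# Re-order by event time instead of trusting arrival order.
by_event_time = sorted(arrival_sequence, key=lambda event: event[0])
session = [action for _, action in by_event_time]

print(" → ".join(session))
```

The catch is the phrase "once all events are available": the sort is only safe when no earlier event is still in flight, which is the completeness problem.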

#### Problem 4: Impossible to Know "Completeness"

```python
# Window [12:00-12:10], current time is 12:15

# You've seen events with timestamps:
# 12:00:15, 12:01:30, 12:03:45, 12:05:20, 12:08:10

# Questions:
# - Have I seen ALL events from 12:00-12:10?
# - Is there a mobile user offline with events from 12:02?
# - Did a network partition delay events from 12:07?

# Without a completeness estimate such as a watermark, you cannot know! ⚠️
```
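
One common heuristic is to assume events are never more than a fixed amount late and to derive the watermark from the newest event time seen. The sketch below is a plain-Python illustration of that idea, not any framework's implementation; `BoundedLatenessWatermark`, `hms`, and the five-minute lag are assumptions. It also advances the watermark only when events arrive, whereas the scenario above talks about wall-clock time.

```python
def hms(h, m, s=0):
    """Clock time as seconds since midnight."""
    return h * 3600 + m * 60 + s

class BoundedLatenessWatermark:
    """Heuristic watermark: newest event time seen minus a fixed allowed lag."""

    def __init__(self, allowed_lag):
        self.allowed_lag = allowed_lag
        self.max_event_time = None

    def observe(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lag

    def window_probably_complete(self, window_end):
        """An estimate only: events later than allowed_lag can still show up."""
        watermark = self.current()
        return watermark is not None and watermark >= window_end

wm = BoundedLatenessWatermark(allowed_lag=5 * 60)
for t in [hms(12, 0, 15), hms(12, 1, 30), hms(12, 3, 45), hms(12, 5, 20), hms(12, 8, 10)]:
    wm.observe(t)

window_end = hms(12, 10)
before = wm.window_probably_complete(window_end)  # watermark is at 12:03:10
wm.observe(hms(12, 15))                           # a newer event moves it to 12:10:00
after = wm.window_probably_complete(window_end)
print(before, after)
```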

***

### How Streaming Systems Handle Out-of-Order Data

#### Approach 1: Ignore It (Processing-Time Windows)

```python
# Just process based on arrival
# Fast and simple, but incorrect for event-time analysis
```

**Pros:** Simple\
**Cons:** Wrong results whenever the analysis depends on event time

#### Approach 2: Buffer and Sort (Limited Reordering)

```python
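# Buffer events for N seconds, sort by event time, then process
import time

buffer = []          # arriving events are appended here
buffer_duration = 5  # seconds

def process(handle):
    """Wait for the buffer window, then pass events to handle() in event-time order."""
    time.sleep(buffer_duration)
    sorted_events = sorted(buffer, key=lambda e: e.event_time)
    buffer.clear()  # start the next batch empty
    for event in sorted_events:
        handle(event)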
```

**Pros:** Handles small amounts of disorder\
**Cons:** Doesn't handle large delays, adds latency
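
A variant of this approach avoids sleeping by keeping a min-heap and releasing events as soon as they are older than the newest event time minus a maximum delay. This is a sketch under assumptions: `ReorderBuffer` and its methods are hypothetical names, and event times are plain numbers (seconds).

```python
import heapq
from itertools import count

class ReorderBuffer:
    """Emit events in event-time order once they are older than max_seen - max_delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_seen = float("-inf")
        self._heap = []
        self._tiebreak = count()  # keeps heap entries comparable when times are equal

    def push(self, event_time, payload):
        """Add one event; return the events that are now safe to emit, oldest first."""
        heapq.heappush(self._heap, (event_time, next(self._tiebreak), payload))
        self.max_seen = max(self.max_seen, event_time)
        ready = []
        while self._heap and self._heap[0][0] <= self.max_seen - self.max_delay:
            t, _, p = heapq.heappop(self._heap)
            ready.append((t, p))
        return ready

    def flush(self):
        """Emit whatever is still buffered, e.g. at end of stream."""
        remaining = [(t, p) for t, _, p in sorted(self._heap)]
        self._heap.clear()
        return remaining

buf = ReorderBuffer(max_delay=5)
emitted = []
for event_time, name in [(10, "a"), (12, "b"), (9, "c"), (16, "d"), (11, "e"), (20, "f")]:
    emitted.extend(buf.push(event_time, name))
emitted.extend(buf.flush())
print(emitted)
```

An event that is more than `max_delay` behind the newest one is still emitted immediately, but after newer events have already gone out, so the ordering guarantee only holds for disorder within the configured bound.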

#### Approach 3: Watermarks + Late Data Handling (Modern Approach)

```python
# Illustrative pseudocode, not the exact API of any specific framework
# Use watermarks to estimate completeness
# Allow late data to trigger corrections

stream \
  .window(TumblingEventTimeWindows.of(10_minutes)) \
  .withWatermark(lag=5_minutes) \
  .trigger(
    AfterWatermark()
      .withLateFirings(AfterCount(1))
  ) \
  .accumulating()

# Timeline for window [12:00-12:10]:
# 12:15 - Watermark reaches 12:10 → emit "final" result
# 12:18 - Late event from 12:09 arrives → emit correction
```

**Pros:** Handles long delays (up to the configured allowed lateness), results converge toward the correct value\
**Cons:** More complex
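
To make the mechanics concrete, here is a toy plain-Python version of the same idea: tumbling event-time windows, an on-time firing when the watermark passes the window end, and an accumulating correction for every late event. It is a sketch under assumptions, not how any framework implements it. `EventTimeWindowSum` is a hypothetical name, times are integer minutes since midnight, the watermark is "newest event time minus lag", and state is kept forever (no allowed-lateness cutoff).

```python
from collections import defaultdict

class EventTimeWindowSum:
    """Tumbling event-time windows with a watermark and accumulating late firings."""

    def __init__(self, window_size, watermark_lag):
        self.window_size = window_size
        self.watermark_lag = watermark_lag
        self.max_event_time = float("-inf")
        self.sums = defaultdict(int)  # window start -> running total
        self.fired = set()            # windows whose on-time result was emitted

    def on_event(self, event_time, value):
        """Process one event; return a list of (window_start, total, kind) emissions."""
        start = event_time - event_time % self.window_size
        self.sums[start] += value
        emissions = []
        if start in self.fired:
            # The window already fired: emit a correction with the new total.
            emissions.append((start, self.sums[start], "late"))
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.watermark_lag
        for s in sorted(self.sums):
            if s not in self.fired and s + self.window_size <= watermark:
                self.fired.add(s)
                emissions.append((s, self.sums[s], "on-time"))
        return emissions

agg = EventTimeWindowSum(window_size=10, watermark_lag=5)  # minutes
log = []
# (event time in minutes since midnight, value); 720 is 12:00
for event_time, value in [(721, 1), (728, 1), (735, 1), (729, 1)]:
    log.extend(agg.on_event(event_time, value))
print(log)
```

The event at minute 735 pushes the watermark to 730 and fires the 12:00 window with a total of 2; the late event at minute 729 then triggers a correction to 3.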

***

### Measuring Out-of-Order-ness

You can quantify how out-of-order your data is:

```python
# Event-time skew: how far behind the newest event time does each event arrive?
# Assumes event.event_time is a number of seconds (e.g. a Unix timestamp).

class SkewMonitor:
    def __init__(self):
        self.max_event_time_seen = None  # newest event time observed so far

    def observe(self, event):
        if self.max_event_time_seen is None:
            self.max_event_time_seen = event.event_time
        else:
            skew = self.max_event_time_seen - event.event_time
            if skew > 0:  # only events older than the newest one are reported
                print(f"Out of order by {skew} seconds")
            self.max_event_time_seen = max(
                self.max_event_time_seen,
                event.event_time
            )

# Output might show (in-order events print nothing):
# Out of order by 2 seconds
# Out of order by 45 seconds
# Out of order by 1800 seconds (30 minutes!)
```
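
Measured skews can also inform the choice of watermark lag. One possible heuristic, sketched below, is to pick the smallest lag that would have covered a chosen percentage of observed events (nearest-rank percentile). The function name, the sample data, and the percentile rule are all assumptions for illustration.

```python
def suggest_watermark_lag(skews_seconds, coverage_pct=99):
    """Smallest observed skew that covers at least coverage_pct percent of events."""
    if not skews_seconds:
        raise ValueError("need at least one observation")
    ordered = sorted(skews_seconds)
    # Integer ceiling of coverage_pct * n / 100 (nearest-rank percentile).
    rank = max(1, -(-coverage_pct * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical sample: most events in order, a few slightly late, one very late.
skews = [0] * 90 + [2] * 5 + [45] * 4 + [1800]

lag_95 = suggest_watermark_lag(skews, coverage_pct=95)
lag_99 = suggest_watermark_lag(skews, coverage_pct=99)
lag_100 = suggest_watermark_lag(skews, coverage_pct=100)
print(lag_95, lag_99, lag_100)
```

Covering the last percent of events here means raising the lag from 45 seconds to 30 minutes, which is the latency versus completeness trade-off in miniature.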

### Real-World Example: E-commerce Analytics

```python
# Scenario: Calculate sales per hour

# Reality:
# 14:00-15:00 → $10,000 in sales (100 orders)

# Problem: Orders arrive out of order
# - 90 orders arrive within 5 minutes (on time)
# - 5 orders arrive 10 minutes late (mobile checkout completed offline)
# - 3 orders arrive 2 hours late (payment gateway retry)
# - 2 orders never arrive (lost in network)

# Processing-time approach:
# The 14:00-15:00 window is emitted once (around 15:05) and never revised
# Count: 90 orders, $9,000 ❌ (missing $1,000; late orders land in later hours)

# Event-time with watermarks (5 min watermark lag, late firings enabled):
# 15:05: window fires, initial result 90 orders, $9,000
# 15:10: +5 orders, updated to $9,500
# 17:00: +3 orders, updated to $9,800 (closer to truth)
# Final: 98 orders, $9,800 (still missing the 2 lost orders, but much better)
```
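
The sequence of revisions is a running sum over the batches of arriving orders. A small sketch of that arithmetic, assuming every order is worth $100 (the average implied by $10,000 over 100 orders):

```python
ORDER_VALUE = 100   # assumption: $10,000 / 100 orders
TRUE_TOTAL = 10_000

# (time the result is emitted, number of orders that arrived since the last emission)
batches = [("15:05", 90), ("15:10", 5), ("17:00", 3)]

orders_so_far = 0
revisions = []
for emitted_at, new_orders in batches:
    orders_so_far += new_orders
    revisions.append((emitted_at, orders_so_far, orders_so_far * ORDER_VALUE))

shortfall = TRUE_TOTAL - revisions[-1][2]  # value of the orders that never arrived

for emitted_at, orders, revenue in revisions:
    print(f"{emitted_at}: {orders} orders, ${revenue:,}")
print(f"still missing: ${shortfall:,}")
```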

***

### Takeaways

Out-of-order data is **inevitable** in distributed systems. The question isn't "how do I prevent it?" (you can't), but rather:

1. **Do I care about event time?**
   * Yes → Must handle out-of-order data (watermarks, buffering)
   * No → Can use processing-time (simpler but incorrect for temporal analysis)
2. **How much disorder do I expect?**
   * Milliseconds → Simple buffering might suffice
   * Hours → Need watermarks with aggressive late-data handling
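3. **What's my tolerance for late data?**
   * Low → Tight watermarks and short allowed lateness: few corrections, but events that are too late get dropped
   * High → Loose watermarks and long allowed lateness: many corrections, plus more state to keep around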

Out-of-order data is why modern streaming frameworks like Flink, Beam, and Kafka Streams have such sophisticated time-handling mechanisms. It is one of the core challenges that make stream processing hard.

***
