# Late-Arriving Facts

### Late-Arriving Facts

This is a critical problem in modern data engineering, especially with mobile apps and unstable networks.

#### The Problem

```
SCENARIO: Mobile app events

Event occurs:     User makes purchase on phone
Event time:       2024-01-15 10:30:00

Network issues... phone offline...

Event arrives:    2024-01-17 08:15:00
Processing time:  2024-01-17 08:15:00

Problem: Event is 2 days late!
Where should it go?
```

#### The Naive Approach (Bad)

```
PARTITION BY EVENT TIME (problems):

┌────────────────────────────────────┐
│ Partition: 2024-01-15              │
│ ├─ Events from 2024-01-15          │
│ └─ But late events keep arriving!  │
│                                    │
│ Day 1: Close partition with 1000   │
│        events                      │
│ Day 2: Oops, 50 more late events!  │
│ Day 3: Oops, 10 more!              │
│ Day 7: Still getting stragglers!   │
└────────────────────────────────────┘

Issues:
✗ Can't close partitions (not immutable!)
✗ Can't compress/optimize files
✗ Must keep partitions mutable
✗ Violates functional principles
✗ When is it "safe" to close?
```

#### Maxime's Solution: Dual Time Dimensions

```
TWO TIME DIMENSIONS:

1. Event Time (when event occurred)
2. Event Processing Time (when received/processed)

┌────────────────────────────────────┐
│ Partition by: PROCESSING TIME      │
│ Column: EVENT TIME                 │
└────────────────────────────────────┘
```

#### Implementation

```
TABLE SCHEMA:

CREATE TABLE fact_purchases (
  purchase_id,
  customer_id,
  amount,
  event_time TIMESTAMP,      -- When purchase happened
  processing_time TIMESTAMP,  -- When we received it
  partition_date DATE         -- Based on processing_time
)
PARTITIONED BY (partition_date)


DATA FLOW:

Event occurs: 2024-01-15 10:30:00
Event arrives: 2024-01-17 08:15:00

INSERT INTO fact_purchases 
PARTITION(partition_date='2024-01-17')  ← Processing date!
VALUES (
  '12345',
  'customer_abc',
  99.99,
  '2024-01-15 10:30:00',  -- Event time
  '2024-01-17 08:15:00',  -- Processing time
  '2024-01-17'            -- Partition key
)
```

#### Benefits of This Approach

```
IMMUTABILITY PRESERVED:

┌────────────────────────────────────┐
│ Partition: 2024-01-17              │
│ ├─ All events processed on this day│
│ └─ Can close immediately!          │
│                                    │
│ At end of day:                     │
│ ✓ Compress to Parquet              │
│ ✓ Add indexes                      │
│ ✓ Mark immutable                   │
│ ✓ Never touch again                │
└────────────────────────────────────┘


TIME MACHINE CAPABILITY:

From the talk: "This effectively provides a 
time machine that allows you to understand 
what reality looked like at any point in time."

Query examples:
┌────────────────────────────────────┐
│ 1. "Total sales in Feb as of today"│
│    WHERE event_time BETWEEN        │
│      '2024-02-01' AND '2024-02-29' │
│                                    │
│ 2. "Total sales in Feb as of Mar 1"│
│    WHERE event_time BETWEEN        │
│      '2024-02-01' AND '2024-02-29' │
│    AND processing_time < '2024-03-01'│
│                                    │
│ 3. "How much Feb sales adjusted    │
│     since Mar 1?"                  │
│    Compare queries 1 and 2         │
└────────────────────────────────────┘
```

#### The Trade-Off: Partition Pruning

```
THE PROBLEM:

Query: Give me sales for Jan 15, 2024
WHERE event_time = '2024-01-15'

Partition layout (by processing_time):
├─ 2024-01-15/ (events processed on 15th)
├─ 2024-01-16/ (events processed on 16th) 
├─ 2024-01-17/ (events processed on 17th) ← Late events!
├─ 2024-01-18/ (events processed on 18th) ← More late!
└─ ... (stragglers for weeks)

Database optimizer:
"I need to scan ALL partitions because
 events with event_time='2024-01-15' 
 could be in ANY processing_time partition!"

This is called "losing partition pruning"
```

#### Maxime's Mitigations

From the talk, he provides several strategies:

```
MITIGATION 1: Execution Engine Optimization

Modern columnar formats (Parquet, ORC):
┌────────────────────────────────────┐
│ Parquet file structure:            │
│ ┌──────────────────────────────┐  │
│ │ File Header/Footer           │  │
│ │ - Min/Max values per column  │  │
│ │ - Row count                  │  │
│ │ - Schema                     │  │
│ └──────────────────────────────┘  │
│ ┌──────────────────────────────┐  │
│ │ Row Group 1                  │  │
│ │ Row Group 2                  │  │
│ │ ...                          │  │
│ └──────────────────────────────┘  │
└────────────────────────────────────┘

Query: WHERE event_time = '2024-01-15'

Engine process:
1. Scan partition 2024-01-17/file1.parquet
2. Read footer: "min_event_time=2024-01-17,
                 max_event_time=2024-01-17"
3. Skip entire file!
4. Move to next file

Result: "The damage is not that bad"
- Scanning more partitions
- But skipping most files quickly
```

```
MITIGATION 2: Double Predicate Pattern

Train users to query like this:

-- Bad (scans all partitions):
SELECT *
FROM fact_purchases
WHERE event_time = '2024-01-15'

-- Good (limits partition scan):
SELECT *
FROM fact_purchases
WHERE event_time = '2024-01-15'
  AND processing_time BETWEEN 
      '2024-01-15' AND '2024-01-22'  ← Add window!

Logic: "Most late events arrive within a week,
        so search processing_time window of 
        event_time ± 7 days"

Trade-off: Might miss very late stragglers
```

```
MITIGATION 3: Sub-Partitioning

Partition by BOTH dimensions:

/warehouse/fact_purchases/
  processing_date=2024-01-15/
    event_date=2024-01-15/
    event_date=2024-01-14/  ← Yesterday's late arrivals
    event_date=2024-01-13/  ← 2-day late arrivals
  processing_date=2024-01-16/
    event_date=2024-01-16/
    event_date=2024-01-15/  ← Late arrivals
    event_date=2024-01-14/

Benefits:
✓ Immutability (processing_date closes)
✓ Better pruning (event_date helps)

Drawbacks:
✗ More files (partition explosion)
✗ More complexity
✗ From talk: "Raises the complexity of the model"
```

```
MITIGATION 4: Dual Table Strategy

From the talk: "Read-optimized stores"

Staging (by processing_time):
└─ Immutable landing area
   Closes immediately
   Never touched

Analytics (by event_time):
└─ Optimized for queries
   Rebuilt periodically
   Repartitioned for performance

ETL Flow:
┌────────────────────────────────────┐
│ Raw events                         │
│   ↓ (land by processing_time)      │
│ Staging table (immutable)          │
│   ↓ (nightly rebuild)              │
│ Analytics table (by event_time)    │
│   └─ Rebuilt for last N days       │
└────────────────────────────────────┘

Example:
-- Staging (immutable, by processing_time)
fact_purchases_staging/
  partition_date=2024-01-15/
  partition_date=2024-01-16/
  ...

-- Analytics (query-optimized, by event_time)  
fact_purchases/
  event_date=2024-01-15/  ← Rebuilt nightly
  event_date=2024-01-16/  ← Rebuilt nightly
  event_date=2024-01-17/  ← Current
  ...

Nightly job:
INSERT OVERWRITE fact_purchases
PARTITION(event_date='2024-01-15')
SELECT * FROM fact_purchases_staging
WHERE event_time BETWEEN 
  '2024-01-15 00:00:00' AND '2024-01-15 23:59:59'
  AND processing_time < CURRENT_DATE()  -- Stable window
```

#### Complete Example

```
SCENARIO: Mobile game events

Day 1 (2024-01-15):
┌────────────────────────────────────┐
│ Events occur throughout the day    │
│ Most arrive same day               │
│ Some players offline               │
└────────────────────────────────────┘

Staging table (by processing_time):
fact_game_events/
  processing_date=2024-01-15/
    ├─ event1: event_time=2024-01-15 10:00
    ├─ event2: event_time=2024-01-15 11:00
    └─ event3: event_time=2024-01-15 12:00
    (1,000 events processed)

Day 2 (2024-01-16):
┌────────────────────────────────────┐
│ Late events from yesterday arrive  │
│ + Today's events                   │
└────────────────────────────────────┘

Staging table:
fact_game_events/
  processing_date=2024-01-15/  ← Closed! Immutable!
  processing_date=2024-01-16/
    ├─ event1000: event_time=2024-01-16 09:00
    ├─ event1001: event_time=2024-01-15 23:50  ← Late!
    └─ event1002: event_time=2024-01-15 22:30  ← Late!
    (950 today + 50 late from yesterday)


QUERIES:

-- Q1: Events from Jan 15 as we know today
SELECT COUNT(*)
FROM fact_game_events
WHERE event_time BETWEEN 
  '2024-01-15' AND '2024-01-16'
-- Result: 1,050 (1000 + 50 late arrivals)

-- Q2: Events from Jan 15 as we knew on Jan 15
SELECT COUNT(*)
FROM fact_game_events
WHERE event_time BETWEEN 
  '2024-01-15' AND '2024-01-16'
  AND processing_time < '2024-01-16'
-- Result: 1,000 (only what arrived on time)

-- Q3: How many late arrivals for Jan 15?
SELECT COUNT(*)
FROM fact_game_events
WHERE event_time BETWEEN 
  '2024-01-15' AND '2024-01-16'
  AND processing_time >= '2024-01-16'
-- Result: 50 (late arrivals)
```

#### Key Insight

Maxime's core message: **Don't fight late arrivals, embrace them!**

```
┌────────────────────────────────────────┐
│ OLD THINKING:                          │
│ "We need to wait for late events       │
│  before closing partitions"            │
│ → Delays, mutable data, complexity     │
│                                        │
│ NEW THINKING:                          │
│ "Land events when they arrive,         │
│  close partitions immediately,         │
│  query intelligently"                  │
│ → Immutability, simplicity, time travel│
└────────────────────────────────────────┘
```
