Self/Past dependencies

Self/Past Dependencies - Explained in Depth

This is one of Maxime's most important warnings. Here is a breakdown:

What Are Self/Past Dependencies?

A self (or past) dependency exists when computing a partition requires a previous partition of the same table.

BAD PATTERN (Self-dependency):

Day 1: dim_user (empty) → load → dim_user_day1
Day 2: dim_user_day1 → changes → dim_user_day2
Day 3: dim_user_day2 → changes → dim_user_day3
Day 4: dim_user_day3 → changes → dim_user_day4
...
Day 1000: dim_user_day999 → changes → dim_user_day1000

Each day depends on the previous day!

The Problem: Complexity Score

Maxime introduces a "complexity score": the number of partitions needed to compute a given partition.

COMPLEXITY SCORE EXPLOSION:

To compute dim_user_day1000:
┌─────────────────────────────────────┐
│ Need: dim_user_day999               │
│ Which needs: dim_user_day998        │
│ Which needs: dim_user_day997        │
│ ...                                 │
│ Which needs: dim_user_day1          │
│                                     │
│ Complexity Score: 1000!             │
└─────────────────────────────────────┘

vs.

GOOD PATTERN (No self-dependency):

To compute dim_user_day1000:
┌─────────────────────────────────────┐
│ Need: raw_user_data_day1000         │
│ Need: reference_data (maybe)        │
│                                     │
│ Complexity Score: 2!                │
└─────────────────────────────────────┘
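
The two scores above can be reproduced with a small sketch. It assumes the complexity score is the number of distinct upstream partitions reachable from the target. The `complexity_score` helper, the `initial_load` source, and the dependency maps are illustrative names, not something taken from Maxime's material.

```python
def complexity_score(target, deps):
    """Count the distinct upstream partitions needed to compute `target`.

    `deps` maps a partition name to the partitions it reads directly.
    """
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in deps.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return len(seen)


# Self-dependent chain: day N reads day N-1 of the same table.
# Daily change feeds are left out to mirror the diagram; including
# them would roughly double the count.
chained = {"dim_user_day1": ["initial_load"]}
for d in range(2, 1001):
    chained[f"dim_user_day{d}"] = [f"dim_user_day{d - 1}"]

# Independent pattern: day N reads only that day's raw data and a reference table.
independent = {
    f"dim_user_day{d}": [f"raw_user_data_day{d}", "reference_data"]
    for d in range(1, 1001)
}

chained_score = complexity_score("dim_user_day1000", chained)
independent_score = complexity_score("dim_user_day1000", independent)
print(chained_score, independent_score)
```

In this model the chained score grows by one for every extra day of history, while the independent score stays at 2 no matter how many days exist.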

Concrete Example
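
As one possible illustration, the sketch below builds a `dim_user` table that carries a cumulative `lifetime_logins` count by reading yesterday's partition of itself. The data, the column, and the function name are hypothetical; the point is only to show the shape of a self-dependency.

```python
from collections import Counter

# Hypothetical raw login events, partitioned by day: {day: [user_id, ...]}
raw_logins = {
    1: ["alice", "bob"],
    2: ["alice"],
    3: ["alice", "carol"],
}


def build_dim_user_self_dependent(day, dim_user_partitions):
    """Build lifetime login counts for `day` from the previous day's partition."""
    previous = dim_user_partitions.get(day - 1, {})  # self/past dependency
    current = Counter(previous)
    current.update(raw_logins[day])
    return dict(current)


# The partitions must be built strictly in order, one after another.
dim_user = {}
for day in sorted(raw_logins):
    dim_user[day] = build_dim_user_self_dependent(day, dim_user)

# If earlier partitions are missing, day 3 is still produced but is wrong.
broken_day3 = build_dim_user_self_dependent(3, {})

print(dim_user[3])
print(broken_day3)
```

Day 3 is only correct if days 1 and 2 were built correctly first; in this sketch, any error or gap in an earlier partition carries forward into every later one.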

Visual Representation of the Problem

Why This Matters for Backfills
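
With a self-dependent chain, fixing one bad day means rebuilding that day and every later day, in order. When each partition depends only on its own day's source data, a backfill can rebuild any subset of days in parallel. The following is a minimal sketch of the second case; `raw_user_data`, `build_partition`, and `backfill` are made-up names, and a thread pool stands in for whatever scheduler actually runs the jobs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source data, one partition per day.
raw_user_data = {
    day: [f"user_{day}_{i}" for i in range(3)] for day in range(1, 11)
}


def build_partition(day):
    """Pure function of one day's source data; reads no earlier dim_user partition."""
    return day, sorted(raw_user_data[day])


def backfill(days, max_workers=4):
    """Rebuild the requested days concurrently; order does not matter."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(build_partition, days))


rebuilt_all = backfill(range(1, 11))
rebuilt_day7 = backfill([7])  # a single day can be repaired in isolation
```

Rebuilding day 7 alone gives the same result as rebuilding it as part of the full range, which is what makes targeted and parallel backfills safe here.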

The Common Culprit: Cumulative Metrics

Solutions

Solution 1: Recompute from scratch each time
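
A minimal sketch of this idea, assuming the cumulative value is a lifetime login count and that raw events are kept per day: the partition for a given day is rebuilt by scanning the raw partitions up to that day rather than by reading yesterday's output. `raw_logins` and `build_dim_user_from_scratch` are hypothetical names.

```python
from collections import Counter

# Hypothetical raw login events, partitioned by day: {day: [user_id, ...]}
raw_logins = {
    1: ["alice", "bob"],
    2: ["alice"],
    3: ["alice", "carol"],
}


def build_dim_user_from_scratch(day):
    """Rebuild lifetime login counts for `day` from raw partitions 1..day."""
    totals = Counter()
    for d in range(1, day + 1):
        totals.update(raw_logins.get(d, []))
    return dict(totals)


# Any day can be built directly, in any order, without the others existing.
day3 = build_dim_user_from_scratch(3)
day1 = build_dim_user_from_scratch(1)
print(day3)
```

The trade-off in this sketch is a wider scan of source data on every run, in exchange for never reading a previous version of the table being built.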

Solution 2: Don't put cumulative metrics in dimensions
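
One way to read this: keep `dim_user` as a plain daily snapshot of attributes and derive the cumulative number from a fact table when it is needed. The sketch below uses an in-memory SQLite database; the schema, the `ds` partition column, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (ds TEXT, user_id TEXT, country TEXT);
CREATE TABLE fact_login (ds TEXT, user_id TEXT);
INSERT INTO dim_user VALUES
  ('2024-01-03', 'alice', 'FR'),
  ('2024-01-03', 'bob', 'US');
INSERT INTO fact_login VALUES
  ('2024-01-01', 'alice'), ('2024-01-01', 'bob'),
  ('2024-01-02', 'alice'), ('2024-01-03', 'alice');
""")

# The dimension holds attributes only; lifetime_logins is computed at query
# time by aggregating the fact table up to the snapshot date.
rows = conn.execute("""
SELECT d.user_id, d.country, COUNT(f.user_id) AS lifetime_logins
FROM dim_user AS d
LEFT JOIN fact_login AS f
  ON f.user_id = d.user_id AND f.ds <= d.ds
WHERE d.ds = '2024-01-03'
GROUP BY d.user_id, d.country
ORDER BY d.user_id
""").fetchall()

print(rows)
```

Because the dimension partition no longer stores a running total, building it for one day does not require the previous day's partition.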

Solution 3: Use a specialized cumulative framework

Real-World Impact

Key Takeaway

The essence: each partition should be independently computable from source data, not dependent on previous versions of itself. This keeps the complexity score constant regardless of how much history you have, and it enables parallelization and easy backfills.
