# Slowly Changing Dimensions (SCD)

### Slowly Changing Dimensions (SCD)

Maxime is quite passionate about this - he calls traditional SCDs "absolute nonsense"!

#### The Traditional Approaches (That He Hates)

**Type 1: Overwrite**

```
EXAMPLE: Supplier moves from Illinois to California

Before:
┌──────────┬──────────┬─────────┐
│SupplierID│   Name   │  State  │
├──────────┼──────────┼─────────┤
│   ABC    │   Acme   │   IL    │
└──────────┴──────────┴─────────┘

After UPDATE:
┌──────────┬──────────┬─────────┐
│SupplierID│   Name   │  State  │
├──────────┼──────────┼─────────┤
│   ABC    │   Acme   │   CA    │ ← Changed!
└──────────┴──────────┴─────────┘

Problems from Maxime:
✗ "Full of mutation"
✗ "You lose history"
✗ "Same query today and yesterday = different results"
✗ "If you're doing your taxes, that's probably not gonna fly"
✗ No reproducibility
```

**Type 2: Add Rows with Surrogate Keys**

```
EXAMPLE: Track supplier state changes over time

┌─────────┬──────────┬──────┬───────┬────────────┬──────────┐
│Surrogate│SupplierID│ Name │ State │EffectiveDate│ EndDate │
├─────────┼──────────┼──────┼───────┼────────────┼──────────┤
│   123   │   ABC    │ Acme │  IL   │ 2020-01-01 │2023-06-15│
│   124   │   ABC    │ Acme │  CA   │ 2023-06-16 │9999-12-31│
└─────────┴──────────┴──────┴───────┴────────────┴──────────┘

Loading Facts (the nightmare):
┌─────────────────────────────────────┐
│ For each fact record:               │
│ 1. Look at transaction date         │
│ 2. Find dimension record where      │
│    - Natural key matches            │
│    - EffectiveDate <= trans_date    │
│    - EndDate >= trans_date          │
│ 3. Get surrogate key (123 or 124?)  │
│ 4. Store surrogate key in fact      │
│                                     │
│ This is called "surrogate key       │
│ lookup" - it's expensive!           │
└─────────────────────────────────────┘

Maxime's problems:
✗ "Effectively super hard to manage"
✗ "Makes loading dimensions harder"
✗ "Makes loading facts harder"
✗ "Full of mutations"
✗ "Surrogate key lookups are complex and expensive"
✗ "Surrogate keys are unreadable"
✗ "Not guaranteed to fall back on your feet if you wipe and reload"
✗ Not reproducible
```

**Type 3: Add Columns**

```
EXAMPLE: Track current and previous state

┌──────────┬──────┬─────────────┬──────────────┐
│SupplierID│ Name │CurrentState │ PreviousState│
├──────────┼──────┼─────────────┼──────────────┤
│   ABC    │ Acme │     CA      │      IL      │
└──────────┴──────┴─────────────┴──────────────┘

Maxime's assessment:
✗ "Kind of a half-ass approach"
✗ "Bad compromise"
✗ Limited history (only 2 states here)
✗ Requires schema changes for more history
```

#### Maxime's Solution: Dimension Snapshots

```
FUNCTIONAL APPROACH: Full snapshots every day

Day 2024-01-01:
┌──────────┬──────┬───────┬───────────────┐
│SupplierID│ Name │ State │ partition_date│
├──────────┼──────┼───────┼───────────────┤
│   ABC    │ Acme │  IL   │  2024-01-01   │
│   XYZ    │ Corp │  NY   │  2024-01-01   │
│   DEF    │ Inc  │  TX   │  2024-01-01   │
└──────────┴──────┴───────┴───────────────┘

Day 2024-01-02:
┌──────────┬──────┬───────┬───────────────┐
│SupplierID│ Name │ State │ partition_date│
├──────────┼──────┼───────┼───────────────┤
│   ABC    │ Acme │  IL   │  2024-01-02   │ ← Same
│   XYZ    │ Corp │  NY   │  2024-01-02   │ ← Same
│   DEF    │ Inc  │  TX   │  2024-01-02   │ ← Same
└──────────┴──────┴───────┴───────────────┘

Day 2024-01-03:
┌──────────┬──────┬───────┬───────────────┐
│SupplierID│ Name │ State │ partition_date│
├──────────┼──────┼───────┼───────────────┤
│   ABC    │ Acme │  CA   │  2024-01-03   │ ← CHANGED!
│   XYZ    │ Corp │  NY   │  2024-01-03   │
│   DEF    │ Inc  │  TX   │  2024-01-03   │
└──────────┴──────┴───────┴───────────────┘

Each partition = complete dimension as-of that date
```

#### Addressing the "But That's Wasteful!" Objection

Maxime anticipates this and provides counter-arguments:

```
OBJECTION: "You're duplicating data every day!"

HIS RESPONSES:

1. Storage is Cheap
┌────────────────────────────────────┐
│ "Storage is cheap, compute is      │
│ cheap, there's virtually no limit" │
│                                    │
│ Modern cloud storage: $0.023/GB/mo│
│ 1000 rows x 365 days = pennies    │
└────────────────────────────────────┘

2. Dimensions are Small
┌────────────────────────────────────┐
│ "Dimension data in relationship to │
│ facts is usually very small"       │
│                                    │
│ Facts: Billions of rows            │
│ Dimensions: Thousands-millions     │
│                                    │
│ Even Facebook's user dimension     │
│ (billions) × 365 days is manageable│
└────────────────────────────────────┘

3. Engineering Time is Expensive
┌────────────────────────────────────┐
│ "Storage is cheap, engineering     │
│ time is expensive"                 │
│                                    │
│ Cost of managing Type 2 SCD:       │
│ - Complex ETL logic                │
│ - Surrogate key management         │
│ - Debugging issues                 │
│ - Training new engineers           │
│                                    │
│ >>> Cost of extra storage          │
└────────────────────────────────────┘

4. Mental Model Simplicity
┌────────────────────────────────────┐
│ "This mental model is a lot easier │
│ to reason about"                   │
│                                    │
│ New engineer joins:                │
│ - Snapshots: "Each day is a photo"│
│ - Type 2: "Well, you see..."      │
└────────────────────────────────────┘

5. Reproducibility is Invaluable
┌────────────────────────────────────┐
│ "Good reproducibility which is     │
│ invaluable"                        │
│                                    │
│ Can recompute any day from scratch │
│ Can trace exact state at any time  │
└────────────────────────────────────┘
```

#### How to Query Snapshot Dimensions

From the talk, Maxime shows simple patterns:

```
PATTERN 1: Join to latest dimension
(Most common - current attributes)

SELECT 
  f.order_id,
  f.amount,
  d.supplier_name,
  d.state
FROM facts f
JOIN dim_supplier d
  ON f.supplier_id = d.supplier_id
  AND d.partition_date = '2024-01-15'  -- Latest
WHERE f.partition_date = '2024-01-15'

Or with a view:
CREATE VIEW dim_supplier_current AS
SELECT *
FROM dim_supplier
WHERE partition_date = (SELECT MAX(partition_date) 
                        FROM dim_supplier)

Then just:
JOIN dim_supplier_current d
  ON f.supplier_id = d.supplier_id


PATTERN 2: Join to dimension as-of transaction time
(For historical accuracy - "attribute at time of event")

SELECT 
  f.order_id,
  f.amount,
  d.supplier_name,
  d.state
FROM facts f
JOIN dim_supplier d
  ON f.supplier_id = d.supplier_id
  AND f.partition_date = d.partition_date  -- Match dates!
WHERE f.partition_date = '2024-01-15'

This gives you: "What was the supplier's state 
                 when this order was placed?"


PATTERN 3: Time series on dimensions
(New capability!)

-- How many suppliers per state over time?
SELECT 
  partition_date,
  state,
  COUNT(*) as supplier_count
FROM dim_supplier
GROUP BY partition_date, state
ORDER BY partition_date, state

-- Track changes to specific supplier
SELECT 
  partition_date,
  supplier_name,
  state,
  LAG(state) OVER (ORDER BY partition_date) as prev_state
FROM dim_supplier
WHERE supplier_id = 'ABC'
```

#### When to Break the Rule

From the Q\&A, Maxime acknowledges exceptions:

```
FOR VERY LARGE DIMENSIONS:

If you work at Facebook with billions of users:
┌────────────────────────────────────┐
│ Options:                           │
│                                    │
│ 1. Mix with Type 2                 │
│    └─ Use Type 2 for large dims   │
│       Snapshots for small dims     │
│                                    │
│ 2. Vertical Partitioning           │
│    └─ Keep dimension thin          │
│       Move metrics to facts        │
│                                    │
│ 3. Lower Retention                 │
│    └─ Keep only recent snapshots   │
│       Archive older ones           │
│                                    │
│ 4. Denormalize into Facts          │
│    └─ Store attributes in facts    │
│       when "at-time" matters       │
└────────────────────────────────────┘
```

#### Modern Alternative: Nested/Complex Types

From Maxime's article (not in talk transcript):

```
Use complex data types for history tracking:

┌──────────┬──────┬────────────────────────────┐
│SupplierID│ Name │      state_history         │
├──────────┼──────┼────────────────────────────┤
│   ABC    │ Acme │ {'2020-01-01': 'IL',      │
│          │      │  '2023-06-16': 'CA'}      │
└──────────┴──────┴────────────────────────────┘

Benefits:
✓ History in one row
✓ No grain change (unlike Type 2)
✓ Dynamic (no schema changes)
✓ Query with map access: state_history['2023-01-01']
```
