# Functional Data Engineering

Talk from Maxime Beauchemin about this:

* <https://www.youtube.com/watch?v=4Spo2QRTz1k&t=504s>

His article about the topic:

* <https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>

Blueprint for functional data engineering: <https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint>

***

### Insights from the talk and the article

#### **Core Definitions**

**Idempotence** - The property of an operation whereby applying it multiple times does not change the result beyond the initial application

His glass of water analogy is brilliant:

```
NOT Idempotent:
"Add a little water" → Run 5 times → Spillage!

Idempotent:
"Fill the glass with water" → Run N times → Same result
```
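The same contrast can be sketched in a few lines of Python, using an in-memory dict as a stand-in for a warehouse (all names here are illustrative, not from the talk):

```python
warehouse = {}

def append_rows(partition, rows):
    # NOT idempotent: each rerun adds more rows ("spillage")
    warehouse.setdefault(partition, []).extend(rows)

def overwrite_partition(partition, rows):
    # Idempotent: N reruns leave the exact same state
    warehouse[partition] = list(rows)

for _ in range(5):
    append_rows("ds=append", [{"id": 1}])
    overwrite_partition("ds=overwrite", [{"id": 1}])

assert len(warehouse["ds=append"]) == 5     # spillage after 5 runs
assert len(warehouse["ds=overwrite"]) == 1  # same result every run
```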

**Immutable Partitions** - Partitions are the building blocks of your data warehouse, and you should think of data lineage as a graph of partitions rather than just tables

#### **Pure ETL Tasks**

Pure ETL tasks are idempotent and deterministic (given the same source partitions, they produce the same target partition), have no side effects, read only immutable sources, and usually target a single partition. In practice this means never doing UPDATE, UPSERT, APPEND, or DELETE - only INSERT or INSERT OVERWRITE
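A minimal sketch of what such a "pure" task looks like, with an in-memory store standing in for the warehouse (table names and helper functions are made up for illustration):

```python
# Immutable source partitions, keyed by (table, schedule date)
store = {
    ("raw.orders", "2024-01-01"): [{"order_id": 1, "user_id": 7}],
    ("raw.users", "2024-01-01"): [{"user_id": 7, "country": "DE"}],
}

def read_partition(table, ds):
    return store[(table, ds)]          # reads never mutate

def overwrite_partition(table, ds, rows):
    store[(table, ds)] = list(rows)    # never UPDATE/UPSERT/APPEND

def pure_task(ds):
    # Deterministic transform: no randomness, no wall-clock reads
    users = {u["user_id"]: u for u in read_partition("raw.users", ds)}
    enriched = [
        {**o, "country": users[o["user_id"]]["country"]}
        for o in read_partition("raw.orders", ds)
    ]
    # The ONLY write: overwrite exactly one target partition
    overwrite_partition("fact.orders_enriched", ds, enriched)

pure_task("2024-01-01")
pure_task("2024-01-01")  # rerun/backfill: identical state, no duplicates
```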

#### **Slowly Changing Dimensions - The Rant**

This is one of the most passionate parts! He systematically demolishes traditional SCD approaches:

```
TYPE 1 (Override):
✗ Full of mutations
✗ Lose history
✗ Same query today ≠ yesterday

TYPE 2 (Add rows with surrogate keys):
✗ Super hard to manage
✗ Complex surrogate key lookups
✗ Makes loading facts harder
✗ Full of mutations

TYPE 3 (Add columns):
✗ "Kind of a half-ass approach"
✗ Bad compromise

HIS SOLUTION: Snapshot everything daily
✓ Storage is cheap
✓ Engineering time is expensive
✓ Mental model is simple
✓ Reproducibility is invaluable
```

His advice: "If you ever hear about slowly changing dimension again you can just say all that stuff is absolute nonsense"
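The snapshot approach is simple enough to sketch directly: write the *full* dimension into a date-keyed partition every day, and history becomes a lookup rather than a mutation (a toy sketch, not his actual implementation):

```python
snapshots = {}

def snapshot_dimension(ds, rows):
    snapshots[ds] = list(rows)   # full copy daily: storage is cheap

def dimension_as_of(ds):
    return snapshots[ds]         # reproducible point-in-time view

snapshot_dimension("2024-01-01", [{"user_id": 7, "plan": "free"}])
snapshot_dimension("2024-01-02", [{"user_id": 7, "plan": "pro"}])

# "Same query today = same query yesterday": just pick the partition
assert dimension_as_of("2024-01-01")[0]["plan"] == "free"
assert dimension_as_of("2024-01-02")[0]["plan"] == "pro"
```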

#### **Late-Arriving Facts**

You need two time dimensions: event time and event processing time. Partition on event processing time so you can close partitions quickly and land immutable blocks

The trade-off:

```
Partition on processing time:
✓ Can close partitions immediately
✓ Immutable staging area
✗ Lose partition pruning on event time

Mitigations:
- Execution engine optimizations (Parquet footers)
- Sub-partition by event time
- Repartition later for query-optimized tables
```
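The trade-off can be sketched in Python: land events in immutable partitions keyed by *processing* date (keeping event time as a column), then repartition later for a query-optimized table (all structures here are illustrative):

```python
staging = {}  # processing_date -> rows; closed and immutable once landed

def land(processing_date, events):
    staging[processing_date] = list(events)

land("2024-01-02", [
    {"event_time": "2024-01-01", "v": 1},   # late-arriving fact
    {"event_time": "2024-01-02", "v": 2},
])

# Later batch step: repartition by event time for query-optimized tables,
# restoring partition pruning on the dimension analysts actually filter on
by_event_date = {}
for rows in staging.values():
    for e in rows:
        by_event_date.setdefault(e["event_time"], []).append(e)

assert len(by_event_date["2024-01-01"]) == 1  # late fact now queryable
```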

#### **Self/Past Dependencies - Avoid!**

Self or past dependencies should be avoided. Loading a dimension by taking yesterday's partition and applying today's changes leads to a high "complexity score": computing the current partition requires a long chain of historical partitions, which prevents parallelization and forces backfills to run sequentially

His "complexity score" concept:

```
Good: Partition depends on 5 source partitions
Bad: Partition depends on 1000+ sequential partitions
```
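The score itself is just a transitive-dependency count, which is easy to sketch (my own toy formulation of his concept, with made-up partition names):

```python
def complexity_score(partition, deps):
    """Count how many upstream partitions are transitively required."""
    seen, stack = set(), [partition]
    while stack:
        for parent in deps.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return len(seen)

# Good: a partition depends on a handful of source partitions
good = {"dim/d3": ["src_a/d3", "src_b/d3"]}

# Bad: self-dependency chains back through every prior day
bad = {f"dim/d{i}": [f"dim/d{i-1}", f"src/d{i}"] for i in range(1, 1001)}

assert complexity_score("dim/d3", good) == 2
assert complexity_score("dim/d1000", bad) == 2000  # sequential backfill pain
```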

#### **File Explosion Problem**

Partitioning everything leads to file explosion in HDFS/S3. Mitigations include being careful with sub-partitioning, avoiding very short schedule intervals (5-minute partitions create too many files), and compacting earlier partitions together
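The compaction mitigation amounts to merging many small per-interval files into one larger file per coarser partition; a toy sketch (paths and layout are illustrative):

```python
# Many small files from 5-minute intervals, grouped under daily partitions
small_files = {
    "events/2024-01-01/00-05.json": [{"v": 1}],
    "events/2024-01-01/00-10.json": [{"v": 2}],
    "events/2024-01-02/00-05.json": [{"v": 3}],
}

# Compact: one file per day instead of one per interval
compacted = {}
for path, rows in sorted(small_files.items()):
    day = path.split("/")[1]
    compacted.setdefault(f"events/{day}/compacted.json", []).extend(rows)

assert len(compacted) == 2  # file count drops from 3 to 2
```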

#### **The Big Picture: Times Have Changed**

The landscape has completely changed since original data warehousing books were written - we now have cheap limitless storage, distributed databases with virtually infinite compute, read-optimized stores with immutable file formats, and everyone (not just small specialized teams) participates in data warehousing

#### **His Philosophy**

"First learn the rules and then break them" - it's good to know the methodology but you should make your own decisions based on your environment

### Interesting Q\&A Insights

**On GDPR/Right to be Forgotten:** Maxime suggests having an anonymization framework with metadata on tables indicating they contain non-anonymous data, with a background daemon that encrypts/hashes fields and moves immutable partitions into anonymized equivalents
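The core move in that suggestion - rewriting an immutable partition into an anonymized equivalent by hashing PII fields - could look roughly like this (a hedged sketch of the idea, not his framework; field names are made up):

```python
import hashlib

def anonymize_partition(rows, pii_fields):
    """Rewrite a partition with PII fields replaced by stable hashes."""
    def hash_value(v):
        return hashlib.sha256(str(v).encode()).hexdigest()[:12]
    return [
        {k: (hash_value(v) if k in pii_fields else v) for k, v in row.items()}
        for row in rows
    ]

raw = [{"user_id": 7, "email": "a@example.com", "clicks": 3}]
anon = anonymize_partition(raw, {"email"})

assert anon[0]["clicks"] == 3                    # metrics survive
assert anon[0]["email"] != "a@example.com"       # PII does not
```

A background daemon would then swap the anonymized partition in for the original, keeping the partition-level immutability model intact.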

**On Small Data Becoming Big Data:** For Facebook-scale user dimensions with billions of rows, you might need to rethink pure snapshotting and use a mix of techniques, vertical partitioning, or keep dimensions thinner by moving fields to fact tables

### What Stands Out

1. **His pragmatism** - He acknowledges when rules should be broken (large dimensions, specialized frameworks for cumulative metrics)
2. **Storage vs. Engineering Time** - This is his core argument: duplicating data is cheaper than the complexity of managing mutations
3. **Reproducibility obsession** - Everything traces back to being able to reproduce results deterministically
4. **The "complexity score"** concept - A simple mental model for evaluating pipeline dependencies

