# Functional Data Engineering

Talk from Maxime Beauchemin about this:

* <https://www.youtube.com/watch?v=4Spo2QRTz1k&t=504s>

His article about the topic:

* <https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>

Blueprint for functional data engineering: <https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint>

***

### Insights from the talk and the article

#### **Core Definitions**

**Idempotence** - The property of an operation whereby applying it multiple times does not change the result beyond the initial application

His glass of water analogy is brilliant:

```
NOT Idempotent:
"Add a little water" → Run 5 times → Spillage!

Idempotent:
"Fill the glass with water" → Run N times → Same result
```
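The same contrast can be sketched in a few lines of Python, using an in-memory dict as a stand-in for a warehouse (all names here are illustrative, not from the talk):

```python
warehouse = {}

def append_rows(partition, rows):
    # NOT idempotent: each rerun adds more rows ("spillage")
    warehouse.setdefault(partition, []).extend(rows)

def overwrite_partition(partition, rows):
    # Idempotent: N reruns leave the exact same state
    warehouse[partition] = list(rows)

for _ in range(5):
    append_rows("ds=append", [{"id": 1}])
    overwrite_partition("ds=overwrite", [{"id": 1}])

assert len(warehouse["ds=append"]) == 5     # spillage after 5 runs
assert len(warehouse["ds=overwrite"]) == 1  # same result every run
```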

**Immutable Partitions** - Partitions are the building blocks of your data warehouse, and you should think of data lineage as a graph of partitions rather than just tables

#### **Pure ETL Tasks**

Pure ETL tasks are idempotent and deterministic (given the same source partitions, they produce the same target partition), have no side effects, read only immutable sources, and usually target a single partition. In practice this means never doing UPDATE, UPSERT, APPEND, or DELETE - only INSERT or INSERT OVERWRITE
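A minimal sketch of what such a "pure" task looks like, with an in-memory store standing in for the warehouse (table names and helper functions are made up for illustration):

```python
# Immutable source partitions, keyed by (table, schedule date)
store = {
    ("raw.orders", "2024-01-01"): [{"order_id": 1, "user_id": 7}],
    ("raw.users", "2024-01-01"): [{"user_id": 7, "country": "DE"}],
}

def read_partition(table, ds):
    return store[(table, ds)]          # reads never mutate

def overwrite_partition(table, ds, rows):
    store[(table, ds)] = list(rows)    # never UPDATE/UPSERT/APPEND

def pure_task(ds):
    # Deterministic transform: no randomness, no wall-clock reads
    users = {u["user_id"]: u for u in read_partition("raw.users", ds)}
    enriched = [
        {**o, "country": users[o["user_id"]]["country"]}
        for o in read_partition("raw.orders", ds)
    ]
    # The ONLY write: overwrite exactly one target partition
    overwrite_partition("fact.orders_enriched", ds, enriched)

pure_task("2024-01-01")
pure_task("2024-01-01")  # rerun/backfill: identical state, no duplicates
```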

#### **Slowly Changing Dimensions - The Rant**

This is one of the most passionate parts! He systematically demolishes traditional SCD approaches:

```
TYPE 1 (Override):
✗ Full of mutations
✗ Lose history
✗ Same query today ≠ yesterday

TYPE 2 (Add rows with surrogate keys):
✗ Super hard to manage
✗ Complex surrogate key lookups
✗ Makes loading facts harder
✗ Full of mutations

TYPE 3 (Add columns):
✗ "Kind of a half-ass approach"
✗ Bad compromise

HIS SOLUTION: Snapshot everything daily
✓ Storage is cheap
✓ Engineering time is expensive
✓ Mental model is simple
✓ Reproducibility is invaluable
```

His advice: "If you ever hear about slowly changing dimension again you can just say all that stuff is absolute nonsense"
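The snapshot approach is simple enough to sketch directly: write the *full* dimension into a date-keyed partition every day, and history becomes a lookup rather than a mutation (a toy sketch, not his actual implementation):

```python
snapshots = {}

def snapshot_dimension(ds, rows):
    snapshots[ds] = list(rows)   # full copy daily: storage is cheap

def dimension_as_of(ds):
    return snapshots[ds]         # reproducible point-in-time view

snapshot_dimension("2024-01-01", [{"user_id": 7, "plan": "free"}])
snapshot_dimension("2024-01-02", [{"user_id": 7, "plan": "pro"}])

# "Same query today = same query yesterday": just pick the partition
assert dimension_as_of("2024-01-01")[0]["plan"] == "free"
assert dimension_as_of("2024-01-02")[0]["plan"] == "pro"
```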

#### **Late-Arriving Facts**

You need two time dimensions: event time and event processing time. Partition on event processing time so you can close partitions quickly and land immutable blocks

The trade-off:

```
Partition on processing time:
✓ Can close partitions immediately
✓ Immutable staging area
✗ Lose partition pruning on event time

Mitigations:
- Execution engine optimizations (Parquet footers)
- Sub-partition by event time
- Repartition later for query-optimized tables
```
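The trade-off can be sketched in Python: land events in immutable partitions keyed by *processing* date (keeping event time as a column), then repartition later for a query-optimized table (all structures here are illustrative):

```python
staging = {}  # processing_date -> rows; closed and immutable once landed

def land(processing_date, events):
    staging[processing_date] = list(events)

land("2024-01-02", [
    {"event_time": "2024-01-01", "v": 1},   # late-arriving fact
    {"event_time": "2024-01-02", "v": 2},
])

# Later batch step: repartition by event time for query-optimized tables,
# restoring partition pruning on the dimension analysts actually filter on
by_event_date = {}
for rows in staging.values():
    for e in rows:
        by_event_date.setdefault(e["event_time"], []).append(e)

assert len(by_event_date["2024-01-01"]) == 1  # late fact now queryable
```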

#### **Self/Past Dependencies - Avoid!**

Self or past dependencies should be avoided. Loading a dimension by taking yesterday's partition and applying today's changes leads to a high "complexity score": computing the current partition requires a long chain of historical partitions, which prevents parallelization and forces backfills to run sequentially

His "complexity score" concept:

```
Good: Partition depends on 5 source partitions
Bad: Partition depends on 1000+ sequential partitions
```
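The score itself is just a transitive-dependency count, which is easy to sketch (my own toy formulation of his concept, with made-up partition names):

```python
def complexity_score(partition, deps):
    """Count how many upstream partitions are transitively required."""
    seen, stack = set(), [partition]
    while stack:
        for parent in deps.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return len(seen)

# Good: a partition depends on a handful of source partitions
good = {"dim/d3": ["src_a/d3", "src_b/d3"]}

# Bad: self-dependency chains back through every prior day
bad = {f"dim/d{i}": [f"dim/d{i-1}", f"src/d{i}"] for i in range(1, 1001)}

assert complexity_score("dim/d3", good) == 2
assert complexity_score("dim/d1000", bad) == 2000  # sequential backfill pain
```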

#### **File Explosion Problem**

Partitioning everything leads to file explosion in HDFS/S3. Mitigations include being careful with sub-partitioning, avoiding very short schedule intervals (5-minute partitions create too many files), and compacting earlier partitions together
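The compaction mitigation amounts to merging many small per-interval files into one larger file per coarser partition; a toy sketch (paths and layout are illustrative):

```python
# Many small files from 5-minute intervals, grouped under daily partitions
small_files = {
    "events/2024-01-01/00-05.json": [{"v": 1}],
    "events/2024-01-01/00-10.json": [{"v": 2}],
    "events/2024-01-02/00-05.json": [{"v": 3}],
}

# Compact: one file per day instead of one per interval
compacted = {}
for path, rows in sorted(small_files.items()):
    day = path.split("/")[1]
    compacted.setdefault(f"events/{day}/compacted.json", []).extend(rows)

assert len(compacted) == 2  # file count drops from 3 to 2
```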

#### **The Big Picture: Times Have Changed**

The landscape has completely changed since original data warehousing books were written - we now have cheap limitless storage, distributed databases with virtually infinite compute, read-optimized stores with immutable file formats, and everyone (not just small specialized teams) participates in data warehousing

#### **His Philosophy**

"First learn the rules and then break them" - it's good to know the methodology but you should make your own decisions based on your environment

### Interesting Q\&A Insights

**On GDPR/Right to be Forgotten:** Maxime suggests having an anonymization framework with metadata on tables indicating they contain non-anonymous data, with a background daemon that encrypts/hashes fields and moves immutable partitions into anonymized equivalents
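The core move in that suggestion - rewriting an immutable partition into an anonymized equivalent by hashing PII fields - could look roughly like this (a hedged sketch of the idea, not his framework; field names are made up):

```python
import hashlib

def anonymize_partition(rows, pii_fields):
    """Rewrite a partition with PII fields replaced by stable hashes."""
    def hash_value(v):
        return hashlib.sha256(str(v).encode()).hexdigest()[:12]
    return [
        {k: (hash_value(v) if k in pii_fields else v) for k, v in row.items()}
        for row in rows
    ]

raw = [{"user_id": 7, "email": "a@example.com", "clicks": 3}]
anon = anonymize_partition(raw, {"email"})

assert anon[0]["clicks"] == 3                    # metrics survive
assert anon[0]["email"] != "a@example.com"       # PII does not
```

A background daemon would then swap the anonymized partition in for the original, keeping the partition-level immutability model intact.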

**On Small Data Becoming Big Data:** For Facebook-scale user dimensions with billions of rows, you might need to rethink pure snapshotting and use a mix of techniques, vertical partitioning, or keep dimensions thinner by moving fields to fact tables

### What Stands Out

1. **His pragmatism** - He acknowledges when rules should be broken (large dimensions, specialized frameworks for cumulative metrics)
2. **Storage vs. Engineering Time** - This is his core argument: duplicating data is cheaper than the complexity of managing mutations
3. **Reproducibility obsession** - Everything traces back to being able to reproduce results deterministically
4. **The "complexity score"** concept - A simple mental model for evaluating pipeline dependencies

