Functional Data Engineering
Notes from Maxime Beauchemin's talk on this topic and his accompanying article:
Blueprint for functional data engineering: https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint
Insights from the talk and the article
Core Definitions
Idempotence - The property of certain operations that they can be applied multiple times without changing the result beyond the initial application
His glass of water analogy is brilliant:
NOT Idempotent:
"Add a little water" → Run 5 times → Spillage!
Idempotent:
"Fill the glass with water" → Run N times → Same resultImmutable Partitions - Partitions are the building blocks of your data warehouse, and you should think of data lineage as a graph of partitions rather than just tables
Pure ETL Tasks
Pure ETL tasks are idempotent, deterministic (given the same source partitions, they produce the same target partition), have no side effects, use immutable sources, and usually target a single partition. This means never doing UPDATE, UPSERT, APPEND, or DELETE - only INSERT or INSERT OVERWRITE
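Viewed this way, a pure task is just a deterministic function from source partitions to exactly one target partition. A sketch under assumed names (`fact_sales`, `raw_orders`), with the execution date passed in rather than computed inside the task so reruns stay deterministic:

```python
def build_fact_sales(ds: str) -> str:
    """Pure ETL task: same source partitions in, same target partition out.

    - reads only immutable source partitions for `ds`
    - writes exactly one target partition
    - INSERT OVERWRITE only; never UPDATE, UPSERT, APPEND, or DELETE
    - no now()/rand(): `ds` comes from the scheduler, so the output is fully
      determined by the inputs
    """
    return f"""
        INSERT OVERWRITE TABLE fact_sales PARTITION (ds = '{ds}')
        SELECT order_id, customer_id, amount
        FROM raw_orders
        WHERE ds = '{ds}'
    """

print(build_fact_sales("2024-01-01"))
```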
Slowly Changing Dimensions - The Rant
This is one of the most passionate parts! He systematically demolishes traditional SCD approaches:
His advice: "If you ever hear about slowly changing dimension again you can just say all that stuff is absolute nonsense"
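His alternative, per the article, is to stop tracking changes row by row and instead land a full snapshot of the dimension in a new partition every day - storage is cheap, and the full history comes for free. A minimal sketch of that pattern (the `dim_customer` table and its columns are made up):

```python
def snapshot_dim_customer(ds: str) -> str:
    """Land a full copy of the dimension in today's partition.

    No SCD type-2 bookkeeping (effective_date / end_date / current-row flags):
    every `ds` partition holds the complete dimension as of that day.
    """
    return f"""
        INSERT OVERWRITE TABLE dim_customer PARTITION (ds = '{ds}')
        SELECT customer_id, name, country, plan
        FROM prod_replica.customers
    """

# Point-in-time attributes come from joining facts to the snapshot of the same day:
POINT_IN_TIME_JOIN = """
    SELECT f.order_id, d.country
    FROM fact_orders f
    JOIN dim_customer d
      ON d.customer_id = f.customer_id
     AND d.ds = f.ds
"""
```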
Late-Arriving Facts
You need two time dimensions: event time and event processing time. Partition on event processing time so you can close the loop and land immutable blocks
The trade-off: partitions stay immutable and late facts simply land in later partitions, but analysis by event time has to scan a range of processing-time partitions rather than a single one
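A sketch of how the two time columns coexist (the `fact_events` table and its columns are assumptions): the partition key is processing time, so a day's block can be closed and never rewritten, while event time stays an ordinary column for analysis.

```python
def load_fact_events(processing_ds: str) -> str:
    """Partition on event *processing* time: once the day closes, its partition is
    immutable; late-arriving facts simply land in a later processing-time partition."""
    return f"""
        INSERT OVERWRITE TABLE fact_events PARTITION (processing_ds = '{processing_ds}')
        SELECT event_id, event_time, user_id, payload
        FROM raw_stream
        WHERE received_ds = '{processing_ds}'
    """

# The trade-off in practice: querying by *event* time means scanning a window of
# processing-time partitions to catch the stragglers.
EVENT_TIME_QUERY = """
    SELECT COUNT(*) AS events_on_jan_1
    FROM fact_events
    WHERE processing_ds BETWEEN '2024-01-01' AND '2024-01-07'
      AND DATE(event_time) = '2024-01-01'
"""
```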
Self/Past Dependencies - Avoid!
Self or past dependencies should be avoided - loading a dimension by taking yesterday's partition and applying today's changes leads to high "complexity scores", where many historical partitions are needed to compute the current one; this prevents parallelization and makes backfills sequential
His "complexity score" concept:
File Explosion Problem
Partitioning everything leads to file explosion in HDFS/S3. Mitigations include being careful with sub-partitioning, avoiding very short schedule intervals (5-minute partitions create too many files), and compacting earlier partitions together
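One way to read "compacting earlier partitions together": once a period has closed, roll its fine-grained partitions up into a coarser one so the object store holds far fewer small files. A hedged sketch (table names and layout are assumptions):

```python
def compact_month(year: int, month: int) -> str:
    """Rewrite a closed month of daily partitions into a single monthly partition,
    trading partition granularity for far fewer small files on HDFS/S3."""
    ym = f"{year:04d}-{month:02d}"
    return f"""
        INSERT OVERWRITE TABLE events_monthly PARTITION (ym = '{ym}')
        SELECT *
        FROM events_daily
        WHERE ds LIKE '{ym}-%'
    """

print(compact_month(2024, 1))
```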
The Big Picture: Times Have Changed
The landscape has completely changed since original data warehousing books were written - we now have cheap limitless storage, distributed databases with virtually infinite compute, read-optimized stores with immutable file formats, and everyone (not just small specialized teams) participates in data warehousing
His Philosophy
"First learn the rules and then break them" - it's good to know the methodology but you should make your own decisions based on your environment
Interesting Q&A Insights
On GDPR/Right to be Forgotten: Maxime suggests an anonymization framework with metadata on tables indicating that they contain non-anonymized data, plus a background daemon that encrypts/hashes those fields and moves immutable partitions into anonymized equivalents
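A rough sketch of what such a framework might look like - the metadata structure, column lists, table names, and hashing choice below are all assumptions, not a description of an existing tool:

```python
# Hypothetical metadata: tables flagged as containing non-anonymized personal data,
# plus which of their columns the background daemon must hash before rewriting.
PII_TABLES = {
    "fact_signups": {
        "all_columns": ["user_id", "email", "ip_address", "plan"],
        "pii_columns": ["email", "ip_address"],
    },
}

def anonymize_partition_sql(table: str, ds: str) -> str:
    """Build the statement the daemon would run to replace one immutable partition
    with its anonymized equivalent (hashed PII columns, everything else untouched)."""
    meta = PII_TABLES[table]
    select_list = ", ".join(
        f"sha2({col}, 256) AS {col}" if col in meta["pii_columns"] else col
        for col in meta["all_columns"]
    )
    return f"""
        INSERT OVERWRITE TABLE {table}_anonymized PARTITION (ds = '{ds}')
        SELECT {select_list}
        FROM {table}
        WHERE ds = '{ds}'
    """

print(anonymize_partition_sql("fact_signups", "2024-01-01"))
```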
On Small Data Becoming Big Data: For Facebook-scale user dimensions with billions of rows, you might need to rethink pure snapshotting and use a mix of techniques, vertical partitioning, or keep dimensions thinner by moving fields to fact tables
What Stands Out
His pragmatism - He acknowledges when rules should be broken (large dimensions, specialized frameworks for cumulative metrics)
Storage vs. Engineering Time - This is his core argument: duplicating data is cheaper than the complexity of managing mutations
Reproducibility obsession - Everything traces back to being able to reproduce results deterministically
The "complexity score" concept - A simple mental model for evaluating pipeline dependencies