Functional Data Engineering

Talk from Maxime Beauchemin about this:

His article about the topic:

Blueprint for functional data engineering: https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint


Insights from the talk and the article

Core Definitions

Idempotence - The property of certain operations that they can be applied multiple times without changing the result beyond the initial application

His glass of water analogy is brilliant:

NOT Idempotent:
"Add a little water" → Run 5 times → Spillage!

Idempotent:
"Fill the glass with water" → Run N times → Same result

Immutable Partitions - Partitions are the building blocks of your data warehouse, and you should think of data lineage as a graph of partitions rather than just tables
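
A small sketch of what "lineage as a graph of partitions" can look like, assuming hypothetical table and date names: every node is a (table, partition) pair, and edges point from a target partition back to the exact source partitions it was built from.

```python
# Hypothetical partition-level lineage: each target partition lists the exact
# source partitions it was computed from, rather than pointing at whole tables.
lineage = {
    ("fct_orders", "2024-01-01"): [
        ("raw_orders", "2024-01-01"),
        ("dim_users", "2024-01-01"),
    ],
    ("agg_revenue", "2024-01-01"): [
        ("fct_orders", "2024-01-01"),
    ],
}

def upstream(partition, graph):
    """Return every source partition a given partition transitively depends on."""
    seen = []
    for src in graph.get(partition, []):
        seen.append(src)
        seen.extend(upstream(src, graph))
    return seen

print(upstream(("agg_revenue", "2024-01-01"), lineage))
```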

Pure ETL Tasks

Pure ETL tasks are idempotent, deterministic (given the same source partitions, they produce the same target partition), have no side effects, use immutable sources, and usually target a single partition. This means never doing UPDATE, UPSERT, APPEND, or DELETE - only INSERT or INSERT OVERWRITE
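
A hedged sketch of such a task, using Hive/Spark-style SQL and a hypothetical `run_query` helper (none of the table names come from the talk): it reads fixed source partitions and does nothing but INSERT OVERWRITE one target partition.

```python
# A pure, idempotent load for one partition. `run_query` is a hypothetical
# helper that submits SQL to the warehouse (e.g. Hive, Spark, or Trino).
def load_fct_orders(ds: str, run_query) -> None:
    sql = f"""
    INSERT OVERWRITE TABLE fct_orders PARTITION (ds = '{ds}')
    SELECT
        o.order_id,
        o.user_id,
        o.amount
    FROM raw_orders o          -- immutable source partition
    WHERE o.ds = '{ds}'
    """
    # Deterministic: same source partition in, same target partition out.
    # Re-running simply replaces the partition; no UPDATE/UPSERT/DELETE anywhere.
    run_query(sql)
```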

Slowly Changing Dimensions - The Rant

This is one of the most passionate parts! He systematically demolishes traditional SCD approaches:

His advice: "If you ever hear about slowly changing dimension again you can just say all that stuff is absolute nonsense"

Late-Arriving Facts

You need two time dimensions: event time and event processing time. Partition on event processing time so you can close the loop and land immutable blocks

The trade-off: partitioning on event processing time keeps each load immutable and idempotent, but queries that filter on event time have to scan several processing-time partitions to pick up late-arriving events
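
A sketch of the query side under these assumptions (the table, columns, and lateness window are invented): the fact table is partitioned on the processing date `ds`, events carry their own `event_ts`, and an event-time query scans a small range of processing-time partitions to catch late arrivals.

```python
# Hypothetical fact table partitioned on processing date `ds`, with the
# event's own timestamp stored as a regular column `event_ts`.
def events_for_day(event_date: str, max_lateness_days: int = 3) -> str:
    """Build a query for one event-time day, scanning the processing-time
    partitions in which those events could have landed."""
    return f"""
    SELECT *
    FROM fct_events
    WHERE ds BETWEEN '{event_date}'
              AND date_add('{event_date}', {max_lateness_days})
      AND to_date(event_ts) = '{event_date}'
    """

print(events_for_day("2024-01-01"))
```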

Self/Past Dependencies - Avoid!

Self or past dependencies should be avoided - loading a dimension by taking yesterday's partition and applying today's changes leads to high "complexity scores", where many historical partitions are needed to compute the current one, which prevents parallelization and makes backfills sequential

His "complexity score" concept:

File Explosion Problem

Partitioning everything leads to file explosion in HDFS/S3. Mitigations include being careful with sub-partitioning, avoiding very short schedule intervals (5-minute partitions create too many files), and compacting older partitions together
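
A hedged sketch of the compaction idea, with invented table names: fold a day's worth of small hourly partitions into a single daily partition, again via INSERT OVERWRITE, so the file count stays manageable.

```python
# Hypothetical compaction job: rewrite 24 hourly sub-partitions as one
# daily partition in a read-optimized table.
def compact_day(ds: str) -> str:
    return f"""
    INSERT OVERWRITE TABLE events_daily PARTITION (ds = '{ds}')
    SELECT event_id, user_id, event_ts       -- illustrative columns
    FROM events_hourly
    WHERE ds = '{ds}'                         -- covers all hourly sub-partitions
    """
```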

The Big Picture: Times Have Changed

The landscape has completely changed since the original data warehousing books were written - we now have cheap, virtually limitless storage, distributed databases with near-infinite compute, read-optimized stores built on immutable file formats, and everyone (not just small specialized teams) participating in data warehousing

His Philosophy

"First learn the rules and then break them" - it's good to know the methodology but you should make your own decisions based on your environment

Interesting Q&A Insights

On GDPR/Right to be Forgotten: Maxime suggests an anonymization framework with metadata on tables indicating that they contain non-anonymized data, plus a background daemon that encrypts/hashes the sensitive fields and rewrites immutable partitions into anonymized equivalents
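
A minimal sketch of that idea (the metadata layout, table names, and hashing choice are assumptions, not from the talk): metadata flags the PII columns, and a background job hashes them so the partition can be rewritten as an anonymized equivalent.

```python
import hashlib

# Hypothetical metadata: which tables hold non-anonymized data, and which
# fields need hashing once the retention window has passed.
PII_TABLES = {
    "fct_signups": ["email", "ip_address"],
}

def anonymize_row(row: dict, pii_fields: list) -> dict:
    # Replace each PII field with a one-way hash so the partition can be
    # rewritten as an immutable, anonymized equivalent.
    out = dict(row)
    for field in pii_fields:
        if out.get(field) is not None:
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()
    return out

def anonymize_partition(rows, table: str):
    pii_fields = PII_TABLES.get(table, [])
    return [anonymize_row(r, pii_fields) for r in rows]

# The background daemon would read an old partition, apply this, and
# INSERT OVERWRITE the anonymized result in its place.
print(anonymize_partition([{"user_id": 7, "email": "a@b.com"}], "fct_signups"))
```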

On Small Data Becoming Big Data: For Facebook-scale user dimensions with billions of rows, you might need to rethink pure snapshotting and use a mix of techniques - vertical partitioning, or keeping dimensions thinner by moving fields to fact tables
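
A short sketch of the vertical-partitioning option, with invented tables and columns: snapshot only the thin, frequently joined columns of the user dimension, and keep wide, rarely used attributes in a separate table.

```python
# Hypothetical split of a billions-row user dimension into a thin "core"
# snapshot plus a separate table for wide, rarely-used attributes.
def snapshot_user_dims(ds: str) -> list:
    core = f"""
    INSERT OVERWRITE TABLE dim_users_core PARTITION (ds = '{ds}')
    SELECT user_id, country, signup_date, is_active   -- thin, frequently joined
    FROM users_source
    """
    extended = f"""
    INSERT OVERWRITE TABLE dim_users_extended PARTITION (ds = '{ds}')
    SELECT user_id, bio_text, preferences_json        -- wide, rarely queried
    FROM users_source
    """
    return [core, extended]
```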

What Stands Out

  1. His pragmatism - He acknowledges when rules should be broken (large dimensions, specialized frameworks for cumulative metrics)

  2. Storage vs. Engineering Time - This is his core argument: duplicating data is cheaper than the complexity of managing mutations

  3. Reproducibility obsession - Everything traces back to being able to reproduce results deterministically

  4. The "complexity score" concept - A simple mental model for evaluating pipeline dependencies

