Apache Hudi

Near real-time ingestion


Video from CMU, which covers pretty much all the theory you need about this technology.


Overview of Apache Hudi

The origins

Apache Hudi was developed internally at Uber in 2016 to address the critical latency and efficiency limitations of their massive 100PB+ data lake. Designed to replace the complex and slow "Lambda Architecture," Hudi introduced the ability to perform near real-time upserts and incremental processing directly on cloud storage. Uber open-sourced the project in 2017, and it entered the Apache Software Foundation incubator in 2019, graduating as a Top-Level Project in 2020.

Storage layout

(Diagram: Merge-on-Read table evolution, showing file slices across commits t1 to t4)

Components stack

Native Table Format

At its core, the Hudi table format acts as a structured database layer atop your data lake storage. While the data itself resides in open formats (like Parquet or Avro), the table format acts as the intelligence layer—managing the schema, partitions, and file layout. It is this structure that transforms a static collection of files into a transactional entity capable of ACID guarantees, efficient upserts, and time travel.

The format relies on three distinct pillars:

1. File Groups and File Slices

Hudi moves beyond simple file storage by organizing data into File Groups, which serve as the primary sharding unit for the table.

  • The File Group: Each group is identified by a unique FileID that remains constant. When a record is inserted, it is hashed/indexed to a specific File Group, and all future updates to that record will stay within this same group.

  • The File Slice (Versioning): Within a File Group, data is managed as a series of versions called File Slices. A new slice is created whenever a commit modifies the group.

    • Base File: The foundational artifact of a slice, usually stored in a columnar format like Parquet (or ORC) for optimized read performance.

    • Log Files: Delta updates that occur after the base file was written are captured in row-based Avro log files.

    • The Merge: When you query a specific point in time, Hudi constructs the "view" by taking the Base File of the relevant slice and merging it on-the-fly with its associated Log Files.
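
As a concrete illustration, the sketch below groups a partition's data files into File Groups and File Slices by parsing their names. It assumes the conventional Hudi naming scheme (base files named fileId_writeToken_instantTime.parquet, dot-prefixed log files containing ".log."); exact naming varies across Hudi releases, so treat it as illustrative rather than authoritative.

```python
# Illustrative only: reconstruct file groups/slices of one partition from file names,
# assuming the conventional naming (base: <fileId>_<writeToken>_<instantTime>.parquet,
# log: .<fileId>_<instantTime>.log.<version>_<writeToken>). Details vary by Hudi version.
from collections import defaultdict
from pathlib import Path

def group_file_slices(partition_path: str) -> dict:
    groups = defaultdict(lambda: {"base_files": [], "log_files": []})
    for f in Path(partition_path).iterdir():
        name = f.name
        if name.endswith(".parquet"):
            file_id = name.split("_")[0]               # FileID prefix of a base file
            groups[file_id]["base_files"].append(name)
        elif ".log." in name:
            file_id = name.lstrip(".").split("_")[0]   # log files are dot-prefixed
            groups[file_id]["log_files"].append(name)
    return dict(groups)

# Each key is a File Group; each base file plus the log files written after it
# form one File Slice of that group.
print(group_file_slices("/data/lake/trips/city=sf"))
```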

2. The Timeline

The Timeline is the "heartbeat" of a Hudi table—an immutable log of all actions performed on the dataset. Physically stored in the .hoodie directory, it serves as the single source of truth for table state and concurrency control.

  • Event Logging: Every operation (Ingestion, Compaction, Cleaning) is recorded as an "Instant" on the timeline.

  • Instant States: To ensure atomicity, actions transition through specific states: REQUESTED (intent to start), INFLIGHT (currently running), and COMPLETED (successfully committed).

  • Capabilities: By referencing the timeline, readers can isolate a consistent snapshot of the data (avoiding partial writes) or "time travel" to retrieve the state of the table as it existed at a specific commit time in the past.
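
To make the instant states tangible, here is a minimal sketch that lists the instant files and derives each state from the file-name suffix. It assumes the classic layout where instant files sit directly under <table>/.hoodie and start with the commit timestamp; newer Hudi releases move them into a timeline/ subfolder.

```python
# Minimal sketch: classify timeline instants by state, based on the file-name suffix.
# Assumes the classic .hoodie layout; adjust the path for newer Hudi releases.
from pathlib import Path

def instant_state(name: str) -> str:
    if name.endswith(".requested"):
        return "REQUESTED"   # intent to start
    if name.endswith(".inflight"):
        return "INFLIGHT"    # currently running
    return "COMPLETED"       # successfully committed

for f in sorted(Path("/data/lake/trips/.hoodie").glob("2*")):  # instants start with a timestamp
    print(f.name, "->", instant_state(f.name))
```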

3. The Metadata Table

Traditionally, data lakes suffered from the "S3 Listing" bottleneck—the slow process of listing millions of files to plan a query. The Metadata Table, located at .hoodie/metadata, solves this by indexing the physical file layout.

  • Internal MoR Table: Interestingly, the Metadata Table is itself implemented as a special Hudi Merge-On-Read (MoR) table internal to the system.

  • Performance: Instead of expensive recursive file listings on cloud storage, Hudi queries this internal table to instantly retrieve file listings and partition information.

  • Advanced Indexing: Beyond just file locations, it stores Column Statistics (min/max values for data skipping) and Bloom Filters, allowing the query engine to completely ignore files that don't contain relevant keys, drastically reducing I/O.
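
In practice these indexes are switched on through writer configuration. A minimal sketch of the relevant Spark write options is shown below; the keys reflect recent Hudi releases and may differ slightly in older versions.

```python
# Hedged sketch: enable the metadata table plus its optional indexes when writing with Spark.
# Option keys reflect recent Hudi releases; verify against the docs for your version.
metadata_options = {
    "hoodie.metadata.enable": "true",                     # file listings (on by default in recent releases)
    "hoodie.metadata.index.column.stats.enable": "true",  # min/max column stats for data skipping
    "hoodie.metadata.index.bloom.filter.enable": "true",  # bloom filters for key-based pruning
}
# Pass these alongside the other write options, e.g. df.write.format("hudi").options(**metadata_options)...
```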

The diagram above shows a Merge-on-Read (MoR) table's evolution (correction to the image: the File Slice @ t3 consists of a Base File and a Log File).

Explanation of the above image

  • File Slice @ t1 (v1): Contains only a Base File in Parquet format, representing the initial state

  • File Slice @ t2 (v2): Adds a Log File in Avro format alongside the Base File, capturing incremental updates

  • File Slice @ t3 (v3): Contains both Base File and Log File

  • File Slice @ t4 (post-compaction): After compaction, the Base File consolidates all previous changes, removing the need for separate log files

  • t1, t2, t3: Represent commits at different dates (Jan 1, Feb 2, March 3, 2024), all marked as "completed"

  • t4: Marked as "COMPLETED (Compaction)" where log files are merged


While the "Table Format" (which we discussed previously) describes how data is laid out, the Storage Engine is the active "brain" that manages the lifecycle, performance, and transactional integrity of that data. It transforms a passive file system into a fully functional database-like system.

Pluggable Table Format (Interoperability Layer)

Hudi has evolved beyond being just a "format" into a platform that can serve data to any ecosystem. A key capability is reading and writing other table formats; technically, this is achieved through metadata synchronization (the approach behind Apache XTable, formerly OneTable).

  • How it works: Instead of duplicating the massive Parquet data files, Hudi simply translates its own metadata (The Timeline) into the native metadata structures of other formats (e.g., Iceberg Manifests or Delta Logs).

  • The Benefit: You write data once using Hudi's advanced writer (getting upserts and indexes), but downstream consumers can read it as if it were a native Iceberg or Delta table.

  • Significance: This effectively decouples the Storage Engine (Hudi) from the Catalog/Reader format, preventing vendor lock-in and allowing you to use the best-in-class tools for both writing (Hudi) and querying (Snowflake, BigQuery, etc.).

Indexes (The Performance Key)

In a standard data lake (like Hive), updating a record usually requires scanning the entire partition to find the file containing that key. Hudi's indexing layer eliminates this bottleneck by maintaining a mapping from Record Key to File Group ID.

This indexing happens during the "Tagging" phase of a write operation, ensuring we know exactly which file to touch before we write a single byte.

  • Bloom Index: The default for many workloads. It builds a Bloom Filter (probabilistic data structure) for the keys in each file.

    • Best for: Random updates where keys are spread across many files.

  • Simple Index: Joins the incoming update batch against the existing data on storage.

    • Best for: Small tables where scanning is cheaper than maintaining a separate index.

  • Bucket Index: Uses a deterministic hash of the record key to map it to a static "bucket" (file group).

    • Best for: High-throughput writes. It requires zero index lookup overhead because the location is calculated mathematically.

  • Record Level Index (RLI): A global index (stored in the Metadata Table as HFiles) that maps every single key to its location.

    • Best for: Massive scale global lookups where you need to verify uniqueness across terabytes of data.
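
The index is chosen per table through writer configuration; the hedged sketch below uses option keys from recent Hudi releases (the record-level index additionally requires enabling it in the metadata table).

```python
# Hedged sketch: selecting the write index. Valid hoodie.index.type values include
# BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, BUCKET and RECORD_INDEX (version-dependent).
index_options = {
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",  # the RLI lives inside the metadata table
}

# A bucket-index alternative (zero lookup overhead, fixed number of buckets):
bucket_index_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "256",
}
```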

Concurrency Control (The "Traffic Cop")

Hudi manages how multiple writers and readers interact without stepping on each other's toes.

  • MVCC (Multi-Version Concurrency Control):

    • Concept: Readers and Writers never block each other.

    • Mechanism: If a user is reading the table at t1 (Snapshot V1), and a writer is committing t2 (Snapshot V2), the reader continues to see the consistent state of V1. They are isolated by the Timeline.

  • OCC (Optimistic Concurrency Control):

    • Concept: Multiple writers can try to write at the same time.

    • Mechanism: Before committing, a writer checks: "Did anyone else change the files I modified while I was working?" If yes, the write fails and retries. If no, it commits.

  • NBCC (Non-Blocking Concurrency Control):

    • Concept: A unique capability of Hudi (often enabled by Bucket Indexing).

    • Mechanism: If Writer A is assigned Bucket 1, and Writer B is assigned Bucket 2, they mathematically cannot conflict. Therefore, they can write in parallel without acquiring any locks or performing any conflict checks. This is ideal for massive streaming ingestion.
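
Multi-writer setups are opt-in and configured on the writer. Below is a hedged sketch of a typical OCC configuration; lock-provider classes vary by deployment, with ZooKeeper-, Hive-, and DynamoDB-based providers common in production.

```python
# Hedged sketch: enable optimistic concurrency control for multiple writers.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # In-process locks only cover writers in the same JVM; swap in a ZooKeeper/Hive/DynamoDB
    # lock provider for truly independent writers.
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    # Clean up failed writes lazily so one crashed writer does not block the others.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
}
```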

Lake Cache (The "Buffer Pool")

This is an emerging component designed to solve the latency of object storage (S3/GCS).

  • The Problem: Query engines usually have their own ephemeral caches, but they are lost when the cluster shuts down.

  • The Solution: Hudi's Lake Cache acts like a database Buffer Pool. It caches specific File Slices (merged data). Because it is integrated with the Hudi Timeline, it is "smart"—it knows exactly when a cached block is invalid (stale) because a new commit has happened. It enables split-second query responses on data lakes.

Table Services (The Maintenance Crew)

Just like a database has background threads (vacuum, checkpointing), Hudi has Table Services. These ensure the raw data remains efficient to read and store.

  • Compaction (Row-Level Repair):

    • Role: Converts "Messy" fast writes into "Clean" fast reads.

    • Action: Merges Log Files (Row-based) into Base Files (Columnar). Crucial for MoR tables.

  • Clustering (File-Level Optimization):

    • Role: Fixes the "Small File Problem" and sorts data.

    • Action: It takes many small files (e.g., 10 MB each) and combines them into optimally sized files (e.g., around 1 GB). It can also sort or Z-order the data during this process, grouping similar keys together so queries scan less data.

  • Cleaning (Garbage Collection):

    • Role: Reclaims storage space.

    • Action: Deletes old File Slices that are no longer needed for the retention policy (e.g., "Keep the last 10 versions"). This prevents storage costs from exploding.
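
These services can run inline with each write or as separate async jobs. The hedged sketch below shows an inline setup using commonly seen option keys; names may differ slightly across Hudi versions.

```python
# Hedged sketch: run the table services inline with each write.
table_service_options = {
    # Compaction (MoR): merge log files into base files every N delta commits.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: rewrite small files into larger, optionally sorted ones.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.plan.strategy.sort.columns": "city,trip_id",
    # Cleaning: bound how many old commits' worth of file slices are retained.
    "hoodie.cleaner.commits.retained": "10",
}
```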

Summary

The Storage Engine is what makes Hudi a "Data Lakehouse."

  1. Indexes find the data fast.

  2. Concurrency lets you write safely.

  3. Table Services keep the storage clean and optimized.

  4. Pluggable Formats let you share that data with the world.


While the "Table Format" defines the structure and the "Storage Engine" manages the mechanics, the Programming API is the interface where developers actually live. It provides the "knobs and dials" to control how data enters and leaves the lake, offering a level of granularity that standard SQL often hides.

The Writers (Ingestion & Management)

Hudi writers are not just simple file connectors; they are sophisticated state engines. When you write to a Hudi table using Spark, Flink, or Java, you are invoking a complex workflow that guarantees data integrity before a single byte hits the disk.

  • Record Keys at the Core (The "Primary Key"): Unlike standard Parquet tables where rows are anonymous, Hudi enforces a Primary Key constraint. This is the foundation of the entire system.

    • Why it matters: It allows Hudi to handle De-duplication automatically. If you send a batch of 1,000 records, and 50 are duplicates of existing rows, Hudi identifies them via the key and updates them rather than creating duplicates.

    • Key Generators: Hudi provides flexible logic to derive these keys. You can use a SimpleKeyGenerator (one column), ComplexKeyGenerator (composite keys), or even a TimestampBasedKeyGenerator for time-series data.

  • Operation-Based Optimizations: The API allows developers to declare their intent, and Hudi optimizes the execution path:

    • Upsert: The default mode. It uses the Index to tag incoming records as "Inserts" (new key) or "Updates" (existing key).

    • Insert: Faster than upsert. It skips the index lookup and uses heuristics (like small file handling) to pack data efficiently.

    • Bulk Insert: The "Big Bang" loader. It sorts data globally or by partition before writing, ensuring perfectly sized files. This is ideal for initial backfills where you don't need to check for duplicates against existing data.

  • Custom Merge Logic (Payloads): What happens when two writers update the same key at the same time? Or when an update arrives out of order?

    • Merge Modes: You can define the "conflict resolution" strategy. The default is Last-Write-Wins (based on a timestamp), but the API exposes a Payload interface where you can write custom Java/Scala logic (e.g., "If the new value is null, ignore it; otherwise, sum it with the old value").
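
Putting the writer concepts together, here is a minimal PySpark sketch of an upsert. The table name, columns, and path are hypothetical; the option keys are the standard Hudi datasource options, though defaults and key names can shift between releases.

```python
# Minimal sketch of a Hudi upsert with PySpark (hypothetical table/columns/path).
# Requires the Hudi Spark bundle on the classpath, e.g. via spark-submit --packages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-writer-sketch").getOrCreate()

df = spark.createDataFrame(
    [("t1", "sf", 1700000000, 12.5), ("t2", "nyc", 1700000100, 7.9)],
    ["trip_id", "city", "event_ts", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",             # or "insert" / "bulk_insert"
    "hoodie.datasource.write.recordkey.field": "trip_id",      # the primary key
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "event_ts",    # last-write-wins ordering field
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")   # "append" is the usual mode once the table exists
   .save("/data/lake/trips"))
```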


The Readers (Consumption & Analytics)

The Hudi Reader API transforms the data lake from a static repository into a dynamic stream of changes. It abstracts away the complexity of file versions, offering clean views to downstream consumers.

  • Snapshot Isolation (The "ACID" Guarantee): Readers never block writers, and writers never block readers.

    • How: When a reader starts a query, it gets a "Snapshot" of the Timeline. Even if a massive compaction or upsert job finishes while the query is running, the reader continues to see the consistent state from when it started. This eliminates the "dirty reads" common in traditional file-based ETL.

  • Time Travel & Point-in-Time Recovery: Because Hudi tracks the commit time of every change, the API allows you to query the table as it existed in the past.

    • Syntax: spark.read.format("hudi").option("as.of.instant", "20231220100000").load(...)

    • Use Case: Instant recovery from bad data pushes or reproducible machine learning model training.

  • Incremental Queries (The "Stream" View): This is Hudi's superpower. Instead of scanning the full table to find "what changed yesterday," the API allows you to pull only the change stream.

    • Regular Incremental: Returns the latest state of any record changed after timestamp X.

    • CDC (Change Data Capture): Returns the "Before" and "After" image of the record, allowing you to see exactly how a value evolved (e.g., status changed from PENDING -> APPROVED).
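
A hedged PySpark sketch of the main read patterns against the same hypothetical table (option keys per recent Hudi releases):

```python
# Hedged sketch: snapshot, time-travel, and incremental reads with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-reader-sketch").getOrCreate()
base_path = "/data/lake/trips"   # hypothetical table path

# 1) Snapshot: the latest consistent view.
snapshot_df = spark.read.format("hudi").load(base_path)

# 2) Time travel: the table as of a past instant (format yyyyMMddHHmmss).
past_df = (spark.read.format("hudi")
           .option("as.of.instant", "20231220100000")
           .load(base_path))

# 3) Incremental: only records changed after a given commit time.
incr_df = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", "20231220100000")
           .load(base_path))
```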

The Writer & Reader API Flow

(Diagram: the Writer & Reader API flow, broken down below)

Diagram Breakdown

1. The Writers (Top):

  • Input: Your application (Spark/Flink/Java) sends data to Hudi.

  • Key Generator: First, unique keys are derived (Record Key + Partition Path) — essential for de-duplication and routing.

  • Logic: The data goes through specific paths based on the operation:

    • Upserts/Deletes trigger an Index Lookup (Tagging) using one of the index types (Bloom/Simple/Bucket/RLI) to find existing records.

    • Inserts/Bulk Inserts use Bin-Packing & Sorting heuristics to handle small files efficiently.

    • Merge/Payload Logic: Custom logic determines how to combine records (precombine or custom merge strategies).

  • Write Phase: After tagging or bin-packing, data is written to Data Files (Base Files for CoW, Log Files for MoR).

  • Commit Phase: OCC/MVCC checks ensure thread safety and conflict resolution before the write is finalized.

  • Commit to Timeline: The operation creates an instant on the Timeline, making the data visible to readers.

2. The Storage (Center):

  • The central hub containing three key components:

    • Timeline: The source of truth for table state, tracking all completed and in-progress instants.

    • Data Files: Physical storage containing Base Files (Parquet/Avro columnar) and Log Files (Avro row-based for MoR).

    • Metadata Table: Stores file listings, column statistics, and bloom filters for query optimization.

3. The Table Services (Maintenance):

  • Background Async Processes that run independently to optimize storage:

    • Compaction: Merges Log Files into Base Files (critical for MoR tables).

    • Clustering: Combines small files into larger optimized files, optionally applying Z-Order sorting.

    • Cleaning: Deletes old file versions based on retention policies to reclaim storage.

  • Triggered by the Timeline and operate on Data Files to maintain optimal performance.

4. The Readers (Bottom):

  • Query Engines: Tools like Spark, Flink, Trino, Hive, and Presto read from Hudi storage.

  • Query Types: The Reader API translates user intent into storage actions:

    • Snapshot Query: Merges Base + Log files using the Timeline to provide the latest consistent view.

    • Incremental Query: Reads only the changes (deltas) from the Timeline for CDC or streaming use cases.

    • Time Travel Query: Accesses historical snapshots by referencing past instants on the Timeline.

    • Read-Optimized Query: Skips the log files and reads only Base Files for faster columnar scans (data may be slightly stale until the next compaction).

  • Metadata Table: Provides file listings, statistics, and bloom filters to query engines for efficient query planning and data skipping.


User Access and Shared Platform Components layers

These final two layers are arguably the most visible to the end-user. "User Access" defines how you talk to the data, while "Shared Platform Components" defines how you operate the platform in production without reinventing the wheel.

User Access (The Interface Layer)

This layer acts as the bridge between the complex internals of the storage engine and the diverse needs of data consumers. Whether you are a Data Engineer writing ETL, an Analyst running SQL, or a Data Scientist building models, Hudi provides a native interface for your tool of choice.

SQL (The Universal Language)

Hudi turns your file system into a database that speaks SQL. It abstracts away the complexity of Parquet files, Avro logs, and compaction instants, presenting a clean, tabular interface to query engines.

  • Batch & Streaming (Spark & Flink):

    • Most heavy-lifting ETL is done here. Spark is the "native tongue" of Hudi, supporting sophisticated MERGE INTO (upsert) syntax that mimics standard SQL (see the sketch after this list).

    • Real-time: Uniquely, Hudi supports incremental streaming SQL. You can write a streaming query in Flink or Spark Structured Streaming that continuously listens to a Hudi table and processes only the new changes as they arrive, acting as a high-performance message bus.

  • Interactive Analytics (Trino, Presto, Hive):

    • Hudi is designed to "meet queries where they are." You don't need a special "Hudi Query Server." You simply point Trino or Presto at your S3 bucket (via Glue/Hive Metastore), and they treat Hudi tables like standard tables—but with the added benefit of seeing the latest snapshots immediately.

  • OLAP Acceleration (StarRocks, ClickHouse):

    • For sub-second latency on massive datasets, Hudi integrates with high-performance OLAP engines. These engines can ingest data directly from Hudi or query it in place (external tables), leveraging Hudi's metadata to prune files and read data efficiently.
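
As referenced above, the upsert path is also reachable from plain SQL. Here is a hedged sketch using Spark SQL; the database, table, and view names are hypothetical, and it assumes the Hudi SQL extensions are configured on the session.

```python
# Hedged sketch: SQL upsert into a Hudi table via Spark SQL (hypothetical names).
from pyspark.sql import SparkSession

# Assumes the Hudi SQL extensions are available; class name per recent releases.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
         .getOrCreate())

spark.sql("""
  MERGE INTO lake_db.trips AS t
  USING staging_trips AS s
  ON t.trip_id = s.trip_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```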

Code (The Developer's Toolkit)

While SQL is great for logic, modern data science and complex engineering often require full programming languages.

  • JVM (Java/Scala): The core Hudi library allows deep integration. You can write custom payload logic (how to merge two records) or extend key generators in Java.

  • Python (The Data Science Standard):

    • Hudi has expanded heavily into the Python ecosystem (PyHudi).

    • Ray & Daft: Support for these modern distributed Python frameworks means you can now hydrate ML training data directly from Hudi tables into dataframes without needing a heavy Spark cluster.

  • Native Bindings (Rust/C++): By offering Rust bindings, Hudi ensures it can be embedded into ultra-fast, native execution engines, removing the JVM overhead for high-performance reading.
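
Picking up the Ray & Daft point above, here is a very small sketch of reading a Hudi table from plain Python, assuming your Daft version ships the Hudi reader; the table path is hypothetical.

```python
# Hedged sketch: read a Hudi table into a Daft dataframe without Spark.
# Assumes daft.read_hudi is available in your Daft version; the path is hypothetical.
import daft

df = daft.read_hudi("/data/lake/trips")
df.show()
```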

Shared Platform Components (The "Batteries Included" Utilities)

One of Hudi's strongest differentiators is that it is not just a format specification; it is a platform. It includes a suite of battle-tested utilities that solve the "boring" but difficult parts of data engineering, so you don't have to write them from scratch.

Effortless Streaming & Ingestion (Hudi Streamer)

Writing a production-grade ingestion job that handles checkpoints, schema evolution, and retries is hard.

  • The Solution: The Hudi Streamer (formerly DeltaStreamer) is a "zero-code" ingestion utility.

  • Capabilities: You simply provide a configuration file ("Read from Kafka Topic X, Write to Hudi Table Y"), and it handles the rest.

  • CDC Integration: It has native support for Debezium and Kafka Connect. It can read raw database change logs and perfectly replicate them into Hudi, automatically handling the "hard stuff" like hard deletes and out-of-order updates.

Seamless Catalog & Metadata Management

A file on S3 is useless if your query engine doesn't know it exists.

  • Sync Utilities: As Hudi writes data, it actively synchronizes metadata to catalogs like AWS Glue, Hive Metastore, or DataHub.

  • Why it matters: This ensures that the second a write commits, the table is visible in Trino or Snowflake. It automates the "Update Partition" drudgery.
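
Catalog sync is configured on the writer as well. A hedged sketch of Hive Metastore / AWS Glue sync options follows; key names are per recent releases, and the database/table names are hypothetical.

```python
# Hedged sketch: sync table metadata to a Hive Metastore (or AWS Glue) on every commit.
sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",             # talk to the metastore directly
    "hoodie.datasource.hive_sync.database": "analytics",   # hypothetical database name
    "hoodie.datasource.hive_sync.table": "trips",          # hypothetical table name
}
```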

Comprehensive Administration & Monitoring

Running a data lake at scale requires visibility.

  • Hudi CLI: A command-line tool that lets you inspect the timeline, view file sizes, roll back bad commits, and repair the table manually if needed.

  • Observability: Hudi emits rich metrics (Commit Latency, File Size histograms, Compaction duration) to Prometheus, Datadog, and CloudWatch. This allows platform engineers to set alerts (e.g., "Alert me if Compaction takes > 10 minutes").
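
Metrics are likewise switched on through configuration. Below is a hedged sketch of a Prometheus push-gateway setup; reporter types and key names vary by release, and the endpoint is hypothetical.

```python
# Hedged sketch: emit Hudi writer metrics to a Prometheus pushgateway.
metrics_options = {
    "hoodie.metrics.on": "true",
    "hoodie.metrics.reporter.type": "PROMETHEUS_PUSHGATEWAY",   # DATADOG, CLOUDWATCH, JMX, GRAPHITE are other options
    "hoodie.metrics.pushgateway.host": "pushgateway.internal",  # hypothetical endpoint
    "hoodie.metrics.pushgateway.port": "9091",
}
```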

This diagram visualizes how users interact with the system (Top) and how the platform utilities (Side) support the core operation.

(Diagram: user access at the top and shared platform utilities at the side, supporting the core)


