Apache Iceberg

Best for batch workloads


Docs

On demand videos about Apache Iceberg

Video from CMU


Apache Iceberg's distinctive architecture goes beyond addressing Hive's limitations, enabling a wide range of advanced capabilities for data lakes and lakehouse environments. This overview introduces Iceberg's core features.

Apache Iceberg Architecture

Apache Iceberg uses a three-layer metadata architecture that separates the data files from their organization and management:

Diagram of Apache Iceberg architecture


Data Layer: Data and Delete Files

  • Composition: Contains the raw records stored in open file formats (Parquet, ORC, or Avro) alongside Delete Files. Although the layer itself is format-agnostic, Apache Parquet is the industry standard due to its superior columnar compression and query performance.

  • Function: Serves as the physical repository for the information query engines retrieve to satisfy user requests.

  • Immutability: Files are write-once (immutable). Once written to storage (S3, ADLS, GCS, or HDFS), they are never modified in place; updates or deletes result in new files being created.

  • Logical Management: Data is managed logically via metadata rather than physical directory structures. Operations like partitioning or re-organizing data do not require expensive physical file movements, only metadata updates.

Data Files

What are they?

Data files contain the actual content of your table: the rows of logs, transaction history, or customer details you are analyzing. Unlike the metadata files (which use Avro and JSON), data files use standard open-source formats optimized for big data processing:

  • Apache Parquet: (Most common) Columnar storage, best for analytics and querying specific columns.

  • Apache Avro: Row-based storage, best for heavy write loads or streaming.

  • Apache ORC: Optimized Row Columnar, often used in Hive environments.

Key Characteristics of Data Files

  • Immutable (Write-Once): This is a critical database concept. Once a data file is written to storage (like S3 or HDFS), it is never opened and modified.

    • If you update a row: Iceberg does not edit the file. Instead, it writes a new data file containing the updated row (Copy-On-Write) or writes a special "Delete File" (Merge-On-Read) to flag the old record as removed.

  • Partition Aware: A single data file never spans across multiple partitions. If your table is partitioned by Month, a file will strictly contain data for only "January" or only "February". This allows the query engine to ignore huge chunks of files that don't match a query's time range (Partition Pruning).

  • Invisible by Default: Just because a data file exists in your storage bucket doesn't mean it is part of the table. It is only "visible" to a query if it is actively tracked by a Manifest File in the current Snapshot. This allows for "safe" writes: you can upload 1,000 data files, but nobody sees them until you commit the metadata update that points to them.

Apache Iceberg is deliberately flexible about file formats, for three reasons:

  • Organizational Reality: It accommodates legacy decisions and diverse teams that may have standardized on different formats over time.

  • Workload Optimization: Users can match the format to the specific task. For example, Parquet is typically used for large-scale analytical (OLAP) queries, while Avro is often preferred for low-latency streaming and write-heavy tasks.

  • Future-Proofing: Iceberg is architected to easily adopt new, superior file formats that may emerge in the future.
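This flexibility is configured per table. A minimal sketch, assuming a Spark SQL engine; the catalog, schema, and table names are hypothetical, and write.format.default is the table property that controls the format of newly written data files:

    -- Analytical table: keep the Parquet default
    CREATE TABLE my_catalog.analytics.page_views (
        user_id  BIGINT,
        page_url STRING,
        ts       TIMESTAMP
    )
    USING iceberg;

    -- Write-heavy streaming table: switch new data files to Avro
    ALTER TABLE my_catalog.analytics.click_stream
    SET TBLPROPERTIES ('write.format.default' = 'avro');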

The Dominance of Apache Parquet Despite This Flexibility

Parquet is the industry standard for Iceberg tables. Its popularity stems from its columnar structure, which unlocks massive performance gains for analytical workloads.

  • Parallelism: Files can be split into multiple parts, allowing engines to read them in parallel.

  • Efficiency: Columnar storage allows for superior compression (smaller storage footprint) and higher read throughput.

  • Pruning: Parquet stores statistics (like min/max values) that allow query engines to "skip" chunks of data that don't match the query filter.

The "Hidden" Data Files: Delete Files

In Iceberg v2 tables, there is a special type of data file called a Delete File.

  • Role: Instead of rewriting massive Parquet files just to remove one row, Iceberg writes a small "Delete File" that simply lists which rows in the existing data files should be ignored.

As previously discussed, data files in object storage are immutable; they cannot be edited in place. When a row needs to be deleted or updated, Iceberg must write a new file. There are two strategies for this:

  1. Copy-On-Write (COW): The system rewrites the entire data file (minus the deleted row) into a new data file. This means that Copy-On-Write (COW) does not use delete files.

  2. Merge-On-Read (MOR): The system keeps the original data file exactly as it is. Instead, it writes a small Delete File that lists which records should be ignored.

Delete Files are the core mechanism of the Merge-On-Read (MOR) strategy. When a query engine reads the table, it reads the original data plus the delete files, merging them on the fly to hide the deleted rows.

Note: Delete files are only supported in Iceberg v2 format tables.
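The write strategy is configurable per table and per operation type. A minimal sketch, assuming Spark SQL and a hypothetical user_logs table (the same name is used in the scenario below); format-version and the write.*.mode properties are standard Iceberg table properties:

    -- Require the v2 format (needed for delete files) and switch
    -- row-level operations to merge-on-read
    ALTER TABLE my_catalog.analytics.user_logs
    SET TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    );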

Types of Delete Files

There are two distinct ways Iceberg identifies which rows to remove: Positional and Equality delete files.

Let's consider a scenario to demonstrate how they work.

The Scenario: User Activity Logs

Imagine you manage a table called user_logs that tracks actions on a website.

  • Schema: user_id, event_type, timestamp, page_url.

  • Context: You have millions of rows stored in Parquet files.

1. Positional Delete Files (The "Map" Approach)

  • Concept: You identify the exact physical coordinates (File Path + Row Number) of the row you want to remove. This usually happens after a system has already scanned the data and found exactly which rows are "bad."

  • Example: Your ingestion job accidentally wrote a duplicate log entry. You run a deduplication job that finds that the duplicate is located in file_log_v1.parquet at row index 5,092.

  • The Delete File: Iceberg writes a file containing a specific pointer: path: "s3://.../file_log_v1.parquet", position: 5092.

  • Read Process: The engine opens file_log_v1.parquet. It reads rows 0 to 5,091 normally. It sees the "tombstone" marker for row 5,092 and skips it entirely. It continues reading from 5,093.


This method is slower to write (the engine must read the data to find the index) but fast to read (the engine simply skips that index).

2. Equality Delete Files (The "Match" Approach)

  • Concept: You define a rule (a filter), and the engine deletes anything that matches that rule. You don't know (or care) where the data physically lives; you just want that specific data gone.

  • Example: A user with user_id: "user_555" requests to have their data deleted (GDPR "Right to be Forgotten").

  • The Delete File: Instead of rewriting terabytes of log files, Iceberg writes a small metadata file that says: "Ignore any row where user_id is equal to 'user_555'."

  • Read Process: When an analyst queries the logs, the engine reads the data files and filters out every record belonging to user_555 on the fly.


This method is fast to write (no need to read data first) but slower to read (the engine checks every row against the rule).
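The GDPR request above would normally be issued as an ordinary row-level delete; whether the engine records it as positional or equality delete files depends on the writer (streaming writers such as Flink commonly produce equality deletes). A sketch, assuming Spark SQL, merge-on-read, and the hypothetical user_logs table from the scenario:

    -- Row-level delete; under merge-on-read the original data files are
    -- left untouched and delete files are written instead
    DELETE FROM my_catalog.analytics.user_logs
    WHERE user_id = 'user_555';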

Summary Comparison

| Feature | Positional Deletes | Equality Deletes |
| --- | --- | --- |
| Identifies Row By | Physical location (File Path + Row Index) | Data value (e.g., id = 1234) |
| Write Cost | Higher: must read the data first to find the exact row number. | Lower: just writes the value (e.g., "Delete ID 1234") without reading current data. |
| Read Cost | Lower: engine knows exactly which bits to skip. | Higher: engine must compare every row against the delete values. |


Metadata Layer: Architecture & File Formats

The Metadata Layer acts as the "brain" of an Apache Iceberg table. It is a hierarchical tree structure that tracks every data file, delete file, and operational history. Crucially, these files are colocated with your data in object storage, enabling features like Time Travel and Schema Evolution without needing a central database.

Below is the hierarchy from the root down to the leaves, including the rationale for the file formats chosen at each level.

1. Metadata File (.json)

  • Format: JSON (e.g., 00001-metadata.json)

  • Why JSON? The root metadata is stored in JSON to be human-readable and easily parseable by any client or system. It serves as the "entry point" to the table, and using a text-based format makes debugging and external tooling significantly easier.

  • Role: This file defines the table's "identity." It stores the schema, partition definitions, and a list of all Snapshots (versions of the table).

  • Atomic Updates:

    • Every time you write to the table, a new Metadata JSON is created.

    • The catalog updates a pointer to this new file in a single atomic action.

    • Benefit: This creates a linear, immutable history. Writers don't block readers, and if multiple engines write simultaneously, the "winner" is determined by which one successfully swaps this pointer.

2. Manifest List (.avro)

  • Format: Apache Avro

  • Why Avro? As tables grow, they can generate thousands of manifest files. Avro is used here because it is a compact, binary format. It allows the query engine to efficiently scan the list and perform fast filtering of entire manifests based on partition ranges without parsing bulky text.

  • Role: The Snapshot Index. This file represents a single Snapshot (point in time). It lists all the Manifest Files that make up the table at that specific moment.

  • Key Data: It stores partition-level stats (e.g., "This manifest only has data for partition=2024-01").

  • Benefit: The query engine reads this list first, which enables Partition Pruning. If a user queries "January data," the engine skips any Manifest that tracks "February data" without even opening it.

3. Manifest Files (.avro)

  • Format: Apache Avro

  • Why Avro? Similar to the Manifest List, Avro is chosen for its row-oriented binary structure. It allows the engine to efficiently scan file statistics (like column bounds) to decide which specific data files to retrieve, minimizing I/O overhead.

  • Role: The Detailed Inventory. These are the leaves of the metadata tree. They contain the list of actual file paths (Data and Delete files).

  • Separation of Duties: While Manifests track both file types, they never mix them. A single manifest file will contain only Data Files or only Delete Files.

  • The "Hive" Problem vs. Iceberg Solution:

    • Legacy (Hive): Statistics were expensive. You wrote the data, then had to run a separate, heavy read job to calculate stats. Because it was slow, stats were often stale.

    • Iceberg: Statistics are "Lightweight." The engine calculates stats (min/max/counts) in memory while it writes the data file and saves them immediately to the Manifest.

    • Result: Iceberg stats are always up-to-date, allowing the engine to prune unnecessary files efficiently.
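These per-file statistics are directly queryable. A sketch, assuming Spark SQL and the hypothetical user_logs table from earlier; the files metadata table exposes what the manifests track for each data and delete file:

    -- One row per tracked file, including the stats used for pruning
    SELECT file_path, file_format, record_count, lower_bounds, upper_bounds
    FROM my_catalog.analytics.user_logs.files;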

4. Puffin Files (.puffin)

  • Format: Binary (Puffin)

  • Why Puffin? Standard Manifest stats (min/max) are great for filtering but insufficient for complex aggregations like "Count Distinct." Puffin files are designed to hold "Blobs" of arbitrary binary data, specifically statistical sketches.

  • Role: The Advanced Stats Sidecar.

  • Usage: Currently, Iceberg supports Theta Sketches (from the Apache DataSketches library).

  • Benefit: It provides fast, approximate answers (e.g., "99.8% accurate count of unique users") for dashboarding and heavy aggregations, avoiding the need to load millions of raw rows into memory.

Visualizing the Metadata Architecture

This diagram illustrates the full hierarchy, highlighting the specific file formats (.json vs .avro) and the separation of Data vs. Delete manifests.


The Catalog Layer: The Entry Point

If the Metadata Layer is the "brain" and the Data Layer is the "body," the Catalog Layer is the "Address Book."

Anyone reading from a table, whether a single user or a massive distributed engine, needs to know where to start. With thousands of files in storage (metadata JSON files, Avro manifests, Parquet data files), finding the current state of the table is impossible without a map. The Catalog serves as this central repository.

The Core Requirement: Atomic Swaps

The primary technical requirement for any system serving as an Iceberg Catalog is that it must support atomic operations for updating the current metadata pointer.

  • Why is this non-negotiable? In a data lake, you might have multiple writers (e.g., a Spark streaming job and an Airflow batch job) trying to update the table simultaneously.

  • The Mechanism: The Catalog stores a reference (a pointer) to the current metadata file (e.g., v2.metadata.json). When a write operation finishes, it asks the Catalog to swap that pointer to the new metadata file (v3.metadata.json).

  • The Guarantee: This swap happens in a single, atomic step. All readers and writers see the same state of the table at any given instant. You never end up with a "half-updated" table.

Catalog Backends & Pointer Storage

Because the only real requirement is storing a pointer and providing atomic guarantees, many different backends can serve as a Catalog. However, how they store this pointer varies:

  • Hadoop Catalog (File-System Based):

    • Pointer Mechanism: It relies on the file system itself. It writes a file called version-hint.text in the table's metadata folder.

    • Content: This file contains the version number of the current metadata file.

    • Note: This is used when a file system or object store (such as HDFS or Amazon S3) serves directly as the catalog, with no separate catalog service.

  • Hive Metastore:

    • Pointer Mechanism: It uses the existing Hive table entry.

    • Content: It stores the full path to the current metadata file in a specific table property (often metadata_location).

  • Project Nessie:

    • Pointer Mechanism: Nessie is designed for "Git-like" data versioning.

    • Content: The table entry has a property called metadataLocation that tracks the current file.

  • AWS Glue Catalog:

    • Pointer Mechanism: Similar to Hive, it stores the metadata pointer in the table definition properties within the Glue service.

Inspecting the Catalog (SQL Example)

You can query the catalog to see exactly where it is pointing. Whether you use AWS Glue, Hive, or another catalog, you can inspect the table's metadata tables to see the history of these pointers.

Example Query:

Let's imagine we are inspecting a customer_transactions table.
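A sketch of such a query, assuming Spark SQL and a hypothetical my_catalog.finance.customer_transactions table; the metadata_log_entries metadata table lists every metadata file the table has pointed to over time:

    -- Most recent metadata pointer first
    SELECT timestamp, file, latest_snapshot_id
    FROM my_catalog.finance.customer_transactions.metadata_log_entries
    ORDER BY timestamp DESC
    LIMIT 1;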

Output Analysis:

The result gives you the precise location the engine will read next.

| Timestamp | Metadata File | Latest Snapshot ID |
| --- | --- | --- |
| 2025-12-22 14:30:00 | s3://finance-bucket/transactions/metadata/00045-8a9b...metadata.json | 9876543210... |

What this tells us:

The Catalog is currently pointing to file version 00045.... Even if file 00046... exists in the bucket (perhaps from a failed write or an ongoing transaction), the engine will ignore it because the Catalog hasn't "blessed" it yet.


How Apache Iceberg performs read/write operations

Part 1: Writing Queries in Apache Iceberg

Writing to an Iceberg table isn't just dumping files into a folder. It triggers a specific chain of events in the metadata layer to ensure ACID compliance.

1. Create the Table

This initializes the table structure. Unlike a standard file system operation, this creates the first layer of the metadata tree.

  • SQL Example: see the sketch after this list.

  • What happens behind the scenes:

    1. The engine validates the schema.

    2. It creates the first Metadata File (v1.metadata.json) containing the schema and partition spec (days(order_date)).

    3. It registers the table in the Catalog (e.g., Glue, Hive), pointing it to v1.metadata.json.
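A minimal sketch of the CREATE statement described above, assuming a Spark SQL engine; the catalog, schema, and column names are hypothetical, and only the days(order_date) partition spec comes from the text:

    CREATE TABLE my_catalog.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_date  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_date));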

2. The Insert Query

The INSERT operation adds new data. Iceberg ensures these new files are invisible until the write is fully complete.

  • SQL Example: see the sketch after this list.

  • The Lifecycle of an Insert:

    1. Write Data: The engine writes the raw data into a new Parquet file (e.g., data/order_1001.parquet) in object storage.

    2. Write Manifest: The engine writes a Manifest File listing that Parquet file and its stats (min/max values).

    3. Commit: The engine creates a new Metadata File (v2.metadata.json) that includes this new manifest.

    4. Atomic Swap: The Catalog updates the pointer from v1 to v2. Only now is the data visible to readers.
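A sketch of such an insert into the hypothetical orders table above; the values (including order 1001) are illustrative:

    INSERT INTO my_catalog.sales.orders
    VALUES (1001, 42, 99.90, TIMESTAMP '2025-12-26 10:15:00');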

3. Merge Query (The "Upsert")

The MERGE INTO statement is powerful for Data Lakehouses. It handles updates, deletes, and inserts in a single command, and is typically used for syncing Change Data Capture (CDC) streams.

  • SQL Example: see the sketch after this list.

  • What happens behind the scenes:

    • In Copy-On-Write (COW): Iceberg identifies files with matching order_ids, rewrites those specific files with the updated values, and creates a new snapshot.

    • In Merge-On-Read (MOR): Iceberg keeps the old data files but writes a Delete File (tombstoning the old rows) and a new Data File (with the new values).
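A sketch of a CDC-style upsert into the hypothetical orders table, assuming Spark SQL and an order_updates source with the same schema (a staged view or table produced by the CDC pipeline; both names are assumptions). A WHEN MATCHED ... THEN DELETE clause can be added to handle deletions:

    MERGE INTO my_catalog.sales.orders AS t
    USING order_updates AS s
        ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *      -- existing order changed
    WHEN NOT MATCHED THEN INSERT *;     -- brand-new order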


Part 2: Reading Queries in Apache Iceberg

Reading is efficient because the engine uses the metadata to skip ("prune") irrelevant data before it ever touches S3 or HDFS.

1. The SELECT Query

Standard analytical queries automatically leverage the metadata layer.

  • SQL Example: see the sketch after this list.

  • The Lifecycle of a Select:

    1. Catalog Lookup: The engine asks the catalog for the current metadata location.

    2. Snapshot Read: It reads the current Metadata JSON to find the active Manifest List.

    3. Partition Pruning: It checks the Manifest List. If a manifest covers 2025-11, it is ignored. It only opens manifests for 2025-12.

    4. File Pruning: It looks inside the remaining Manifests. It checks column stats. If a file's min_date is 2025-12-25, it is included.

    5. Data Read: The engine reads only the specific Parquet files identified in step 4.
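A sketch of a query that would exercise this pruning path, assuming the hypothetical orders table and a filter on late December 2025:

    -- Only manifests and files overlapping the date range are read
    SELECT order_id, amount
    FROM my_catalog.sales.orders
    WHERE order_date >= TIMESTAMP '2025-12-20'
      AND order_date <  TIMESTAMP '2026-01-01';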

2. The Time-Travel Query

This allows you to query the table as it existed in the past, leveraging the Snapshot history stored in the Metadata JSON.

  • SQL Example (By Timestamp): see the sketch after this list.

  • SQL Example (By Snapshot ID): see the sketch after this list.

  • How it works:

    Instead of reading the current snapshot defined in the metadata file, the engine looks up either the requested snapshot ID or the most recent snapshot at or before the requested timestamp in the snapshots array of the Metadata JSON, and reads that version of the table.
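Sketches of both variants, assuming a Spark 3.3+ SQL engine and the hypothetical orders table; the timestamp and snapshot ID are illustrative:

    -- By timestamp
    SELECT * FROM my_catalog.sales.orders
    TIMESTAMP AS OF '2025-12-22 14:30:00';

    -- By snapshot ID
    SELECT * FROM my_catalog.sales.orders
    VERSION AS OF 9876543210;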


Visualizing the Flow

Here is a diagram summarizing the Insert Lifecycle discussed above, showing how the components we learned (Catalog, Metadata, Data) interact during a write.



Performance optimization

