Apache Iceberg
Best for batch workloads
On-demand videos about Apache Iceberg
What is a Data Lakehouse and What is a Table Format?
The Architecture of Apache Iceberg, Apache Hudi and Delta Lake
The Read and Write Process for Apache Iceberg Tables
Understanding Apache Iceberg's Partitioning Features
Optimizing Apache Iceberg Tables
Streaming with Apache Iceberg
The Role of Apache Iceberg Catalogs
Versioning with Apache Iceberg
Ingesting Data into Apache Iceberg with Apache Spark
Ingesting Data into Apache Iceberg with Dremio
Apache Iceberg's distinctive architecture goes beyond addressing Hive's limitations, enabling a wide range of advanced capabilities for data lakes and lakehouse environments. This overview introduces Iceberg's core features.
Apache Iceberg Architecture
Apache Iceberg uses a three-layer metadata architecture that separates the data files from their organization and management:
Diagram of Apache Iceberg architecture
Data Layer: Data and Delete files
Composition: Contains the raw records stored in open file formats (Parquet, ORC, or Avro) alongside Delete Files. While format-agnostic, Apache Parquet is the industry standard due to its superior columnar compression and query performance.
Function: Serves as the physical repository for the information query engines retrieve to satisfy user requests.
Immutability: Files are write-once (immutable). Once written to storage (S3, ADLS, GCS, or HDFS), they are never modified in place; updates or deletes result in new files being created.
Logical Management: Data is managed logically via metadata rather than physical directory structures. Operations like partitioning or re-organizing data do not require expensive physical file movements, only metadata updates.
Data Files
What are they?
Data files contain the actual content of your table: the rows of logs, transaction history, or customer details you are analyzing. Unlike the metadata files (which use Avro and JSON), data files use standard open-source formats optimized for big data processing:
Apache Parquet: (Most common) Columnar storage, best for analytics and querying specific columns.
Apache Avro: Row-based storage, best for heavy write loads or streaming.
Apache ORC: Optimized Row Columnar, often used in Hive environments.
Key Characteristics of Data Files
Immutable (Write-Once): This is a critical database concept. Once a data file is written to storage (like S3 or HDFS), it is never opened and modified.
If you update a row: Iceberg does not edit the file. Instead, it writes a new data file containing the updated row (Copy-On-Write) or writes a special "Delete File" (Merge-On-Read) to flag the old record as removed.
Partition Aware: A single data file never spans across multiple partitions. If your table is partitioned by Month, a file will strictly contain data for only "January" or only "February". This allows the query engine to ignore huge chunks of files that don't match a query's time range (Partition Pruning).
Invisible by Default: Just because a data file exists in your storage bucket doesn't mean it is part of the table. It is only "visible" to a query if it is actively tracked by a Manifest File in the current Snapshot. This allows for "safe" writes: you can upload 1,000 data files, but nobody sees them until you commit the metadata update that points to them.
Apache Iceberg is deliberately flexible about file formats, for three reasons:
Organizational Reality: It accommodates legacy decisions and diverse teams that may have standardized on different formats over time.
Workload Optimization: Users can match the format to the specific task. For example, Parquet is typically used for large-scale analytical (OLAP) queries, while Avro is often preferred for low-latency streaming and write-heavy tasks.
Future-Proofing: Iceberg is architected to easily adopt new, superior file formats that may emerge in the future.
The Dominance of Apache Parquet
Despite this flexibility, Parquet is the industry standard for Iceberg tables. Its popularity stems from its columnar structure, which unlocks massive performance gains for analytical workloads.
Parallelism: Files can be split into multiple parts, allowing engines to read them in parallel.
Efficiency: Columnar storage allows for superior compression (smaller storage footprint) and higher read throughput.
Pruning: Parquet stores statistics (like min/max values) that allow query engines to "skip" chunks of data that don't match the query filter.
The "Hidden" Data Files: Delete Files
In Iceberg v2 tables, there is a special type of data file called a Delete File.
Role: Instead of rewriting massive Parquet files just to remove one row, Iceberg writes a small "Delete File" that simply lists which rows in the existing data files should be ignored.
As previously discussed, data files in object storage are immutableβthey cannot be edited in place. When a row needs to be deleted or updated, Iceberg must write a new file. There are two strategies for this:
Copy-On-Write (COW): The system rewrites the entire data file (minus the deleted row) into a new data file. This means that Copy-On-Write (COW) does not use delete files.
Merge-On-Read (MOR): The system keeps the original data file exactly as it is. Instead, it writes a small Delete File that lists which records should be ignored.
Delete Files are the core mechanism of the Merge-On-Read (MOR) strategy. When a query engine reads the table, it reads the original data plus the delete files, merging them on the fly to hide the deleted rows.
Note: Delete files are only supported in Iceberg v2 format tables.
Types of Delete Files
There are two distinct ways Iceberg identifies which rows to remove: Positional and Equality delete files.
Let's consider a scenario to demonstrate how they work.
The Scenario: User Activity Logs
Imagine you manage a table called user_logs that tracks actions on a website.
Schema:
user_id, event_type, timestamp, page_url
Context: You have millions of rows stored in Parquet files.
1. Positional Delete Files (The "Map" Approach)
Concept: You identify the exact physical coordinates (File Path + Row Number) of the row you want to remove. This usually happens after a system has already scanned the data and found exactly which rows are "bad."
New Example: Your ingestion job accidentally wrote a duplicate log entry. You run a deduplication job that finds the exact duplicate is located in file_log_v1.parquet at row index 5,092.
The Delete File: Iceberg writes a file containing a specific pointer: path: "s3://.../file_log_v1.parquet", position: 5092.
Read Process: The engine opens file_log_v1.parquet. It reads rows 0 to 5,091 normally. It sees the "tombstone" marker for row 5,092 and skips it entirely. It continues reading from 5,093.
This method: Slower to write (must read data to find the index), but fast to read (engine just skips that index).
2. Equality Delete Files (The "Match" Approach)
Concept: You define a rule (a filter), and the engine deletes anything that matches that rule. You don't know (or care) where the data physically lives; you just want that specific data gone.
New Example: A user with user_id: "user_555" requests to have their data deleted (GDPR "Right to be Forgotten").
The Delete File: Instead of rewriting terabytes of log files, Iceberg writes a small metadata file that says: "Ignore any row where user_id is equal to 'user_555'."
Read Process: When an analyst queries the logs, the engine reads the data files and filters out every record belonging to user_555 on the fly.
This method: Fast to write (no need to read data first), but slower to read (the engine checks every row against the rule). A SQL sketch of such a delete follows the comparison table below.
Summary Comparison
| Feature | Positional Deletes | Equality Deletes |
| --- | --- | --- |
| Identifies Row By | Physical location (File Path + Row Index) | Data Value (id = 1234) |
| Write Cost | Higher: must read the data first to find the exact row number. | Lower: just writes the value (e.g., "Delete ID 1234") without reading current data. |
| Read Cost | Lower: engine knows exactly which bits to skip. | Higher: engine must compare every row against the delete values. |
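To make the scenario concrete, here is a minimal Spark SQL sketch of the GDPR-style delete against the user_logs table. The catalog and namespace names are placeholders, and whether the engine ends up writing positional or equality delete files depends on the engine and its configuration:

```sql
-- Ask Iceberg to handle deletes with delete files rather than
-- rewriting whole data files (table property, Iceberg v2 tables)
ALTER TABLE my_catalog.db.user_logs
SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');

-- "Right to be Forgotten": the engine records which rows to ignore
-- instead of rewriting terabytes of Parquet
DELETE FROM my_catalog.db.user_logs
WHERE user_id = 'user_555';
```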
Metadata Layer: Architecture & File Formats
The Metadata Layer acts as the "brain" of an Apache Iceberg table. It is a hierarchical tree structure that tracks every data file, delete file, and operational history. Crucially, these files are colocated with your data in object storage, enabling features like Time Travel and Schema Evolution without needing a central database.
Below is the hierarchy from the root down to the leaves, including the rationale for the file formats chosen at each level.
1. Metadata File (.json)
Format: JSON (e.g., 00001-metadata.json)
Why JSON? The root metadata is stored in JSON to be human-readable and easily parseable by any client or system. It serves as the "entry point" to the table, and using a text-based format makes debugging and external tooling significantly easier.
Role: This file defines the table's "identity." It stores the schema, partition definitions, and a list of all Snapshots (versions of the table).
Atomic Updates:
Every time you write to the table, a new Metadata JSON is created.
The catalog updates a pointer to this new file in a single atomic action.
Benefit: This creates a linear, immutable history. Writers don't block readers, and if multiple engines write simultaneously, the "winner" is determined by which one successfully swaps this pointer.
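As an illustration, this commit history can be inspected through Iceberg's history metadata table. A minimal sketch, assuming Spark SQL and a placeholder table name:

```sql
-- Each committed write shows up as one row; parent_id links snapshots
-- into the linear history described above.
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM my_catalog.db.events.history   -- placeholder table name
ORDER BY made_current_at;
```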
2. Manifest List (.avro)
Format: Apache Avro
Why Avro? As tables grow, they can generate thousands of manifest files. Avro is used here because it is a compact, binary format. It allows the query engine to efficiently scan the list and perform fast filtering of entire manifests based on partition ranges without parsing bulky text.
Role: The Snapshot Index. This file represents a single Snapshot (point in time). It lists all the Manifest Files that make up the table at that specific moment.
Key Data: It stores partition-level stats (e.g., "This manifest only has data for partition=2024-01").
Benefit: The query engine reads this list first, enabling Partition Pruning. If a user queries "January data," the engine skips any Manifest that tracks "February data" without even opening it.
3. Manifest Files (.avro)
Format: Apache Avro
Why Avro? Similar to the Manifest List, Avro is chosen for its row-oriented binary structure. It allows the engine to efficiently scan file statistics (like column bounds) to decide which specific data files to retrieve, minimizing I/O overhead.
Role: The Detailed Inventory. These are the leaves of the metadata tree. They contain the list of actual file paths (Data and Delete files).
Separation of Duties: While Manifests track both file types, they never mix them. A single manifest file will contain only Data Files or only Delete Files.
The "Hive" Problem vs. Iceberg Solution:
Legacy (Hive): Statistics were expensive. You wrote the data, then had to run a separate, heavy read job to calculate stats. Because it was slow, stats were often stale.
Iceberg: Statistics are "Lightweight." The engine calculates stats (min/max/counts) in memory while it writes the data file and saves them immediately to the Manifest.
Result: Iceberg stats are always up-to-date, allowing the engine to prune unnecessary files efficiently.
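As an illustration, the statistics captured in the manifests can be surfaced through Iceberg's files metadata table. A minimal sketch, assuming Spark SQL and a placeholder table name:

```sql
-- Per-data-file statistics recorded at write time; lower_bounds and
-- upper_bounds hold the min/max values used for file pruning.
SELECT file_path, record_count, lower_bounds, upper_bounds
FROM my_catalog.db.events.files;    -- placeholder table name
```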
4. Puffin Files (.puffin)
Format: Binary (Puffin)
Why Puffin? Standard Manifest stats (min/max) are great for filtering but insufficient for complex aggregations like "Count Distinct." Puffin files are designed to hold "Blobs" of arbitrary binary data, specifically statistical sketches.
Role: The Advanced Stats Sidecar.
Usage: Currently, Iceberg supports Theta Sketches (from the Apache DataSketches library).
Benefit: It provides fast, approximate answers (e.g., "99.8% accurate count of unique users") for dashboarding and heavy aggregations, avoiding the need to load millions of raw rows into memory.
Visualizing the Metadata Architecture
This diagram illustrates the full hierarchy, highlighting the specific file formats (.json vs .avro) and the separation of Data vs. Delete manifests.
The Catalog Layer: The Entry Point
If the Metadata Layer is the "brain" and the Data Layer is the "body," the Catalog Layer is the "Address Book."
Anyone reading from a table, whether a single user or a massive distributed engine, needs to know where to start. With thousands of files in storage (metadata JSONs, Avros, Parquets), finding the current state of the table is impossible without a map. The Catalog serves as this central repository.
The Core Requirement: Atomic Swaps
The primary technical requirement for any system serving as an Iceberg Catalog is that it must support atomic operations for updating the current metadata pointer.
Why is this non-negotiable? In a data lake, you might have multiple writers (e.g., a Spark streaming job and an Airflow batch job) trying to update the table simultaneously.
The Mechanism: The Catalog stores a reference (a pointer) to the current metadata file (e.g., v2.metadata.json). When a write operation finishes, it asks the Catalog to swap that pointer to the new metadata file (v3.metadata.json).
The Guarantee: This swap happens in a single, atomic step. All readers and writers see the same state of the table at any given instant. You never end up with a "half-updated" table.
Catalog Backends & Pointer Storage
Because the only real requirement is storing a pointer and providing atomic guarantees, many different backends can serve as a Catalog. However, how they store this pointer varies:
Hadoop Catalog (File-System Based):
Pointer Mechanism: It relies on the file system itself. It writes a file called version-hint.text in the table's metadata folder.
Content: This file contains the version number of the current metadata file.
Note: This is common when using Amazon S3 or HDFS directly as the catalog.
Hive Metastore:
Pointer Mechanism: It uses the existing Hive table entry.
Content: It stores the full path to the current metadata file in a specific table property (often metadata_location).
Project Nessie:
Pointer Mechanism: Nessie is designed for "Git-like" data versioning.
Content: The table entry has a property called metadataLocation that tracks the current file.
AWS Glue Catalog:
Pointer Mechanism: Similar to Hive, it stores the metadata pointer in the table definition properties within the Glue service.
Inspecting the Catalog (SQL Example)
You can query the catalog to see exactly where it is pointing. If you are using a catalog like AWS Glue, you can inspect the system tables to see the history of these pointers.
Example Query:
Let's imagine we are inspecting a customer_transactions table.
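A minimal sketch of such an inspection, assuming Spark SQL with access to Iceberg's metadata tables; the catalog and namespace names are placeholders:

```sql
-- Each entry records a metadata file the catalog has pointed to,
-- newest first; the current pointer is the top row.
SELECT timestamp, file, latest_snapshot_id
FROM my_catalog.finance.customer_transactions.metadata_log_entries
ORDER BY timestamp DESC;
```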
Output Analysis:
The result gives you the precise location the engine will read next.
| Timestamp | Metadata File | Latest Snapshot ID |
| --- | --- | --- |
| 2025-12-22 14:30:00 | s3://finance-bucket/transactions/metadata/00045-8a9b...metadata.json | 9876543210... |
What this tells us:
The Catalog is currently pointing to file version 00045.... Even if file 00046... exists in the bucket (perhaps from a failed write or an ongoing transaction), the engine will ignore it because the Catalog hasn't "blessed" it yet.
How Apache Iceberg performs read/write operations

Part 1: Writing Queries in Apache Iceberg
Writing to an Iceberg table isn't just dumping files into a folder. It triggers a specific chain of events in the metadata layer to ensure ACID compliance.
1. Create the Table
This initializes the table structure. Unlike a standard file system operation, this creates the first layer of the metadata tree.
SQL Example:
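A minimal sketch, assuming Spark SQL with an Iceberg catalog; the catalog and namespace names and the customer_id/amount columns are illustrative, while days(order_date) matches the partition spec described below:

```sql
-- Hypothetical orders table with hidden partitioning by day
CREATE TABLE my_catalog.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,          -- illustrative column
    amount      DECIMAL(10, 2),  -- illustrative column
    order_date  TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(order_date));
```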
What happens behind the scenes:
The engine validates the schema.
It creates the first Metadata File (v1.metadata.json) containing the schema and partition spec (days(order_date)).
It registers the table in the Catalog (e.g., Glue, Hive), pointing it to v1.metadata.json.
2. The Insert Query
The INSERT operation adds new data. Iceberg ensures these new files are invisible until the write is fully complete.
SQL Example:
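A minimal sketch, reusing the hypothetical orders table from the CREATE example above:

```sql
-- Insert one order; the new Parquet file stays invisible to readers
-- until the metadata commit succeeds.
INSERT INTO my_catalog.sales.orders
VALUES (1001, 42, 99.95, TIMESTAMP '2025-12-26 10:15:00');
```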
The Lifecycle of an Insert:
Write Data: The engine writes the raw data into a new Parquet file (e.g., data/order_1001.parquet) in object storage.
Write Manifest: The engine writes a Manifest File listing that Parquet file and its stats (min/max values).
Commit: The engine creates a new Metadata File (v2.metadata.json) that includes this new manifest.
Atomic Swap: The Catalog updates the pointer from v1 to v2. Only now is the data visible to readers.
3. Merge Query (The "Upsert")
The MERGE INTO statement is powerful for Data Lakehouses. It handles updates, deletes, and inserts in a single command, typically used for syncing Change Data Capture (CDC) streams.
SQL Example:
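A minimal sketch, assuming a hypothetical staging table of CDC changes (my_catalog.staging.order_updates, with an is_deleted flag) and the same hypothetical orders table as above:

```sql
-- Apply a batch of CDC changes: delete, update, or insert per order_id
MERGE INTO my_catalog.sales.orders AS t
USING my_catalog.staging.order_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.is_deleted = true THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount, t.order_date = s.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, order_date)
  VALUES (s.order_id, s.customer_id, s.amount, s.order_date);
```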
What happens behind the scenes:
In Copy-On-Write (COW): Iceberg identifies files with matching order_ids, rewrites those specific files with the updated values, and creates a new snapshot.
In Merge-On-Read (MOR): Iceberg keeps the old data files but writes a Delete File (tombstoning the old rows) and a new Data File (with the new values).
Part 2: Reading Queries in Apache Iceberg
Reading is efficient because the engine uses the metadata to skip ("prune") irrelevant data before it ever touches S3 or HDFS.
1. The SELECT Query
Standard analytical queries automatically leverage the metadata layer.
SQL Example:
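A minimal sketch against the hypothetical orders table; filtering on the partition column is what allows the pruning described below:

```sql
-- Only manifests and files covering December 2025 are read
SELECT customer_id, SUM(amount) AS total_spend
FROM my_catalog.sales.orders
WHERE order_date >= TIMESTAMP '2025-12-01'
  AND order_date <  TIMESTAMP '2026-01-01'
GROUP BY customer_id;
```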
The Lifecycle of a Select:
Catalog Lookup: The engine asks the catalog for the current metadata location.
Snapshot Read: It reads the current Metadata JSON to find the active Manifest List.
Partition Pruning: It checks the Manifest List. If a manifest covers 2025-11, it is ignored. It only opens manifests for 2025-12.
File Pruning: It looks inside the remaining Manifests and checks column stats. If a file's min_date is 2025-12-25, it is included.
Data Read: The engine reads only the specific Parquet files identified in step 4.
2. The Time-Travel Query
This allows you to query the table as it existed in the past, leveraging the Snapshot history stored in the Metadata JSON.
SQL Example (By Timestamp):
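A minimal sketch, assuming Spark SQL (3.3+) with Iceberg; the exact time-travel syntax differs between engines:

```sql
-- Read the table as it existed at a point in time
SELECT *
FROM my_catalog.sales.orders
TIMESTAMP AS OF '2025-12-22 14:30:00';
```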
SQL Example (By Snapshot ID):
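A minimal sketch with a made-up snapshot ID:

```sql
-- Read a specific snapshot directly by its ID
SELECT *
FROM my_catalog.sales.orders
VERSION AS OF 1234567890123456789;  -- hypothetical snapshot ID
```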
How it works:
Instead of reading the current snapshot defined in the metadata file, the engine looks up the snapshot that was current as of the requested timestamp (or the requested snapshot ID) in the snapshots list of the Metadata JSON and reads that version of the table.
Visualizing the Flow
Here is a diagram summarizing the Insert Lifecycle discussed above, showing how the components we learned (Catalog, Metadata, Data) interact during a write.