Features and benefits
Improved Performance
Eliminating File Listings: In legacy formats (like Hive), determining which files to read required listing directories in the file system (e.g., S3 ls operations). This is slow and expensive (O(N) cost) when dealing with thousands of files. Iceberg avoids this entirely: it reads the Manifest List and Manifest Files to get the exact file paths. This makes planning a query on a petabyte-scale table nearly as fast as planning a query on a small table.
Data Pruning: The metadata stores detailed statistics (min/max values) for every column in every data file. This allows the query engine to skip (prune) entire files that don't contain data relevant to the query, drastically reducing I/O.
Partition Evolution
Logical Updates, No Rewrites: In traditional systems, changing a partition scheme (e.g., switching from "partition by Month" to "partition by Day") required a massive migration: reading the entire table and rewriting it into a new directory structure.
How it works: Iceberg handles partitioning logically in the metadata. You can update the partition spec at any time. Old data remains partitioned by "Month," and new data is partitioned by "Day." The query engine automatically handles this split complexity, so the user never sees the difference.
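As a sketch of what this looks like in practice, here is how a partition spec might be changed with Spark SQL and the Iceberg SQL extensions enabled (the catalog, table, and column names are illustrative):

```sql
-- Assumes a table originally created with PARTITIONED BY (month(order_date)).
-- Dropping and adding partition fields is a metadata-only change; no data is rewritten.
ALTER TABLE catalog.db.orders DROP PARTITION FIELD month(order_date);
ALTER TABLE catalog.db.orders ADD PARTITION FIELD day(order_date);
-- Old files stay partitioned by month; new writes use the daily spec.
```

Queries filtering on order_date continue to work unchanged across both layouts.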
Hidden Partitioning
The "Hive" Problem: In Hive, users had to know the physical directory structure to get performance. If a table was partitioned by order_date_month, a user querying WHERE order_date = '2023-01-01' would get a full table scan unless they explicitly added AND order_date_month = '2023-01'.
The Iceberg Solution: Iceberg records the relationship between the column (order_date) and the partition transformation (month(order_date)). Users simply write WHERE order_date = '2023-01-01', and Iceberg automatically calculates the partition value to prune the correct files. The physical layout is "hidden" from the user.
The Old Way (Hive Style)
The Problem: You must physically create a separate column for the partition (e.g., order_month) and the user must remember to use it.
-- DDL: You have to manually create a separate column for partitioning
CREATE TABLE hive_orders (
id BIGINT,
amount DECIMAL(10, 2),
order_date TIMESTAMP,
order_month STRING -- <--- Extra column purely for partitioning
)
PARTITIONED BY (order_month);
-- THE "BAD" QUERY (Full Table Scan)
-- The engine scans every folder because it doesn't know 'order_date' is related to the partition folders.
SELECT * FROM hive_orders
WHERE order_date >= '2023-01-05 00:00:00';
-- THE "REQUIRED" QUERY (Fast but tedious)
-- The user must manually calculate and add the partition filter.
SELECT * FROM hive_orders
WHERE order_date >= '2023-01-05 00:00:00'
AND order_month = '2023-01'; -- <--- User burden
The Iceberg Solution (Hidden Partitioning)
The Solution: You define the partition strategy once in the table definition. The user queries the data naturally, and Iceberg handles the math.
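A sketch of the Iceberg equivalent in Spark SQL (catalog and table names are illustrative):

```sql
-- DDL: the partition is a transform of an existing column, not an extra column.
CREATE TABLE catalog.db.iceberg_orders (
    id BIGINT,
    amount DECIMAL(10, 2),
    order_date TIMESTAMP
)
USING iceberg
PARTITIONED BY (month(order_date));

-- THE NATURAL QUERY (Fast)
-- Iceberg maps the order_date filter to the month partition and prunes files.
SELECT * FROM catalog.db.iceberg_orders
WHERE order_date >= '2023-01-05 00:00:00';
```

The user never references a partition column; the transform is resolved from the table metadata.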
Compatibility Mode (Object Store Layout)
The S3 Throttling Issue: Standard file systems often write files using sequential prefixes (e.g., s3://bucket/data/2023/01/01/...). On object storage like S3, writing too many files to the same prefix can trigger throttling limits, slowing down ingestion.
The Solution: Iceberg supports an "Object Store" layout strategy. It hashes the beginning of file paths (e.g., s3://bucket/data/a1b2-2023...) to distribute files evenly across the storage system's shards. This maximizes write throughput on cloud storage. Tools like Dremio often handle these optimizations automatically.
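Enabling this layout is a table-property change. A minimal sketch in Spark SQL, assuming an existing table and a placeholder bucket path:

```sql
-- 'write.object-storage.enabled' turns on the hashed file layout.
ALTER TABLE catalog.db.orders SET TBLPROPERTIES (
    'write.object-storage.enabled' = 'true'
);
```

New data files are then written under hashed prefixes while queries remain unchanged.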
Explanation
The Problem: The "Sequential" Traffic Jam
On your laptop, if you create a folder called 2023/01/01/ and put 1,000 files in it, your hard drive doesn't care.
In Amazon S3, however, the "folder path" (called a Prefix) determines which physical server handles the traffic. S3 assigns servers based on the alphabetical order of the file paths.
The Scenario: You are writing data for 2023/01/01.
The Paths:
s3://bucket/data/2023/01/01/file_1.parquet
s3://bucket/data/2023/01/01/file_2.parquet
...
s3://bucket/data/2023/01/01/file_1000.parquet
The Issue: Because all these files start with the exact same text (2023/01/01), S3 sends all the write traffic to the same single physical partition. That partition gets overwhelmed by the traffic spike and starts rejecting requests (Throttling / 503 Errors).
To fix this, we need to trick S3 into thinking these files are completely unrelated so it sends them to different servers. We do this by adding a random "Hash" (a jumble of characters) to the start of the filename.
Iceberg "Object Store Layout" Strategy: Instead of writing to 2023/01/01/..., Iceberg calculates a hash and writes the file to:
s3://bucket/data/a1b2-2023/01/01/file_1.parquet (Goes to Server A)
s3://bucket/data/9z8x-2023/01/01/file_2.parquet (Goes to Server B)
s3://bucket/data/4f5g-2023/01/01/file_3.parquet (Goes to Server C)
By adding that random prefix (a1b2, 9z8x), the files are alphabetically far apart. S3 sees them as different "folders" and spreads the write load across massive numbers of servers in parallel.
Why Iceberg is Special (Hidden Complexity)
If you did this manually, your data would be a mess. You would never be able to find "January data" because it would be scattered across thousands of random folders like a1b2-... and 9z8x-....
The Iceberg Magic:
Physical Layout: The files are scattered everywhere with random names to keep S3 happy and fast.
Logical View: The Iceberg Metadata keeps a neat list. When you query
WHERE date = '2023-01-01', Iceberg looks at its manifest, finds the specific scattered files, and brings them back to you. You (the user) never see the messy hashes; you just see a clean, fast table.
Time Travel
Query History: Because Iceberg maintains a log of Snapshots, you can query the table exactly as it existed at any point in the past.
Syntax: Users can query using timestamps or snapshot IDs (e.g., SELECT * FROM orders FOR SYSTEM_TIME AS OF '2023-01-01 10:00:00').
Use Cases: This is critical for reproducing machine learning models (training on the exact same dataset used last month), auditing data changes, or debugging bad data ingestion.
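In Spark SQL, the equivalent time-travel clauses look like this (the timestamp and snapshot ID are illustrative):

```sql
-- Query the table as it existed at a point in time.
SELECT * FROM catalog.db.orders TIMESTAMP AS OF '2023-01-01 10:00:00';

-- Query a specific snapshot by its ID.
SELECT * FROM catalog.db.orders VERSION AS OF 10963874102873;
```

Either clause reads the historical snapshot's manifest, so no data is copied or restored.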
Table Versioning & Rollback
Instant Recovery: If a bad ETL job accidentally corrupts the data or deletes the wrong rows, you don't need to restore from a backup tape.
Snapshot Rollback: You can instantly "rollback" the table state to the previous snapshot. This is a metadata-only operation (pointing the catalog back to the previous JSON file), so it takes seconds regardless of table size.
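A rollback can be sketched with Iceberg's Spark stored procedures (the catalog, table, snapshot ID, and timestamp are illustrative):

```sql
-- Point the table back at a known-good snapshot.
CALL catalog.system.rollback_to_snapshot('db.orders', 10963874102873);

-- Or roll back to the state as of a given time.
CALL catalog.system.rollback_to_timestamp('db.orders', TIMESTAMP '2023-01-01 10:00:00');
```

Both calls only move the table's current-snapshot pointer, which is why they complete in seconds.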
Multi-Cloud Support
Storage Agnostic: Iceberg separates "Compute" (Spark, Dremio, Trino) from "Storage."
Portability: You can store your Iceberg files in AWS S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), or MinIO. Because the format is open standard, you can move data between clouds without proprietary lock-in.
Compression and Serialization
Best of Both Worlds: Iceberg utilizes the most efficient formats for specific tasks.
Metadata (Avro): Uses Avro for Manifest lists because it is row-oriented and compact, allowing for fast serialization of file statistics.
Data (Parquet/ORC): Uses Parquet (default) for data files to leverage columnar compression (Snappy, Zstd, Gzip). This ensures data takes up minimum space and maximizes scan speed for analytics.
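The data-file codec is controlled by a table property. A minimal sketch, assuming an existing table:

```sql
-- Switch new Parquet data files to Zstandard compression.
ALTER TABLE catalog.db.orders SET TBLPROPERTIES (
    'write.parquet.compression-codec' = 'zstd'
);
```

Existing files keep their original codec; the setting applies to files written after the change.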