What are Data Lakehouse and table formats?


Modern table formats


The Evolution of Data Architecture: From Warehouse to Lakehouse

Warmup: Before diving into the technical specifics of Iceberg, Delta, or Hudi, we need to understand the architectural pressure cooker that created them. This is the story of how we moved from expensive proprietary boxes to open swamps, and finally, to the modern Data Lakehouse.


Data Warehouse

The Data Warehouse (which is often built on top of an OLAP database) serves as the centralized repository for an organization's historical data. It is designed to ingest large volumes of information from operational systems, logs, and application databases to create a "single source of truth" for business intelligence.

Architectural Philosophy: The "Walled Garden"

Unlike the Data Lake, the traditional Data Warehouse is defined by tight coupling. It is a monolithic system where the vendor controls every layer of the stack.

  • Proprietary Storage: Data is stored in the vendor’s own optimized file and table formats. You cannot simply open these files with an external tool; they are effectively a "black box."

  • Exclusive Compute: Because the storage format is proprietary, the data can only be accessed using the warehouse's own compute engine.

  • Managed Optimization: The upside of this closed system is that the warehouse manages the data layout for you, handling indexing and optimization automatically to ensure high performance.

Diagram: Technical components of a "traditional" Data Warehouse

Evolution: From On-Prem to the Cloud

1. The On-Premises Era (Pre-2015)

Early data warehouses were physical appliances where storage and compute were strictly bound together on the same nodes.

  • The Scaling Problem: If your storage needs grew, you had to buy more hardware, which inevitably meant paying for more compute power—even if you didn't need it. This lack of independent scaling made these systems rigid and expensive.

2. The Cloud Era (Post-2015)

The rise of cloud-native computing introduced a major shift. Modern cloud data warehouses finally allowed for the separation of resources, meaning users could scale storage and compute independently. You could now store petabytes of data cheaply and spin up compute clusters only when queries needed to run. However, despite this infrastructure flexibility, the data remained locked inside the vendor's proprietary formats.

The ML Gap

A major limitation highlighted in modern workflows is the warehouse's inability to support Advanced Analytics and Machine Learning (ML).

  • Structured Only: Warehouses struggle with unstructured and semi-structured data (images, text, JSON), which are critical for modern ML models.

  • The Export Problem: Since the warehouse engine isn't built for ML training, engineers are forced to export data out to other platforms. This creates data duplication and fragile pipelines, leading to "data drift" and model decay when the exported data falls out of sync with the source.

Trade-offs: The Data Warehouse Pros & Cons

| Pros (The Benefits) | Cons (The Limitations) |
| --- | --- |
| High Performance: Because the system controls the file layout and indexing, it offers extremely fast query speeds on structured historical data. | Vendor Lock-in: The "closed architecture" means your data is trapped in proprietary formats. Migrating away or using external tools is difficult or impossible without exporting the data first. |
| Data Governance: It enforces strict schemas and security policies, ensuring data quality. Data written to the warehouse is validated, consistent, and reliable. | High Cost: Both storage and compute come at a premium. As workloads increase, costs can spiral quickly compared to object storage. |
| Single Source of Truth: It consolidates data from disparate sources, making it the gold standard for reporting and Business Intelligence (BI). | Rigid Workloads: It is strictly designed for relational (SQL) tasks. It cannot natively handle unstructured data or ML workflows effectively. |


Data Lake

The Genesis of the Data Lake

The Data Lake emerged as a direct response to the limitations of traditional Data Warehouses. As data volumes exploded, warehouses became difficult to scale due to the tight coupling of storage and compute (you couldn't buy one without the other). Furthermore, warehouses were expensive and restricted to structured data, making them unsuitable for raw logs, sensor data, or messy inputs.

The Data Lake was designed to solve this by offering a low-cost repository capable of storing vast amounts of data in its native format, regardless of structure.

Evolution: From HDFS to the Cloud

The history of the Data Lake can be viewed in two distinct eras:

1. The Hadoop Era

Initially, Data Lakes were built on Hadoop using the HDFS file system across clusters of commodity hardware.

  • Processing: Analytics were run using MapReduce (Java-based), which was complex and verbose.

  • The SQL Bridge: To make this accessible to analysts, Hive was introduced. It translated SQL into MapReduce jobs and introduced the Hive Table Format, which allowed directories of files to be recognized logically as "tables."

2. The Cloud Era

The industry eventually shifted toward Cloud Object Storage (like S3, Azure Blob, GCS) due to ease of management and cost efficiency.

  • The Engine Shift: MapReduce was replaced by faster distributed query engines like Apache Spark, Presto, and Dremio.

  • The Bottleneck: While the storage medium changed, the Hive Table Format remained the standard. However, because Hive was designed for file systems, not object storage, it struggled with the latency of cloud network calls, leading to performance inefficiencies when listing files.

Architectural Philosophy: Decoupling

The defining characteristic of a Data Lake is the decoupling of storage and compute. Unlike a warehouse, where a single engine manages everything:

  • Storage is passive: There is no active "storage engine" optimizing data layout in the background. Optimization (like rewriting partitions) is a manual or ad-hoc process.

  • Compute is flexible: You can bring different engines to the same data. You might use Spark for heavy batch processing and Dremio for ad-hoc queries on the exact same files.
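
To make the decoupling concrete, here is a minimal sketch of two independent engines reading the same Parquet files. It assumes the pyarrow and duckdb Python packages are installed; the directory path and column names are invented for illustration, and any Parquet-capable engine (Spark, Presto, Dremio, ...) could play either role.

```python
# A minimal sketch of "bring different engines to the same data".
# Assumes pyarrow and duckdb are installed; paths/columns are illustrative.
import os
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import duckdb

# Write plain Parquet files into a directory -- the "data lake".
os.makedirs("lake/orders", exist_ok=True)
table = pa.table({"user_id": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})
pq.write_table(table, "lake/orders/part-0001.parquet")

# Engine #1: PyArrow reads the files directly.
print(ds.dataset("lake/orders", format="parquet").to_table().num_rows)

# Engine #2: DuckDB runs SQL over the very same files.
print(duckdb.sql("SELECT sum(amount) FROM 'lake/orders/*.parquet'").fetchall())
```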


Trade-offs: The Data Lake Pros & Cons

| Pros (The Benefits) | Cons (The Limitations) |
| --- | --- |
| Cost Efficiency: Significantly cheaper storage and compute compared to warehouses. Ideal for storing massive datasets where the immediate ROI isn't clear. | Performance Penalties: Because components are decoupled, you lose the "tightly integrated" optimizations of a warehouse (like native indexing). Achieving comparable speed requires heavy engineering. |
| Open Standards: Data is stored in open formats (e.g., Parquet, Avro, JSON). This prevents vendor lock-in and allows any tool to read the data. | Complexity (The "Glue" Problem): You have to manually wire together storage, file formats, and compute engines. This requires a significant amount of configuration and specialized Data Engineering talent. |
| Unstructured Support: Native ability to handle non-relational data, including images, logs, sensor streams, and unstructured text. | Lack of ACID Guarantees: Traditional lakes lack atomic transactions. They typically rely on "schema-on-read," meaning bad data is caught during processing rather than ingestion. This makes them risky for financial or compliance-heavy workloads. |

The Dynamic Shift: We had cheap storage and flexible compute, but we lost the reliability, speed, and safety of the Warehouse. We needed a way to get Warehouse behavior on top of Lake architecture.


Hive: the original Table Format

More about Hive Table Format

Hive Table Format: The SQL-on-Hadoop Revolution

Before Hive (2009), analyzing data in Hadoop required writing complex MapReduce jobs in Java. This excluded most data analysts. Facebook created Hive to solve this by allowing users to write SQL, which the framework would then convert into MapReduce jobs.

To make SQL work, the system needed a way to understand "what is a table" on a distributed file system. This led to the creation of the Hive Table Format and the Hive Metastore.

Architecture: The "Directory is the Table"

The defining characteristic of the Hive table format is that it defines a table as a directory.

  • The Metastore: A database that tracks the location of the table (e.g., /db1/table1).

  • The Layout: Any file dropped into that directory (or its subdirectories) is instantly considered part of the table.

  • Partitions: Subdirectories act as partitions (e.g., k1=A), allowing engines to skip entire folders during queries.
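
As a toy illustration of the "directory is the table" model, the sketch below builds a Hive-style partitioned layout on the local file system and plans a scan by listing files. The paths, partition values, and file names are made up; a real deployment would sit on HDFS or object storage with the Hive Metastore tracking only the table's root location.

```python
# A toy sketch of the Hive model: "the table is a directory".
# Paths and partition values are illustrative, not from a real Metastore.
import os

# The Metastore would only record this root location for the table.
TABLE_ROOT = "warehouse/db1/table1"

# Data files live in partition subdirectories such as k1=A/.
for part, fname in [("k1=A", "file1.parquet"), ("k1=A", "file2.parquet"),
                    ("k1=B", "file3.parquet")]:
    os.makedirs(os.path.join(TABLE_ROOT, part), exist_ok=True)
    open(os.path.join(TABLE_ROOT, part, fname), "w").close()

def plan_scan(table_root, wanted_partition=None):
    """List every file an engine would have to read -- the expensive step
    on object storage. Partition pruning = skipping whole subdirectories."""
    files = []
    for part in os.listdir(table_root):
        if wanted_partition and part != wanted_partition:
            continue  # prune: never even list this folder
        part_dir = os.path.join(table_root, part)
        files += [os.path.join(part_dir, f) for f in os.listdir(part_dir)]
    return files

print(plan_scan(TABLE_ROOT))           # full scan: all 3 files
print(plan_scan(TABLE_ROOT, "k1=A"))   # pruned scan: only 2 files
```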

The Trade-Offs

| The Benefits (Why it worked) | The Limitations (Why it failed at scale) |
| --- | --- |
| SQL Access: It democratized Big Data by enabling SQL on Hadoop. | Slow "List" Operations: To run a query, the engine must list all files in the directory. On cloud storage (S3), this is extremely slow for large tables. |
| Partitioning: It introduced partition pruning (skipping folders), which was much faster than full table scans. | No ACID: There is no "commit" mechanism. If a job crashes while writing files, users see partial/corrupt data. |
| Format Agnostic: It allowed users to swap file formats (e.g., moving from CSV to Parquet) without changing the table definition. | Inefficient Updates: You cannot update a single file. You must rewrite the entire partition to change one row (atomic swap at partition level only). |

The Modern Shift: From Directories to Files

The root cause of Hive's limitations was its directory-based definition. Modern table formats (Iceberg, Delta Lake, Hudi) fixed this by changing the fundamental definition of a table. Instead of saying "The table is everything in this folder," they say "The table is this specific canonical list of files." This granular control unlocks ACID transactions, Time Travel, and safe concurrent writes.
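
The sketch below is a deliberately simplified model of that idea, not the real metadata layout of Iceberg, Delta Lake, or Hudi: the table is just a pointer to a snapshot file, and each snapshot is a canonical list of data files. Committing means writing a new snapshot and atomically swapping the pointer, so readers only ever see complete versions. All file names here are placeholders.

```python
# A toy model of "the table is a canonical list of files".
# Not any real format's metadata layout -- just the core mechanism.
import json
import os

METADATA_DIR = "warehouse/orders/_metadata"
POINTER = os.path.join(METADATA_DIR, "current.json")  # "what is live right now"
os.makedirs(METADATA_DIR, exist_ok=True)

def commit(new_files, version):
    """Write a new snapshot listing the table's files, then atomically
    swap the pointer. Readers never observe a half-written table."""
    snapshot_path = os.path.join(METADATA_DIR, f"v{version}.json")
    with open(snapshot_path, "w") as f:
        json.dump({"version": version, "files": new_files}, f)
    tmp = POINTER + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"current": snapshot_path}, f)
    os.replace(tmp, POINTER)  # atomic rename = the "commit"

def current_files():
    with open(POINTER) as f:
        snapshot_path = json.load(f)["current"]
    with open(snapshot_path) as f:
        return json.load(f)["files"]

commit(["data/f1.parquet"], version=1)
commit(["data/f1.parquet", "data/f2.parquet"], version=2)
print(current_files())  # readers always see a complete snapshot
```

Because old snapshot files (v1.json, v2.json, ...) are kept around, reading "yesterday's table" is just a matter of loading an older snapshot instead of the current pointer.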

Diagram: Hive Metastore & Partition Structure

This diagram visualizes how the Hive Metastore points to directories, and how those directories contain the actual data files.


Data Lakehouse

The Data Lakehouse is an architectural paradigm born from the desire to "thread the needle" between the high performance and governance of a Data Warehouse and the low cost, flexibility, and openness of a Data Lake.

Conceptually, a Lakehouse is a Data Lake that has been upgraded with warehouse-like capabilities. It retains the decoupled structure of a lake—using low-cost cloud object storage and open file formats (like Parquet)—but introduces a new layer that enables ACID transactions, schema enforcement, and high-speed indexing.

The Secret Sauce: The Table Format

The key innovation that makes a Lakehouse possible is the introduction of modern Table Formats.

A Table Format is an abstraction layer that sits between the compute engine and the raw data files. Its sole job is to authoritatively answer a simple but critical question: "Which specific files on storage currently belong to this table?"

In traditional systems (like Hive-based Data Lakes), this was managed loosely by directory listings—if a file was in the folder, it was considered part of the table. Modern open table formats (such as Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon) replace this directory-based approach by maintaining a rich metadata layer alongside the data files.

Fundamentally, this redefines a "table" from being a dumb folder of files to becoming a canonical list of files managed by intelligent metadata. This metadata allows compute engines to interact with the dataset "intelligently," enabling features that were previously impossible on a data lake.

By tracking individual files rather than just folders, these formats enable specific technical primitives:

  • ACID Transactions: New data isn't "live" until the metadata says so. This allows writers to commit data atomically, ensuring readers never see partial or corrupt data.

  • Time Travel: Because the metadata tracks the list of files over time, it creates a history of snapshots. You can query the table as it existed yesterday simply by telling the engine to read the metadata file from yesterday.

  • High Performance (Pruning): The metadata includes file-level statistics (like min/max values for columns). A query engine can look at this metadata and determine, "I can skip 90% of these files because they don't contain the data the user asked for," drastically speeding up queries without touching the raw data.
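
As an example of the pruning idea, here is a toy sketch of file skipping driven purely by per-file min/max statistics. The file names and statistics are invented; real table formats keep these values in their manifest/metadata files so the engine can prune without opening any data.

```python
# A toy illustration of metadata-based file skipping.
# The statistics below are invented; real formats store per-file
# min/max values in their manifests.
file_stats = [
    {"path": "data/f1.parquet", "order_date": {"min": "2024-01-01", "max": "2024-01-31"}},
    {"path": "data/f2.parquet", "order_date": {"min": "2024-02-01", "max": "2024-02-29"}},
    {"path": "data/f3.parquet", "order_date": {"min": "2024-03-01", "max": "2024-03-31"}},
]

def files_for_range(stats, lo, hi, column="order_date"):
    """Keep only files whose [min, max] range can overlap the predicate.
    Everything else is skipped without ever being opened."""
    return [s["path"] for s in stats
            if not (s[column]["max"] < lo or s[column]["min"] > hi)]

# WHERE order_date BETWEEN '2024-02-10' AND '2024-02-20'
print(files_for_range(file_stats, "2024-02-10", "2024-02-20"))
# -> ['data/f2.parquet']  (two of the three files are pruned)
```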

More on Value Propositions of the Lakehouse

Beyond the technical mechanics, this architecture delivers significant business and operational value:

| Benefit | Description |
| --- | --- |
| Fewer Copies, Less Drift | By bringing warehouse capabilities (like ACID updates) to the lake, you no longer need to ETL data into a separate warehouse for processing. This creates a single source of truth, reducing storage costs, compute costs, and the "governance nightmare" of out-of-sync data copies. |
| Faster Queries | The metadata layer allows compute engines to plan queries more efficiently. Techniques like file skipping (pruning) and advanced planning happen at the table format level, speeding up insights significantly compared to raw data lake queries. |
| Time Travel & Safety | Modern table formats work by creating immutable snapshots of data. This allows for "time travel"—querying the table as it existed at a specific point in the past—and provides an easy "undo" button for accidental bad writes, eliminating costly backfill operations. |
| Open Architecture | The entire stack is built on open standards (e.g., Apache Iceberg for tables, Apache Parquet for files). This prevents vendor lock-in, ensuring your data remains accessible to a wide ecosystem of tools, not just a single proprietary engine. |

The Data Lakehouse Diagram

This diagram is designed to contrast directly with the previous two.

  • Like the Warehouse, it shows a structured flow where tools (including the Catalog) work together to provide governance and management.

  • Like the Lake, it remains open, with pluggable compute and storage layers.

  • The Key Difference: The Table Format is no longer a passive layer; it is the central, active "brain" that manages state, transactions, and history across the open components.

  • In an open Lakehouse, the "Storage Engine" logic is embedded inside the Compute Engine's library. When Spark runs an Iceberg job to compact files or expire snapshots, Spark acts as the storage engine.



Can the Data Lakehouse fully replace the Data Warehouse in terms of the functionality it provides?

The short answer is: Yes, functionally.

The long answer is: Functionally yes, but operationally no.

In 2024/2025, the Data Lakehouse has achieved functional parity with the Data Warehouse for about 90-95% of use cases. There is almost no SQL query, security policy, or transaction type you can do in a Warehouse that you cannot also do in a Lakehouse.

However, "functionality" isn't the only factor. The decision often comes down to complexity and latency.

The "Yes": Functional Parity

If you look at the core requirements of a Data Warehouse, modern Lakehouses (via Iceberg/Delta) now check every box:

| Warehouse Requirement | Old Data Lake (Hive) | Modern Lakehouse (Iceberg/Delta) |
| --- | --- | --- |
| ACID Transactions | ❌ No | ✅ Yes (Safe concurrent writes) |
| Mutations | ❌ Hard (Rewrite everything) | ✅ Yes (UPDATE, MERGE, DELETE) |
| Schema Enforcement | ❌ No (Schema on Read) | ✅ Yes (Reject bad data on write) |
| Time Travel | ❌ No | ✅ Yes (SELECT * AS OF ...) |
| Performance | ❌ Slow (Scan everything) | ✅ Fast (Z-Order, Partition Pruning, Min/Max skip) |
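
As a hedged illustration of this parity, the sketch below runs a row-level MERGE and a time-travel query through Spark SQL against an Apache Iceberg table. It assumes a SparkSession that has already been configured with the Iceberg runtime and a catalog named `demo` (e.g., via spark-defaults), plus an existing staging view called `updates`; the table and column names are purely illustrative.

```python
# A hedged sketch of warehouse-style operations on a lakehouse table.
# Assumes the Iceberg runtime and a catalog named `demo` are already
# configured on the session; `updates` is an existing source view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-parity").getOrCreate()

# Row-level mutation (MERGE) directly on open files in object storage.
spark.sql("""
    MERGE INTO demo.db.orders t
    USING updates s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: query the table as it existed at an earlier point in time.
spark.sql("""
    SELECT * FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```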

The "No": Where the Warehouse Still Wins

While the Lakehouse can do these things, the Data Warehouse (like Snowflake native storage or BigQuery) often does them more easily or faster in extreme scenarios.

  • Sub-Second Latency: If you need to serve a dashboard to 1,000 concurrent users with <200ms response time, a high-performance Warehouse (or a specialized real-time OLAP engine like ClickHouse) will still beat a Lakehouse. Lakehouses typically have a "cold start" overhead (listing files, reading metadata) that makes sub-second latency harder to guarantee.

  • "Out of the Box" Simplicity: A Data Warehouse is a vertically integrated appliance. You load data, you query it. It auto-optimizes.

    • In a Lakehouse, you are often responsible for maintenance: "Did I run compaction?" "Are my file sizes too small?" "Did I expire my old snapshots?" (Note: Managed services like Databricks or Snowflake Managed Iceberg are solving this, but it's still a factor).
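
As a rough sketch of what those maintenance chores look like in practice, the example below calls Apache Iceberg's Spark maintenance procedures for compaction and snapshot expiration. It assumes a SparkSession configured with the Iceberg runtime and a catalog named `demo`; the table name and timestamp are placeholders.

```python
# A hedged sketch of routine lakehouse maintenance via Iceberg's Spark
# procedures. Assumes the Iceberg runtime and a catalog named `demo` are
# already configured; table name and timestamp are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-maintenance").getOrCreate()

# "Did I run compaction?" -- rewrite many small files into fewer large ones.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.orders')")

# "Did I expire my old snapshots?" -- drop snapshot history (and the files
# only it references) older than the given timestamp.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```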

The Feature Convergence

The market is converging. Data Warehouses are becoming Lakehouses (Snowflake supporting Iceberg), and Lakehouses are becoming Warehouses (Databricks supporting Serverless SQL).

Summary Comparison

| Feature | Data Warehouse | Data Lakehouse | Can Lakehouse Replace? |
| --- | --- | --- | --- |
| Standard BI / Reporting | Excellent | Excellent | YES |
| Complex SQL Joins | Excellent | Excellent | YES |
| Data Governance (RBAC) | Mature | Mature (Unity/Polaris) | YES |
| Unstructured Data (AI) | Poor | Excellent | N/A (Lakehouse Wins) |
| Ad-Hoc Data Science | Poor | Excellent | N/A (Lakehouse Wins) |
| Low-Latency (<1s) APIs | Excellent | Good (Improving) | MAYBE (Depends on SLA) |
| Management Effort | Low (SaaS) | Medium (Configurable) | NO (Requires more skills) |

The Verdict for Data Engineers

You can absolutely replace a Data Warehouse with a Lakehouse today, and many companies (like Netflix, Uber, and Airbnb) run entirely on Lakehouse architectures.

You should replace your Warehouse if:

  • You have massive data volumes where Warehouse storage costs are prohibitive.

  • You have a strong Data Engineering team capable of tuning file layouts (compaction/sorting).

  • Your users are technical (Data Scientists/Engineers) and need direct file access.

You should keep a Warehouse if:

  • Your team is small and just wants "SQL that works" without managing file maintenance.

  • Your primary use case is serving low-latency dashboards to external customers.

