Data Compaction



I'm assuming you already have a background in LSM-Trees and storage formats, so let's frame Compaction specifically as a maintenance and optimization process. In a world where we "never overwrite" (like in LSM-Trees, Delta Lake, or Iceberg), Compaction is the only thing preventing the system from eventually collapsing under its own weight.


Data Compaction

Compaction is a background maintenance process that merges multiple data files into a smaller number of larger, more organized files. While Compression reduces the size of a single file, Compaction manages the structural health of the entire dataset.

The Why: Why do we need Compaction?

In modern data systems (especially those using LSM-Trees or Cloud Object Storage), we follow an immutable pattern: we append new data rather than updating existing data in place. This leads to three major problems:

  • The "Small File Problem": Thousands of tiny files (common in streaming ingestion) overwhelm the Metadata layer and increase the overhead of opening/closing files during a query.

  • Space Amplification: Multiple versions of the same record exist across different files. The old versions are "dead" but still taking up disk space.

  • Read Amplification: To find a single record, the engine has to check many different files (fragments) to ensure it has the latest version.
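To make read amplification concrete, here is a minimal sketch of a point lookup that has to probe every fragment, newest first, before it can trust the answer. The fragment layout and key names are invented for illustration; this is plain Python, not any engine's actual API.

```python
# Hypothetical illustration of read amplification before compaction:
# a point lookup must probe every fragment (newest first) until it finds the key.

def lookup(key, fragments):
    """fragments: list of dicts ordered newest -> oldest."""
    probes = 0
    for fragment in fragments:       # each probe may mean a separate file open / S3 GET
        probes += 1
        if key in fragment:
            return fragment[key], probes
    return None, probes

# Ten small fragments from streaming ingestion; the key only lives in the oldest one.
fragments = [{} for _ in range(9)] + [{"user_42": "v1"}]
value, probes = lookup("user_42", fragments)
print(value, probes)   # 'v1' after 10 probes; compacted into one file, it would be 1 probe
```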


Compaction in LSM-Trees (Database Level)

In databases like Cassandra, RocksDB, or Bigtable, compaction is the "heartbeat" of the storage engine.

  • Merging: It takes several sorted runs (SSTables) and merges them into one larger sorted run. Because the inputs are already sorted, this is essentially the merge step of a Merge Sort: an efficient streaming k-way merge (sketched after this list).

  • Tombstone Removal: It identifies records marked for deletion (Tombstones). If the compaction process reaches the "oldest" level, it can safely delete these records forever to reclaim space.

  • Leveled vs. Size-Tiered:

    • Size-Tiered: Merges files of similar size. Good for write-heavy workloads.

    • Leveled: Organizes files into distinct "levels." Level 1 is 10x larger than Level 0, etc. This is better for read-heavy workloads because it limits the number of files a read has to check.
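Because the input runs are already sorted, merging and tombstone removal can be expressed as a single streaming pass. The sketch below is a simplified illustration, assuming each run is a list of `(key, value)` pairs and runs are ordered newest to oldest; `TOMBSTONE`, `compact`, and the run layout are invented for the example, not any engine's real format.

```python
import heapq

TOMBSTONE = object()   # hypothetical deletion marker, not a real engine's format

def compact(runs, is_bottom_level=False):
    """runs: list of sorted (key, value) lists, ordered newest -> oldest.
    Returns one sorted run containing only the latest version of each key."""
    # Tag every entry with its run's age so the newest version of a key wins ties.
    tagged = [[(key, age, value) for key, value in run] for age, run in enumerate(runs)]
    merged, last_key = [], None
    for key, age, value in heapq.merge(*tagged):     # k-way merge of already-sorted inputs
        if key == last_key:
            continue                                 # an older version of a key we already kept
        last_key = key
        if value is TOMBSTONE and is_bottom_level:
            continue                                 # safe to purge the delete at the oldest level
        merged.append((key, value))
    return merged

new_run = [("a", "a2"), ("c", TOMBSTONE)]            # newer SSTable
old_run = [("a", "a1"), ("b", "b1"), ("c", "c1")]    # older SSTable
print(compact([new_run, old_run], is_bottom_level=True))
# [('a', 'a2'), ('b', 'b1')]  -> latest 'a' kept, deleted 'c' purged, dead versions reclaimed
```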

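The trade-off between the two strategies comes down to when a merge is triggered. The thresholds below are illustrative placeholders (loosely inspired by common size-tiered defaults and a 10x level fan-out), not real configuration values.

```python
def size_tiered_candidates(file_sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
    """Size-tiered: group files of similar size; a bucket with enough members
    becomes a compaction candidate (write-friendly, but reads may touch many files)."""
    buckets = []
    for size in sorted(file_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]

def leveled_needs_compaction(level_bytes, level, base_bytes=10 * 1024**2, fanout=10):
    """Leveled: each level may hold roughly base * fanout**level bytes; exceeding
    the target pushes data into the next level (read-friendly, but costs extra writes)."""
    return level_bytes > base_bytes * fanout ** level

print(size_tiered_candidates([1, 1, 1, 1, 64, 64]))       # [[1, 1, 1, 1]]
print(leveled_needs_compaction(200 * 1024**2, level=1))   # True (limit ~100 MB)
```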

Compaction in Data Lakes (Engineering Level)

In a Data Lake (using formats like Parquet in Apache Iceberg or Delta Lake), compaction is an "Asynchronous" task that you, the Data Engineer, often have to trigger or configure.

  • Bin-Packing: The system takes thousands of tiny (e.g., 1MB) Parquet files and combines them into files in the 128MB-512MB range, a commonly recommended target size for HDFS block alignment and S3 scan throughput (see the sketch after this list).

  • Sorting/Clustering: During compaction, the data can be re-sorted by a column that is frequently used in filters (like timestamp or user_id). This makes Predicate Pushdown significantly more effective.

  • Metadata Cleanup: Compaction allows the system to remove old entries from the manifest files, making the "metadata scan" phase of a query much faster.
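A bin-packing pass itself is conceptually simple: group small files until a group approaches the target size, then rewrite each group as one larger file (optionally re-sorting rows by the clustering column along the way). The planner below is a hypothetical sketch, not the actual rewrite job shipped by Iceberg or Delta Lake.

```python
TARGET_BYTES = 128 * 1024 * 1024   # a common target file size for lake tables

def plan_bins(files, target=TARGET_BYTES):
    """files: list of (path, size_bytes) tuples.
    Returns groups of files whose combined size approaches the target;
    each group becomes one rewrite task producing one compacted file."""
    bins, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 300 hypothetical 1MB Parquet files produced by streaming ingestion.
small_files = [(f"part-{i:05d}.parquet", 1 * 1024 * 1024) for i in range(300)]
plan = plan_bins(small_files)
print(len(small_files), "input files ->", len(plan), "compacted files")   # 300 -> 3
```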


Compaction vs. Compression

| Feature | Data Compression | Data Compaction |
| --- | --- | --- |
| Input | Raw bits/values. | Multiple files/fragments. |
| Output | An encoded version of the same data. | Fewer, larger, cleaned-up files. |
| Handles | Data redundancy (e.g., repeating strings). | Logical redundancy (e.g., old versions/deletes). |
| Benefit | Saves storage cost and I/O bandwidth. | Saves metadata overhead and reduces read latency. |

Engineering Takeaway

"Compression is a feature of the File Format (Parquet/Avro). Compaction is a feature of the Storage Engine/Table Format (LSM-Trees/Iceberg)."

