Data Compaction
Assuming you already have a background in LSM-Trees and storage formats, let's frame Compaction specifically as a maintenance and optimization process. In a world where we "never overwrite" (as in LSM-Trees, Delta Lake, or Iceberg), Compaction is the only thing preventing the system from eventually collapsing under its own weight.
Compaction is a background maintenance process that merges multiple data files into a smaller number of larger, more organized files. While Compression reduces the size of a single file, Compaction manages the structural health of the entire dataset.
The Why: Why do we need Compaction?
In modern data systems (especially those using LSM-Trees or Cloud Object Storage), we follow an immutable pattern: we append new data rather than updating existing data in place. This leads to three major problems:
The "Small File Problem": Thousands of tiny files (common in streaming ingestion) overwhelm the Metadata layer and increase the overhead of opening/closing files during a query.
Space Amplification: Multiple versions of the same record exist across different files. The old versions are "dead" but still taking up disk space.
Read Amplification: To find a single record, the engine has to check many different files (fragments) to ensure it has the latest version.
Compaction in LSM-Trees (Database Level)
In databases like Cassandra, RocksDB, or BigTable, compaction is the "heartbeat" of the storage engine.
Merging: It takes several sorted runs (SSTables) and merges them into one larger sorted run. Because the inputs are already sorted, this is a very efficient Merge Sort operation (see the sketch after this list).
Tombstone Removal: It identifies records marked for deletion (Tombstones). If the compaction process reaches the "oldest" level, it can safely delete these records forever to reclaim space.
Leveled vs. Size-Tiered:
Size-Tiered: Merges files of similar size. Good for write-heavy workloads.
Leveled: Organizes files into distinct "levels," where each level is roughly 10x larger than the one above it (Level 1 is 10x larger than Level 0, and so on). This is better for read-heavy workloads because it limits the number of files a read has to check.
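To make the merge step concrete, here is a minimal Python sketch of compacting sorted runs with last-write-wins semantics and tombstone removal. The `compact` function, the `TOMBSTONE` sentinel, and the in-memory lists are illustrative assumptions, not how Cassandra or RocksDB are actually implemented; real engines stream SSTable blocks from disk.

```python
import heapq

TOMBSTONE = None  # illustrative sentinel marking a deleted key (assumes real values are never None)

def compact(runs, bottom_level=False):
    """Merge sorted runs (newest first) into one sorted run.

    Each run is a list of (key, value) pairs sorted by key. Newer runs
    shadow older ones for the same key. Tombstones are only dropped when
    compacting into the bottom (oldest) level, since an even older file
    elsewhere could otherwise "resurrect" the deleted key.
    """
    # Tag every record with its run's age (0 = newest) so that, for equal
    # keys, the k-way merge yields the newest version first.
    tagged = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged = heapq.merge(*tagged, key=lambda rec: (rec[0], rec[1]))

    output, last_key = [], object()
    for key, _, value in merged:
        if key == last_key:
            continue                      # shadowed older version: drop it
        last_key = key
        if value is TOMBSTONE and bottom_level:
            continue                      # nothing older exists: reclaim the space
        output.append((key, value))
    return output

# Example: the newer run overwrites "a" and deletes "c".
newer = [("a", 2), ("c", TOMBSTONE)]
older = [("a", 1), ("b", 7), ("c", 9)]
print(compact([newer, older], bottom_level=True))  # [('a', 2), ('b', 7)]
```

Real engines do the same thing over on-disk blocks and iterators rather than Python lists, which is why compaction is dominated by cheap sequential I/O.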
Compaction in Data Lakes (Engineering Level)
In a Data Lake (using formats like Parquet in Apache Iceberg or Delta Lake), compaction is an "Asynchronous" task that you, the Data Engineer, often have to trigger or configure.
Bin-Packing: The system takes thousands of tiny (e.g., 1MB) Parquet files and combines them into 128MB or 512MB files, a typical target size for good HDFS and S3 throughput.
Sorting/Clustering: During compaction, the data can be re-sorted by a high-frequency filter column (like timestamp or user_id). This makes Predicate Pushdown significantly more effective.
Metadata Cleanup: Compaction allows the system to remove old entries from the manifest files, making the "metadata scan" phase of a query much faster. (A sketch of triggering these operations follows this list.)
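As a rough illustration of what triggering these maintenance jobs looks like in practice, the PySpark sketch below runs bin-packing plus clustering against a Delta Lake table and an Iceberg table. The table names (`events`, `db.events`), the catalog name (`my_catalog`), and the sort column (`user_id`) are placeholders, and the exact syntax depends on the table format and Spark versions you run.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake and/or
# Iceberg extensions; table and catalog names below are placeholders.
spark = SparkSession.builder.appName("compaction-job").getOrCreate()

# Delta Lake: bin-pack small files and cluster the data by a frequent
# filter column so Predicate Pushdown can skip more files.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# Apache Iceberg: rewrite_data_files bin-packs (or sorts) data files
# toward the table's target file size.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table      => 'db.events',
        strategy   => 'sort',
        sort_order => 'user_id'
    )
""")

# Iceberg's rewrite_manifests compacts the metadata layer itself,
# speeding up the "metadata scan" phase of query planning.
spark.sql("CALL my_catalog.system.rewrite_manifests('db.events')")
```

In practice a job like this runs on a schedule (or once a partition stops receiving writes) rather than after every ingest batch.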
Compaction vs. Compression
| Feature | Data Compression | Data Compaction |
| --- | --- | --- |
| Input | Raw bits/values. | Multiple files/fragments. |
| Output | An encoded version of the same data. | Fewer, larger, cleaned-up files. |
| Handles | Data redundancy (e.g., repeating strings). | Logical redundancy (e.g., old versions/deletes). |
| Benefit | Saves storage cost and I/O bandwidth. | Saves metadata overhead and reduces read latency. |
Engineering Takeaway
"Compression is a feature of the File Format (Parquet/Avro). Compaction is a feature of the Storage Engine/Table Format (LSM-Trees/Iceberg)."