# Data Compaction

***

{% embed url="https://dagster.io/glossary/data-compaction" %}

***

Assuming you already have a background in LSM-Trees and storage formats, let's frame compaction specifically as a maintenance and optimization process. In a world where we "never overwrite" data (as in LSM-Trees, Delta Lake, or Iceberg), compaction is the only thing preventing the system from eventually collapsing under its own weight.

***

### Data Compaction

**Compaction** is a background maintenance process that merges multiple data files into a smaller number of larger, more organized files. While **Compression** reduces the size of a single file, Compaction manages the structural health of the entire dataset.

{% embed url="https://claude.ai/public/artifacts/35b47829-1d2d-4518-825e-7ba69a814da9" %}

#### The Why: Why do we need Compaction?

In modern data systems (especially those using LSM-Trees or Cloud Object Storage), we follow an **immutable** pattern: we append new data rather than updating existing data in place. This leads to three major problems:

* The **"Small File Problem"**: Thousands of tiny files (common in streaming ingestion) overwhelm the Metadata layer and increase the overhead of opening/closing files during a query.
* **Space Amplification:** Multiple versions of the same record exist across different files. The old versions are "dead" but still taking up disk space.
* **Read Amplification**: To find a single record, the engine has to check many different files (fragments) to ensure it has the latest version.
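Read amplification is easy to see with a toy model. The sketch below (a hypothetical example; the fragment contents and key names are made up) represents each immutable file as a dict and looks a key up newest-first, counting how many fragments must be checked:

```python
# Hypothetical sketch: why many small fragments hurt reads.
# Each "file" is a dict of key -> value; newer files shadow older ones.
files = [
    {"user:1": "v1", "user:2": "v1"},   # oldest fragment
    {"user:1": "v2"},                   # update appended later
    {"user:2": None},                   # None marks a delete (tombstone)
]

def lookup(key):
    """Read amplification: scan fragments newest-first until the key is found."""
    checks = 0
    for f in reversed(files):
        checks += 1
        if key in f:
            return f[key], checks
    return None, checks

value, files_checked = lookup("user:1")
print(value, files_checked)  # v2 2  (the stale "v1" still occupies space)
```

Note that the old `"v1"` record is never returned but still sits on disk (space amplification), and every lookup may touch several fragments (read amplification). Compaction fixes both by merging the fragments into one file holding only the live versions.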

***

#### Compaction in LSM-Trees (Database Level)

In databases like **Cassandra, RocksDB**, or **BigTable**, compaction is the "heartbeat" of the storage engine.

* **Merging**: It takes several sorted runs (SSTables) and merges them into one larger sorted run. Because the inputs are already sorted, this is a very efficient **Merge Sort** operation.
* **Tombstone Removal:** It identifies records marked for deletion (Tombstones). If the compaction process reaches the "oldest" level, it can safely delete these records forever to reclaim space.
* **Leveled vs. Size-Tiered:**
  * *Size-Tiered:* Merges files of similar size. Good for write-heavy workloads.
  * *Leveled:* Organizes files into distinct "levels." Level 1 is 10x larger than Level 0, etc. This is better for read-heavy workloads because it limits the number of files a read has to check.
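The merge step above can be sketched in a few lines. This is a minimal toy model, not any real engine's code: each sorted run is assumed to be a list of `(key, sequence_number, value)` tuples, with `value=None` acting as a tombstone. `heapq.merge` gives us the efficient merge-sort over already-sorted inputs:

```python
import heapq

# Toy SSTable runs (assumed representation): sorted (key, seq, value) tuples.
run_a = [("apple", 1, "red"), ("cherry", 2, "dark")]
run_b = [("apple", 3, None), ("banana", 4, "yellow")]  # apple deleted later

def compact(*runs, bottom_level=True):
    """Merge sorted runs, keep only the newest version of each key,
    and drop tombstones if this is the oldest (bottom) level."""
    # Merge-sort over pre-sorted inputs: key ascending, sequence descending,
    # so the newest version of each key arrives first.
    merged = heapq.merge(*runs, key=lambda t: (t[0], -t[1]))
    out, last_key = [], object()
    for key, seq, value in merged:
        if key == last_key:
            continue                 # older version: dead data, space reclaimed
        last_key = key
        if value is None and bottom_level:
            continue                 # tombstone can be removed forever
        out.append((key, seq, value))
    return out

print(compact(run_a, run_b))
# [('banana', 4, 'yellow'), ('cherry', 2, 'dark')]
```

With `bottom_level=False` the tombstone for `apple` would be kept and propagated downward, since an even older copy of the key might still exist in a lower level.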

***

#### Compaction in Data Lakes (Engineering Level)

In a Data Lake (using formats like **Parquet** managed by **Apache Iceberg** or **Delta Lake**), compaction is an asynchronous task that you, the Data Engineer, often have to trigger or configure.

* **Bin-Packing**: The system takes thousands of 1MB Parquet files and combines them into 128MB or 512MB files, sizes that sit in the sweet spot for HDFS block alignment and S3 request throughput.
* **Sorting/Clustering**: During compaction, the data can be re-sorted by a high-frequency filter column (like `timestamp` or `user_id`). This makes **Predicate Pushdown** significantly more effective.
* **Metadata Cleanup:** Compaction allows the system to remove old entries from the manifest files, making the "metadata scan" phase of a query much faster.
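The bin-packing step can be sketched as a simple greedy grouping. This is a hypothetical illustration of the planning phase only (which files go into which rewrite task), not any engine's actual implementation; the 128 MB target is the conventional default mentioned above:

```python
# Hypothetical sketch: plan compaction tasks by greedily packing small files
# into bins of roughly the target output size.
TARGET = 128 * 1024 * 1024  # target output file size in bytes

def bin_pack(file_sizes, target=TARGET):
    """Greedy next-fit: accumulate files until the bin reaches the target."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            bins.append(current)           # close this bin, start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 1000 one-megabyte files collapse into ~8 rewrite tasks of ~128 MB each
tasks = bin_pack([1 * 1024 * 1024] * 1000)
print(len(tasks))  # 8
```

Each bin then becomes one rewrite job: read the small files, optionally re-sort the rows, and write a single large Parquet file.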

***

**Compaction vs. Compression**

| **Feature** | **Data Compression**                              | **Data Compaction**                                           |
| ----------- | ------------------------------------------------- | ------------------------------------------------------------- |
| **Input**   | Raw bits/values.                                  | Multiple files/fragments.                                     |
| **Output**  | An encoded version of the same data.              | Fewer, larger, cleaned-up files.                              |
| **Handles** | Data redundancy (e.g., repeating strings).        | Logical redundancy (e.g., old versions/deletes).              |
| **Benefit** | Saves storage cost and I/O bandwidth.             | Saves metadata overhead and reduces read latency.             |
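The contrast in the table can be shown side by side. In this toy example (the payloads and key names are invented for illustration), compression shrinks the bytes of one file, while compaction drops logically dead records across files:

```python
import zlib

# Compression: same data, fewer bytes (repetitive payload encodes well).
records = b"user:1=alice;" * 100
compressed = zlib.compress(records)
print(len(records), "->", len(compressed))

# Compaction: fewer files, and the superseded version disappears entirely.
file_old = {"user:1": "alice", "user:2": "bob"}
file_new = {"user:1": "alicia"}               # newer version of user:1
compacted = {**file_old, **file_new}          # old "alice" record is dropped
print(compacted)  # {'user:1': 'alicia', 'user:2': 'bob'}
```

Decompressing gives back every byte you put in; compacting deliberately does not, because the old versions and deletes are no longer part of the logical table.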

#### Engineering Takeaway

> "Compression is a feature of the **File Format** (Parquet/Avro). Compaction is a feature of the **Storage Engine/Table Format** (LSM-Trees/Iceberg)."

***
