Update patterns


Updating persisted data is one of the hardest parts of data engineering, especially when teams move between technologies. Early data lakes did not account for updates at all, which led to inefficient pipelines that recomputed large datasets from scratch. Modern lakehouses fully embrace updates, driven in part by regulatory requirements such as GDPR and by the need to maintain accurate, current datasets.

Several update patterns exist, each with trade-offs:

Truncate-and-Reload (aka Full Refresh or Destructive Load)

  • How it works: You wipe the destination table completely (TRUNCATE) and reload the entire dataset from the source every time.

  • Best for: Small reference tables (e.g., a list of US States) where data volume is low and consistency is prioritized over speed.

  • Trade-off: simple and reliable to implement (no complex merge logic), but expensive and slow as data grows.

  • Still used for slowly changing datasets or nightly batches.
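
As a sketch of the pattern above, using sqlite3 purely for illustration: the us_states table and the source_rows list are hypothetical stand-ins for the real destination table and the upstream extract.

```python
import sqlite3

# Hypothetical upstream extract; in practice this would come from the source system.
source_rows = [("CA", "California"), ("NY", "New York"), ("TX", "Texas")]

conn = sqlite3.connect("warehouse.db")

with conn:  # one transaction, so readers never observe a half-loaded table
    conn.execute("CREATE TABLE IF NOT EXISTS us_states (code TEXT PRIMARY KEY, name TEXT)")
    conn.execute("DELETE FROM us_states")  # SQLite has no TRUNCATE; DELETE plays that role here
    conn.executemany("INSERT INTO us_states VALUES (?, ?)", source_rows)

print(conn.execute("SELECT COUNT(*) FROM us_states").fetchone()[0])
conn.close()
```

In a warehouse this usually becomes TRUNCATE TABLE followed by INSERT INTO ... SELECT from a staging table, wrapped in a transaction or a table swap so consumers never see an empty table.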


Insert-Only

  • New records are appended; old ones are never modified.

  • The “current state” must be reconstructed by selecting the newest record per key.

  • Can be costly during reads unless a materialized view or secondary table maintains the current snapshot.

  • Avoid single-row inserts in columnar systems—use micro-batches instead.

  • Works well with systems like BigQuery or Druid’s streaming buffer.
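
A minimal sketch of insert-only plus current-state reconstruction, again using sqlite3 for illustration (and assuming a SQLite build with window-function support, 3.25+); the customer_versions table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_versions (
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT    -- timestamp of the change in the source system
    )
""")

# Append-only writes: every change is a new row, loaded as a micro-batch
# rather than one INSERT per record.
micro_batch = [
    (1, "a@old.example", "2024-01-01T00:00:00"),
    (1, "a@new.example", "2024-03-01T00:00:00"),
    (2, "b@example.com", "2024-02-01T00:00:00"),
]
with conn:
    conn.executemany("INSERT INTO customer_versions VALUES (?, ?, ?)", micro_batch)

# Reconstruct the current state: the newest record per key.
current = conn.execute("""
    SELECT customer_id, email, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM customer_versions
    )
    WHERE rn = 1
""").fetchall()
print(current)
```

The final query is what a materialized view or maintained snapshot table would precompute, so readers pay the deduplication cost once instead of on every query.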


Delete Patterns

  • Hard delete: permanently remove records (often required by GDPR).

  • Soft delete: mark records as deleted (e.g., with an is_deleted flag or a deleted_at timestamp) so queries can filter them out.

  • Insert deletion: append a new version of the record with a delete flag.

  • Deletes are usually expensive in columnar and file-based systems, because affected files must be rewritten rather than edited in place.
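
The three variants side by side, sketched with sqlite3; the users table, its is_deleted flag, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("""
        CREATE TABLE users (
            user_id    INTEGER,
            email      TEXT,
            is_deleted INTEGER DEFAULT 0,   -- soft-delete marker
            updated_at TEXT
        )
    """)
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, 0, ?)",
        [(1, "a@example.com", "2024-01-01"),
         (2, "b@example.com", "2024-01-01"),
         (3, "c@example.com", "2024-01-01"),
         (4, "d@example.com", "2024-01-01")],
    )

with conn:
    # Hard delete: the record is physically removed (e.g., a GDPR erasure request).
    conn.execute("DELETE FROM users WHERE user_id = 1")

    # Soft delete: the record stays, but every query must filter it out.
    conn.execute(
        "UPDATE users SET is_deleted = 1, updated_at = '2024-06-01' WHERE user_id = 2")

    # Insert-style deletion: append a new version carrying the delete flag,
    # the way an insert-only pipeline would.
    conn.execute("INSERT INTO users VALUES (?, ?, 1, ?)", (3, "c@example.com", "2024-06-01"))

# Live records: newest version per key that is not flagged as deleted.
live = conn.execute("""
    SELECT user_id, email FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY updated_at DESC) AS rn
        FROM users
    ) WHERE rn = 1 AND is_deleted = 0
""").fetchall()
print(live)   # only user 4 survives all three delete styles
```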


Upsert / Merge

  • Combines updates and inserts depending on whether a matching key exists.

  • Merges originated in row-based databases, where in-place updates are natural.

  • File-based systems cannot update in place—they rely on copy-on-write (COW) to rewrite affected file chunks.

  • Upserts are often the most expensive update pattern in distributed columnar storage.

  • Still, for large batches, a columnar merge can outperform the equivalent workload on a transactional row store.
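
To make the copy-on-write point concrete, here is a deliberately simplified sketch in plain Python: the file layout, the merge_into_file helper, and the JSON "chunk" all stand in for what a real table format (with its transaction log, statistics, and columnar files) does far more efficiently.

```python
import json
from pathlib import Path

def merge_into_file(path: Path, incoming: dict) -> None:
    """Copy-on-write upsert: rewrite the whole chunk that contains any
    matched key instead of updating rows in place.

    `incoming` maps primary key -> new column values (hypothetical layout).
    """
    existing = {row["id"]: row for row in json.loads(path.read_text())}

    for key, values in incoming.items():
        if key in existing:                  # matching key -> update
            existing[key].update(values)
        else:                                # no match -> insert
            existing[key] = {"id": key, **values}

    # Write a brand-new file and swap it in; the old file is never modified,
    # which is why even a one-row change rewrites the whole chunk.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(list(existing.values()), indent=2))
    tmp.replace(path)

# Toy single-chunk "table".
chunk = Path("orders-part-0000.json")
chunk.write_text(json.dumps([
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
]))
merge_into_file(chunk, {2: {"status": "shipped"},   # update existing key
                        3: {"status": "open"}})     # insert new key
print(chunk.read_text())
```

Because each merge rewrites whole files regardless of how few rows actually change, frequent small merges amplify write volume, which is exactly where the factors below start to dominate.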

Performance depends heavily on:

  • Partitioning strategy

  • Clustering / sorting of files

  • Minimizing merge frequency

  • Avoiding row-level real-time merges

Many teams get into trouble by applying CDC-based, per-record merges in systems that were not designed for them.
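
One common mitigation is to buffer the CDC stream and collapse it before touching storage, so each interval applies a single batched merge instead of a merge per change event. A sketch with sqlite3 follows; its INSERT ... ON CONFLICT upsert (available since SQLite 3.24) stands in for a warehouse MERGE, and the accounts table, the event shape, and collapse_to_latest are hypothetical.

```python
import sqlite3

def collapse_to_latest(events):
    """Keep only the last change per key so the merge touches each row once.
    Events are (key, value, commit_ts) tuples, assumed to arrive in commit order."""
    latest = {}
    for key, value, ts in events:
        latest[key] = (value, ts)
    return [(k, v, ts) for k, (v, ts) in latest.items()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, updated_at TEXT)")

# A buffered micro-batch of CDC events; the same key may appear many times.
cdc_buffer = [
    (1, 100.0, "2024-06-01T10:00:00"),
    (1, 110.0, "2024-06-01T10:00:05"),
    (2,  50.0, "2024-06-01T10:00:07"),
    (1, 120.0, "2024-06-01T10:00:09"),
]

with conn:
    # One batched upsert per interval instead of one merge per CDC record.
    conn.executemany(
        """
        INSERT INTO accounts (id, balance, updated_at) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            balance    = excluded.balance,
            updated_at = excluded.updated_at
        """,
        collapse_to_latest(cdc_buffer),
    )

print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
```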


A note on taking a careful approach to update patterns

Modern data engineering pipelines require careful planning around how data is updated. Update patterns—truncate/reload, insert-only, deletes, and merges—must be matched to the capabilities of the underlying storage system. Lakehouse table formats and CDC ingestion strategies now make updates feasible, but merge frequency, file layout, and partitioning remain critical to achieving good performance.

