Update patterns
Updating persisted data is one of the hardest parts of data engineering, especially when teams move between technologies. Early data lakes did not account for updates at all, which led to inefficient pipelines that recomputed large datasets from scratch. Modern lakehouses fully embrace updates—driven in part by regulatory requirements (GDPR) and the need to maintain accurate, current datasets.
Several update patterns exist, each with trade-offs:
Truncate-and-Reload (aka Full Refresh or Destructive Load)
How it works: You wipe the destination table completely (TRUNCATE) and reload the entire dataset from the source every time.
Best for: Small reference tables (e.g., a list of US states) where data volume is low and consistency is prioritized over speed.
Trade-off: Simple and reliable to implement (no complex merge logic), but expensive and slow as data grows.
Still used for slowly changing datasets or nightly batches.
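As a minimal sketch of the pattern above, assuming a BigQuery-style SQL engine and hypothetical table names (analytics.dim_us_states as the destination, raw.us_states as the source), a nightly full refresh might look like this:

    -- Destructive load: wipe the destination, then reload it in full from the source.
    TRUNCATE TABLE analytics.dim_us_states;

    INSERT INTO analytics.dim_us_states (state_code, state_name, loaded_at)
    SELECT
      state_code,
      state_name,
      CURRENT_TIMESTAMP()
    FROM raw.us_states;

Because the table is rebuilt from scratch on every run, there is no merge logic to maintain, which is why the pattern remains attractive for small, slowly changing reference data.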
Insert-Only
New records are appended; old ones are never modified.
The “current state” must be reconstructed by selecting the newest record per key.
Can be costly during reads unless a materialized view or secondary table maintains the current snapshot.
Avoid single-row inserts in columnar systems—use micro-batches instead.
Works well in systems designed for streaming appends, such as BigQuery’s streaming buffer or Druid.
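As an illustration of reconstructing the current state, assuming a hypothetical append-only table analytics.customer_versions keyed by customer_id with an updated_at timestamp, a window function in BigQuery-style SQL keeps only the newest version of each key:

    -- Reconstruct the current snapshot: newest record per key from an append-only table.
    SELECT * EXCEPT (rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (
          PARTITION BY customer_id
          ORDER BY updated_at DESC
        ) AS rn
      FROM analytics.customer_versions
    )
    WHERE rn = 1;

In practice this query is usually wrapped in a view or scheduled into a snapshot table so that readers do not pay the reconstruction cost on every query.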
Delete Patterns
Hard delete: permanently remove records (often required by GDPR).
Soft delete: mark records as deleted.
Insert deletion: append a new version of the record with a delete flag.
Deletes are usually expensive in columnar and file-based systems.
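The three variants above can be sketched as follows, again in BigQuery-style SQL with hypothetical names: an analytics.customers table (with is_deleted and deleted_at columns for the soft-delete case), an append-only analytics.customer_versions table, and @customer_id standing for a query parameter.

    -- Hard delete: physically remove the record (often required for GDPR erasure).
    DELETE FROM analytics.customers
    WHERE customer_id = @customer_id;

    -- Soft delete: keep the record but flag it so queries can filter it out.
    UPDATE analytics.customers
    SET is_deleted = TRUE,
        deleted_at = CURRENT_TIMESTAMP()
    WHERE customer_id = @customer_id;

    -- Insert deletion: append a new version of the record carrying a delete flag.
    INSERT INTO analytics.customer_versions (customer_id, is_deleted, updated_at)
    VALUES (@customer_id, TRUE, CURRENT_TIMESTAMP());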
Upsert / Merge
Combines updates and inserts depending on whether a matching key exists.
Merges originated in row-based databases, where in-place updates are natural.
File-based systems cannot update in place—they rely on copy-on-write (COW) to rewrite affected file chunks.
Upserts are often the most expensive update pattern in distributed columnar storage.
Still, large batch merges may outperform transactional row stores.
Performance depends heavily on:
Partitioning strategy
Clustering / sorting of files
Minimizing merge frequency
Avoiding row-level real-time merges
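The first two factors are largely fixed when the table is declared. As a hedged example in BigQuery-style DDL, with a hypothetical analytics.orders table, partitioning by date and clustering on the merge key lets later merges rewrite only the partitions and blocks they actually touch:

    -- Lay out the table so merges prune to the partitions and blocks they touch.
    CREATE TABLE analytics.orders
    (
      order_id    STRING,
      customer_id STRING,
      order_date  DATE,
      amount      NUMERIC,
      updated_at  TIMESTAMP
    )
    PARTITION BY order_date
    CLUSTER BY order_id;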
Many teams get into trouble by running CDC-driven, per-record merges in systems not designed for them; batching those changes into periodic merges, as sketched below, is usually the better fit.
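A batch-oriented merge over that layout might look like the following sketch, still in BigQuery-style SQL with hypothetical names: staging.orders_updates holds one micro-batch of CDC changes, and the date predicate restricts the merge to recent partitions (this assumes incoming changes only touch recent orders).

    -- Batch upsert: apply one micro-batch of changes rather than per-record merges.
    -- Assumption: incoming changes only concern orders from the last 7 days,
    -- so the date predicate safely prunes older partitions.
    MERGE analytics.orders AS t
    USING staging.orders_updates AS s
      ON t.order_id = s.order_id
      AND t.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    WHEN MATCHED THEN
      UPDATE SET
        customer_id = s.customer_id,
        amount      = s.amount,
        updated_at  = s.updated_at
    WHEN NOT MATCHED THEN
      INSERT (order_id, customer_id, order_date, amount, updated_at)
      VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.updated_at);

Running this on a schedule (for example, hourly or every few minutes) keeps merge frequency low, which is usually far cheaper than issuing a merge per change event.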
A note on approaching update patterns carefully
Modern data engineering pipelines require careful planning around how data is updated. Update patterns—truncate/reload, insert-only, deletes, and merges—must be matched to the capabilities of the underlying storage system. Lakehouse table formats and CDC ingestion strategies now make updates feasible, but merge frequency, file layout, and partitioning remain critical to achieving good performance.