Limitations of Arrow


Arrow's Fundamental Architecture: Movement vs. Management

There is a critical distinction in data engineering between moving data and managing it, and Arrow sits firmly on the movement side.

The adbc_ingest operation with mode="append" is not idempotent. Running the same ingestion code twice will result in duplicate data in the target table. ADBC at this level functions as a high-throughput pipe—it moves bytes as quickly as possible without tracking what has already been written.
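A minimal sketch of the failure mode, assuming the Postgres ADBC driver (the connection URI, table name, and columns are illustrative):

```python
import adbc_driver_postgresql.dbapi
import pyarrow as pa

# Two rows of sample data.
batch = pa.table({"row_id": [123, 124], "status": ["new", "new"]})

URI = "postgresql://user:pass@localhost:5432/analytics"  # placeholder

with adbc_driver_postgresql.dbapi.connect(URI) as conn:
    with conn.cursor() as cur:
        # Each call blindly appends the same two rows; ADBC keeps no
        # record of what was already written, so a retry doubles the data.
        cur.adbc_ingest("orders", batch, mode="append")
        cur.adbc_ingest("orders", batch, mode="append")  # duplicates!
    conn.commit()
```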

The Idempotency Gap

Arrow (and by extension Parquet on S3) is primarily a Data Movement and Analytics toolset. It provides the high-speed rails but lacks the traffic control—the metadata layer—needed for true row-level merge operations.

Arrow's Strength: Lightning-fast, one-time movements or complete partition overwrites.

Arrow's Limitation: It cannot peek into a file on S3 to determine if row_id=123 already exists before writing a new file.

Achieving Idempotency Without Table Formats

To make ADBC-based ingestion idempotent without introducing complex layers like Iceberg, the logic must be handled manually using SQL in the target database (Postgres, Snowflake, etc.).

The Staging Table Pattern:

  1. Use ADBC to ingest data into a temporary staging table (mode="create" the first time, or mode="replace" to drop and recreate it on each run).

  2. Execute a SQL MERGE or INSERT ... ON CONFLICT statement within the database to reconcile the staging table with the final table.

  3. Drop or truncate the staging table for the next run.

This approach delegates idempotency enforcement to the database engine rather than the Arrow layer.
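A minimal sketch of the pattern against Postgres, assuming a unique constraint on row_id in the final table (the URI, table names, and columns are illustrative):

```python
import adbc_driver_postgresql.dbapi
import pyarrow as pa

batch = pa.table({"row_id": [123, 124], "status": ["shipped", "pending"]})

URI = "postgresql://user:pass@localhost:5432/analytics"  # placeholder

with adbc_driver_postgresql.dbapi.connect(URI) as conn:
    with conn.cursor() as cur:
        # 1. Bulk-load into the staging table; "replace" drops and
        #    recreates it, so every run starts from a clean slate.
        cur.adbc_ingest("orders_staging", batch, mode="replace")

        # 2. Reconcile with the final table; the database enforces
        #    idempotency through the unique key on row_id.
        cur.execute("""
            INSERT INTO orders (row_id, status)
            SELECT row_id, status FROM orders_staging
            ON CONFLICT (row_id) DO UPDATE SET status = EXCLUDED.status
        """)

        # 3. Clean up for the next run.
        cur.execute("DROP TABLE orders_staging")
    conn.commit()
```

Because the merge runs as a single statement inside the database, a failed job can simply be re-run from the top without creating duplicates.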

When Arrow Excels for Persistence

Arrow isn't limited to one-time movement—it thrives in Immutable Architectures.

When data follows an append-only or replace-partition philosophy, Arrow is the optimal choice:

Scenario A (Event Logs): Writing raw application logs. Old entries are never modified; new events are simply appended.

Scenario B (Daily Snapshots): Replacing the entire day=2026-02-06 partition each time the job executes.
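As a sketch of Scenario B, a daily snapshot job can stay idempotent at partition granularity with plain pyarrow; the bucket path and columns are illustrative, and credentials are assumed to come from the environment:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

snapshot = pa.table({"order_id": [1, 2], "status": ["shipped", "pending"]})

s3 = fs.S3FileSystem(region="us-east-1")
partition = "my-bucket/orders/day=2026-02-06"  # placeholder path

# Wipe the old partition, then write the fresh snapshot. Re-running the
# job reproduces the same end state, so it is idempotent at partition
# granularity without any table format.
s3.delete_dir_contents(partition, missing_dir_ok=True)
pq.write_table(snapshot, f"{partition}/part-0.parquet", filesystem=s3)
```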

In these patterns, Arrow is the gold standard. However, the moment a use case requires updating a specific record from days prior—such as changing an order's status—the architecture has crossed into the domain of Table Formats (Iceberg, Delta Lake).

Why Additional Layers Became Industry Standard

Apache Iceberg and Delta Lake are deployed on top of Parquet files in S3 precisely to close the gap identified above.

These layers maintain a Manifest File (typically JSON or Avro) that tracks which Parquet files are currently valid. During an upsert operation, the layer performs four steps, sketched in code after this list:

  1. Reads the relevant existing Parquet file

  2. Updates the specific row in memory

  3. Writes a new Parquet file with the updated data

  4. Updates the Manifest to reference the new file and mark the old one as inactive
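The following toy sketch walks those four steps with pyarrow and a bare-bones JSON manifest. The manifest layout and file names are purely illustrative; real Iceberg or Delta manifests additionally track snapshots, statistics, and schema evolution.

```python
import json
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def upsert_status(manifest_path: str, row_id: int, new_status: str) -> None:
    # The manifest lists the Parquet files that currently hold the truth,
    # e.g. {"valid_files": ["part-0.parquet"]}.
    with open(manifest_path) as f:
        manifest = json.load(f)

    for old_file in list(manifest["valid_files"]):
        table = pq.read_table(old_file)              # 1. read existing file
        mask = pc.equal(table["row_id"], row_id)
        if not pc.any(mask).as_py():
            continue                                 # row lives elsewhere
        updated = pc.if_else(mask, pa.scalar(new_status), table["status"])
        table = table.set_column(                    # 2. update row in memory
            table.schema.get_field_index("status"), "status", updated)
        new_file = old_file.replace(".parquet", ".v2.parquet")
        pq.write_table(table, new_file)              # 3. write a new file
        manifest["valid_files"].remove(old_file)     # 4. swap manifest entry;
        manifest["valid_files"].append(new_file)     #    old file is now dead

    with open(manifest_path, "w") as f:
        json.dump(manifest, f)
```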

Arrow serves as the engine that makes those reads and writes fast, but Iceberg acts as the librarian that knows which file contains the current truth.

This separation of concerns—Arrow for performance, table formats for correctness—has become the architectural pattern for modern data lakes.

