Apache Arrow and PyArrow


Docs

ADBC

Python API



PyArrow: units

Here's the hierarchy of PyArrow's units, from simplest to most complex (a short sketch follows the list):

1. Array (simplest unit)

  • A single column of data

  • Example: pa.array([1, 2, 3])

  • One data type, one-dimensional

2. RecordBatch (collection of arrays)

  • Multiple equal-length arrays bundled together

  • Guaranteed to be contiguous in memory (single chunk)

  • Example: A batch with columns [user_id, score, name]

  • This is your "unit of execution" for most operations: it balances structure (schema + multiple columns) with performance (a single contiguous chunk)

3. Table (collection of RecordBatches)

  • Can contain multiple RecordBatches (chunked data)

  • Not guaranteed to be contiguous

  • Example: Combine 10 RecordBatches into one Table
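
A minimal sketch of all three units side by side (values and column names are illustrative):

```python
import pyarrow as pa

# Array: a single typed, one-dimensional column
arr = pa.array([1, 2, 3])

# RecordBatch: equal-length arrays under one schema, one contiguous chunk
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2]), pa.array(["a", "b"])],
    names=["user_id", "name"],
)

# Table: one or more RecordBatches stitched into chunked columns
table = pa.Table.from_batches([batch, batch])
print(table.column("user_id").num_chunks)  # 2 -> chunked, not contiguous
```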


PyArrow: RecordBatch

This guide covers the essential RecordBatch patterns you might want to use in modern data engineering. The key takeaways: use explicit schemas, leverage zero-copy operations, handle nulls carefully, and remember that RecordBatches are immutable but cheap to reconstruct.


Creating a RecordBatch and Basic Operations

A RecordBatch is a collection of equal-length arrays stored contiguously in memory—your fundamental unit of execution.
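
A minimal sketch of constructing one and poking at it (column names are illustrative):

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [
        pa.array([1, 2, 3], type=pa.int64()),
        pa.array([0.5, 0.9, 0.7], type=pa.float64()),
        pa.array(["ann", "bob", None], type=pa.string()),
    ],
    names=["user_id", "score", "name"],
)

print(batch.num_rows, batch.num_columns)  # 3 3
print(batch.schema)                       # column names + types
print(batch.column("score"))              # access a column by name
print(batch.slice(1, 2))                  # zero-copy view of rows 1..2
```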


Struct Arrays: Nested Data in Columns

Struct Arrays let you represent nested objects (like JSON) while maintaining columnar performance.
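
For instance, a location struct with lat/lon fields (a sketch; the field names are made up):

```python
import pyarrow as pa

# One struct column keeps the nested fields together while staying columnar
locations = pa.array(
    [
        {"lat": 37.77, "lon": -122.42},
        {"lat": 40.71, "lon": -74.00},
        None,  # an entire struct value can be null
    ],
    type=pa.struct([("lat", pa.float64()), ("lon", pa.float64())]),
)

batch = pa.RecordBatch.from_arrays(
    [pa.array(["sf", "nyc", "unknown"]), locations],
    names=["city", "location"],
)
print(batch.schema)  # location: struct<lat: double, lon: double>
```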


Flattening: Breaking Structs Apart

When you need to operate on individual fields (e.g., calculate on all latitudes), flatten the struct.
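
A sketch continuing the lat/lon example above:

```python
import pyarrow as pa
import pyarrow.compute as pc

locations = pa.array(
    [{"lat": 37.77, "lon": -122.42}, {"lat": 40.71, "lon": -74.00}],
    type=pa.struct([("lat", pa.float64()), ("lon", pa.float64())]),
)

# flatten() splits the struct into its child arrays
lat, lon = locations.flatten()
print(pc.mean(lat))            # operate on a single field in isolation

# Or pull out one child by name without flattening everything
print(locations.field("lon"))
```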


Conversion to Native Python (Handling Nulls)

When moving data back to Python objects, handle None values explicitly.
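
A sketch of the two common patterns: drop nulls on the Arrow side, or guard against None on the Python side:

```python
import pyarrow as pa
import pyarrow.compute as pc

scores = pa.array([0.5, None, 0.7])

# to_pylist() maps Arrow nulls to Python None
values = scores.to_pylist()                  # [0.5, None, 0.7]
cleaned = [v for v in values if v is not None]

# Or drop the nulls before leaving Arrow
non_null = pc.drop_null(scores).to_pylist()  # [0.5, 0.7]
```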


Pro-Level Features for Data Engineering

Zero-Copy NumPy Views

For numerical operations, create NumPy views instead of converting to Python lists—they share the same memory.
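
A minimal sketch; note that the zero-copy path only works for fixed-width types without nulls:

```python
import pyarrow as pa

scores = pa.array([0.5, 0.9, 0.7], type=pa.float64())

# zero_copy_only=True raises instead of silently copying
# (e.g. if the array had nulls or a non-numeric type)
view = scores.to_numpy(zero_copy_only=True)

print(view.mean())   # NumPy math on Arrow-owned memory, no copy made
```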

Schema Evolution

RecordBatches are immutable, but you can efficiently create new ones with modified schemas.
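
Because the column arrays are reused by reference, "adding" a column is just a matter of rebuilding the batch (a sketch with illustrative names):

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2]), pa.array([0.5, 0.9])],
    names=["user_id", "score"],
)

# Rebuild with an extra column; the existing arrays are not copied
region = pa.array(["us", "eu"])
new_batch = pa.RecordBatch.from_arrays(
    list(batch.columns) + [region],
    names=batch.schema.names + ["region"],
)
print(new_batch.schema)
```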

The Schema Object: Your Data Contract

Always define schemas explicitly. This is your pipeline's contract and catches errors early.
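
A sketch of declaring the contract up front instead of relying on type inference (field names are illustrative):

```python
import pyarrow as pa

schema = pa.schema([
    pa.field("user_id", pa.int64(), nullable=False),
    pa.field("score", pa.float64()),
    pa.field("name", pa.string()),
])

# Building against the schema: type mismatches raise immediately
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2]), pa.array([0.5, 0.9]), pa.array(["ann", "bob"])],
    schema=schema,
)
print(batch.schema)
```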


Bonus: RecordBatch vs Table
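
The short version: a Table is one or more RecordBatches behind a shared schema, and converting between the two is cheap (a sketch):

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2]), pa.array(["a", "b"])], names=["id", "label"]
)

# Wrapping batches in a Table is zero-copy
table = pa.Table.from_batches([batch, batch])
print(table.num_rows)                  # 4
print(table.column("id").num_chunks)   # 2: the column is chunked, not contiguous

# And back: to_batches() hands you the underlying RecordBatches
for b in table.to_batches():
    print(b.num_rows)                  # 2, 2
```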


PyArrow: Tables

Working with Chunked Columnar Data

Tables are your go-to structure for larger datasets. Unlike RecordBatches, Tables can contain multiple chunks and provide richer operations for data manipulation.

1. Creating Tables and Basic Operations

Tables can be created from RecordBatches, arrays, dictionaries, or Pandas DataFrames.
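
A sketch of the common constructors (the pandas path is commented out so the snippet runs without pandas installed):

```python
import pyarrow as pa

# From a dict of columns
table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.7]})

# From arrays with explicit names
table2 = pa.Table.from_arrays(
    [pa.array([1, 2]), pa.array(["a", "b"])], names=["id", "label"]
)

# From a pandas DataFrame
# import pandas as pd
# table3 = pa.Table.from_pandas(pd.DataFrame({"x": [1, 2]}))

print(table.schema)
print(table.num_rows, table.num_columns)
```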


2. Adding, Removing, and Renaming Columns

Tables support schema evolution operations that RecordBatches don't.
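
A sketch; every call returns a new Table (the original is untouched), and drop_columns/rename_columns assume a reasonably recent PyArrow:

```python
import pyarrow as pa

table = pa.table({"user_id": [1, 2], "score": [0.5, 0.9]})

table = table.append_column("region", pa.array(["us", "eu"]))
table = table.drop_columns(["score"])
table = table.rename_columns(["id", "region"])

print(table.column_names)   # ['id', 'region']
```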


3. Filtering and Sorting

Tables support powerful filtering and sorting operations using PyArrow compute functions.
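
A sketch using pyarrow.compute for the boolean mask and Table.sort_by for ordering:

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.7]})

# Filter with a boolean mask computed column-wise
mask = pc.greater(table["score"], 0.6)
high = table.filter(mask)

# Sorting returns a new, reordered Table
ranked = high.sort_by([("score", "descending")])
print(ranked.to_pydict())   # {'user_id': [2, 3], 'score': [0.9, 0.7]}
```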


4. Grouping and Aggregation

Tables can be grouped and aggregated with Table.group_by() and aggregate().
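
A sketch (column names are illustrative); each (column, aggregation) pair becomes an output column named like score_mean:

```python
import pyarrow as pa

table = pa.table({
    "region": ["us", "eu", "us", "eu"],
    "score":  [0.5, 0.9, 0.7, 0.4],
})

summary = table.group_by("region").aggregate([
    ("score", "mean"),
    ("score", "count"),
])
print(summary.to_pydict())
# one row per region, with score_mean and score_count alongside the key
```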


5. Combining Tables

Tables can be concatenated vertically (adding rows) or joined horizontally.
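
A sketch of both directions, vertical with concat_tables and horizontal with join:

```python
import pyarrow as pa

t1 = pa.table({"user_id": [1, 2], "score": [0.5, 0.9]})
t2 = pa.table({"user_id": [3], "score": [0.7]})

# Vertical: existing chunks are kept as-is, so this is cheap
stacked = pa.concat_tables([t1, t2])
print(stacked.num_rows)   # 3

# Horizontal: join on a key column
names = pa.table({"user_id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
joined = stacked.join(names, keys="user_id")
print(joined.column_names)   # ['user_id', 'score', 'name']
```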


6. Working with Chunked Columns

Tables contain ChunkedArrays, which can have multiple underlying chunks.
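
A sketch showing where the chunks come from and how to flatten them when you need contiguity:

```python
import pyarrow as pa

table = pa.concat_tables([
    pa.table({"x": [1, 2]}),
    pa.table({"x": [3, 4, 5]}),
])

col = table.column("x")        # a ChunkedArray, not a plain Array
print(col.num_chunks)          # 2
print(col.chunk(0))            # the first underlying Array: [1, 2]

# combine_chunks() copies everything into a single contiguous chunk
flat = table.combine_chunks()
print(flat.column("x").num_chunks)   # 1
```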


7. Converting Between Formats

Tables provide rich conversion capabilities to and from other formats.
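
A sketch of a few round trips (the Parquet filename is made up; to_pandas needs pandas installed):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.7]})

as_dict = table.to_pydict()     # plain dict of Python lists
# df = table.to_pandas()        # pandas DataFrame

# Parquet round trip
pq.write_table(table, "scores.parquet")
restored = pq.read_table("scores.parquet")

# Back to RecordBatches for streaming-style processing
batches = restored.to_batches(max_chunksize=2)
print(len(batches))             # 2
```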


8. Pro Pattern: Lazy Table Operations

Chain operations efficiently without materializing intermediate results.
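
One way to get this effect, as a sketch: Table compute calls themselves run eagerly, so for genuinely deferred work you can push the column projection and filter into a dataset scan, where nothing is read or materialized until to_table(). The "events/" directory of Parquet files here is hypothetical:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("events/", format="parquet")   # hypothetical data directory

# The scan is declarative: projection and filter are pushed down,
# and work only happens when to_table() is called.
table = dataset.to_table(
    columns=["user_id", "score"],
    filter=ds.field("score") > 0.6,
)
```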


Key Differences: RecordBatch vs Table


The pattern: Use RecordBatches for low-level, performance-critical operations where you need memory contiguity. Use Tables for higher-level data manipulation and transformations, and when your data arrives in many chunks (streamed batches, multiple files); for truly larger-than-memory data, move up to the dataset API rather than a single Table.

