PyArrow


PyArrow Dataset API (For Larger-Than-Memory Data)

The Dataset API is PyArrow's solution for working with datasets that are too big to fit in your RAM or are split across hundreds of files (e.g., a folder full of Parquet files partitioned by date).

Instead of loading everything at once, it uses lazy loading. It scans the file metadata first and only loads the specific rows and columns you ask for.

Here is how you use it to filter a massive dataset without running out of memory:

import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow as pa
import shutil
import os

# --- Setup: Let's pretend we have a massive dataset ---
# We will create a folder with 2 separate parquet files
os.makedirs('my_big_dataset', exist_ok=True)

table1 = pa.Table.from_pydict({'id': [1, 2, 3], 'year': [2020, 2020, 2020]})
table2 = pa.Table.from_pydict({'id': [4, 5, 6], 'year': [2021, 2021, 2021]})

pq.write_table(table1, 'my_big_dataset/part1.parquet')
pq.write_table(table2, 'my_big_dataset/part2.parquet')
# -------------------------------------------------------

# 1. Initialize the Dataset
# This is instant. It just reads the file paths/metadata, not the data itself.
dataset = ds.dataset('my_big_dataset', format='parquet')

print(f"Total files found: {dataset.files}")

# 2. Scan with a Filter (Predicate Pushdown)
# We want data ONLY from 2021.
# PyArrow is smart enough to skip 'part1.parquet' entirely because 
# it knows (from metadata) that file doesn't contain year == 2021.
scanner = dataset.scanner(
    filter=ds.field('year') == 2021,
    columns=['id', 'year']
)

# 3. Materialize to Table
# Only NOW does it actually load data into RAM.
result_table = scanner.to_table()

print("\n--- Filtered Results ---")
print(result_table.to_pandas())

# Cleanup
shutil.rmtree('my_big_dataset')

Why this matters: If you had 1 TB of data but only needed rows where year == 2021, the Dataset API might read only the ~500 MB of files that actually contain those rows, skipping the other 999.5 GB entirely.
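
If even the filtered result is too large to hold as a single Table, you can stream it instead of materializing it. A minimal sketch, assuming a folder of Parquet files like the my_big_dataset created above (before the cleanup step):

import pyarrow.dataset as ds

dataset = ds.dataset('my_big_dataset', format='parquet')

total_rows = 0
# to_batches() yields RecordBatches one at a time, so peak memory
# stays around the size of a single batch, not the whole dataset.
for batch in dataset.to_batches(columns=['id'], filter=ds.field('year') == 2021):
    total_rows += batch.num_rows

print(f"Matching rows: {total_rows}")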


PyArrow + DuckDB

Because DuckDB and PyArrow both use the Arrow memory format, DuckDB can run SQL queries directly on your Python variables (PyArrow Tables, Datasets, or even Pandas DataFrames) without moving or copying a single byte of data.

This essentially turns your Python script into a high-performance SQL engine.

The "Zero-Copy" SQL Engine

Usually, if you want to run SQL queries against a dataframe, you first have to load it into a database (like SQLite), which duplicates the data.

  • DuckDB: It points at your PyArrow table's existing Arrow buffers in memory and queries it in place.

  • Result: You can run complex SQL joins and aggregations on millions of rows in milliseconds.

Code Example: SQL on PyArrow

Here is how you can use DuckDB to query a PyArrow table and join it with a separate Parquet file, all using SQL.
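
A minimal sketch (the orders table and users.parquet file are made-up placeholder data):

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# An in-memory PyArrow table (placeholder data)
orders = pa.Table.from_pydict({
    'user_id': [1, 2, 1, 3],
    'amount': [10.0, 25.5, 7.25, 40.0],
})

# A separate Parquet file on disk (placeholder data)
pq.write_table(
    pa.Table.from_pydict({'user_id': [1, 2, 3], 'name': ['Ada', 'Bob', 'Cleo']}),
    'users.parquet'
)

# DuckDB scans the Arrow table in place (no copy), reads the Parquet
# file itself, and joins the two with plain SQL.
result = duckdb.sql("""
    SELECT u.name, SUM(o.amount) AS total_spent
    FROM orders o
    JOIN 'users.parquet' u USING (user_id)
    GROUP BY u.name
    ORDER BY total_spent DESC
""").arrow()

print(result)

DuckDB resolves the name orders to the local Python variable via a replacement scan, so the Arrow table is queried right where it sits in memory.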

Advanced: Joining Data Sources

The real power comes when you mix sources. You can use PyArrow to handle the file IO (reading complex formats) and DuckDB to handle the logic.
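
A sketch of that split, assuming a partitioned folder of Parquet files like the my_big_dataset created earlier (before its cleanup step): PyArrow discovers and decodes the files, DuckDB runs the query.

import duckdb
import pyarrow.dataset as ds

# PyArrow handles the file IO: discovering and decoding the Parquet files
events = ds.dataset('my_big_dataset', format='parquet')

# DuckDB handles the logic: it can query the Arrow Dataset directly,
# streaming record batches instead of loading everything into RAM first
summary = duckdb.sql("""
    SELECT year, COUNT(*) AS n_rows
    FROM events
    GROUP BY year
    ORDER BY year
""").arrow()

print(summary)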

Why use DuckDB + PyArrow instead of just Pandas?

  1. SQL Syntax: Many transformations (joins, window functions, CTEs) are much easier and more readable in SQL than in Pandas syntax (see the sketch after this list).

  2. Speed: DuckDB's vectorized execution engine is often faster than Pandas for aggregations (groupby).

  3. Larger-than-Memory: Just like the PyArrow Dataset API, DuckDB can process data that doesn't fit in RAM by spilling to disk efficiently, whereas standard Pandas would crash.
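
As a sketch of the SQL-syntax point: a per-region running total is a single window function in SQL (the sales table below is placeholder data).

import duckdb
import pyarrow as pa

sales = pa.Table.from_pydict({
    'region':  ['east', 'east', 'west', 'west'],
    'month':   [1, 2, 1, 2],
    'revenue': [100, 150, 80, 120],
})

# A running total per region: one window function in SQL,
# versus groupby/cumsum/index juggling in Pandas.
result = duckdb.sql("""
    SELECT region, month, revenue,
           SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
""").arrow()

print(result)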


What is Arrow-ADBC?

ADBC (Arrow Database Connectivity) is a new standard API for connecting to databases. Think of it as the modern, high-performance successor to ODBC and JDBC.

The Problem with ODBC/JDBC

Traditional database drivers (such as those behind Python's psycopg2 or pyodbc) follow API designs from decades ago. They retrieve data row by row.

  1. Database: Executes query (Columnar format).

  2. Conversion 1: Converts to Row format for network transfer.

  3. Network: Sends rows.

  4. Client (Python): Receives rows.

  5. Conversion 2: Python converts rows back into Columns (for Pandas/Arrow).

This double conversion is slow and wastes CPU.

The ADBC Solution

ADBC allows databases to send data in the Arrow format directly.

  1. Database: Executes query.

  2. Network: Sends Arrow data (Zero-Copy / Zero-Conversion).

  3. Client (Python): Receives Arrow data ready for analysis.

Example Use Case

If you use ADBC to query a PostgreSQL or SQLite database, you get a PyArrow Table back directly. There is no per-row loop converting values into Python tuples and then back into columns.
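
A minimal sketch with the SQLite driver, assuming the adbc-driver-sqlite package is installed (the users table is placeholder data):

import adbc_driver_sqlite.dbapi

# ADBC drivers expose the familiar DBAPI-style interface...
with adbc_driver_sqlite.dbapi.connect(":memory:") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        cur.execute("INSERT INTO users VALUES (1, 'Ada'), (2, 'Bob')")
        conn.commit()

        cur.execute("SELECT id, name FROM users")
        # ...but the result comes back as a PyArrow Table,
        # with no row-by-row conversion into Python objects.
        table = cur.fetch_arrow_table()
        print(table)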

