Dataset API

The Dataset API is the powerhouse of PyArrow. While a Table is just data sitting in your RAM, a Dataset is a logical view of data that might be spread across thousands of files on S3, GCS, or your local disk.

It allows you to treat a massive, messy directory of files as a single, clean table, and it uses Lazy Loading to ensure you don't crash your machine.


The Core Concept: Scanning vs. Loading

When you use pq.read_table(), you are loading everything into memory. When you use ds.dataset(), you are just inspecting the metadata. Arrow looks at the file headers and the folder names to understand the schema without reading the actual data rows yet.

import pyarrow.dataset as ds

# 1. Define the dataset (fast even for 10,000 files: no data rows are read yet)
dataset = ds.dataset("s3://my-bucket/logs/", format="parquet", partitioning="hive")

# 2. Inspect the logical schema
print(dataset.schema) 

# 3. See the physical files Arrow found
print(dataset.files[:5])

Predicate Pushdown (The Speed Secret)

This is the "killer feature" of the Dataset API. If your data is partitioned by year and month, and you filter for year == 2026, Arrow is smart enough to never even open the folders for 2024 or 2025. This is called Partition Pruning.


Dealing with Multiple Formats

The Dataset API is format-agnostic. You can even create a dataset that combines different formats (though it's rare in production).
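
As a sketch, one way to do this is to build one dataset per format and union them; the directory names here are hypothetical, and the two schemas must be unifiable:

import pyarrow.dataset as ds

# One part of the data lake stored as Parquet, another as CSV
parquet_part = ds.dataset("data/events_parquet/", format="parquet")
csv_part = ds.dataset("data/events_csv/", format="csv")

# Combine both into a single logical dataset (a UnionDataset)
combined = ds.dataset([parquet_part, csv_part])
print(combined.schema)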


The "Scanner": The Engine Under the Hood

If you want the most control, you create a Scanner. This is the object that actually orchestrates the multi-threaded reading of the data.
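
A minimal sketch of building a scanner by hand; the column names and the process() callback are hypothetical:

import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-bucket/logs/", format="parquet", partitioning="hive")

scanner = dataset.scanner(
    columns=["user_id", "event"],       # projection: read only these columns
    filter=ds.field("year") == 2026,    # predicate pushdown
    batch_size=64_000,                  # rows per RecordBatch
)

# Stream the data batch by batch instead of materializing one giant Table
for batch in scanner.to_batches():
    process(batch)  # hypothetical downstream function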


Summary of Why the Dataset API Wins

  1. Memory Efficiency: You only load exactly the columns and rows you need.

  2. Schema Unification: If file A has an extra column that file B doesn't, Arrow can "unify" them by filling the missing values with null.

  3. Discovery: It automatically finds files in nested subdirectories and interprets Hive-style partitions (/year=2026/) as actual data columns.

When to use it?

  • Use Table if you have a single file that fits comfortably in your RAM.

  • Use Dataset if you are working with a data lake (S3/GCS) or any directory where data is split across multiple files.


How the Dataset API Handles Schema Evolution (Files with Different Columns)

In a real-world data lake, your schema is rarely perfect. Over time, your data pipeline might add new columns or change data types. The Dataset API handles this through Schema Unification, allowing you to read files with different structures as one consistent view.

Unified Reading

By default, the Dataset API will attempt to reconcile differences. If file_1 has columns A and B, but file_2 has columns A and C, Arrow can expose a unified schema with columns A, B, and C, and it simply fills the missing values with null for the files that don't have them. Note that schema discovery may only inspect the first file it finds, so when columns vary between files it is safest to pass a unified schema explicitly, as sketched below.
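
A minimal sketch using hypothetical local files (the data/ directory is assumed to exist); the schemas are unified explicitly so the result does not depend on which file the factory happens to inspect:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two files with overlapping but different columns (hypothetical data)
pq.write_table(pa.table({"A": [1, 2], "B": ["x", "y"]}), "data/file_1.parquet")
pq.write_table(pa.table({"A": [3, 4], "C": [0.1, 0.2]}), "data/file_2.parquet")

# Build the union of the two physical schemas: A, B, and C
unified = pa.unify_schemas([
    pq.read_schema("data/file_1.parquet"),
    pq.read_schema("data/file_2.parquet"),
])

dataset = ds.dataset("data/", format="parquet", schema=unified)

# Files that lack a column return null for it
print(dataset.to_table())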


Handling "Incompatible" Types

Sometimes, unification isn't enough. If file_2024 stored user_id as an Integer, but file_2025 changed it to a String, Arrow might throw a "Type Mismatch" error.

To solve this, you can provide an Explicit Schema. This tells Arrow: "Regardless of what is on disk, I want you to cast the data into this format as you read it."
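
A minimal sketch of forcing a schema on read; the path and column names are assumptions, and the cast (integer to string here) must be one Arrow's compute layer can perform during the scan. If your PyArrow version refuses it, read the column as stored and cast after loading instead.

import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical target schema: always expose user_id as a string
explicit_schema = pa.schema([
    ("user_id", pa.string()),
    ("amount", pa.float64()),
])

dataset = ds.dataset("data/payments/", format="parquet", schema=explicit_schema)

# Fragments whose physical type differs are cast to this schema as they are read
table = dataset.to_table()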


Dealing with "Garbage" Columns

If your dataset contains 100 columns but you only care about three, you don't need to worry about the schema of the other 97. By specifying columns in your to_table or scanner call, you ignore any schema inconsistencies in the unused columns.
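
A minimal sketch of column projection; the path and column names are hypothetical:

import pyarrow.dataset as ds

dataset = ds.dataset("data/wide_table/", format="parquet")

# Only these three columns are read and reconciled; schema drift in the
# remaining columns is never touched.
table = dataset.to_table(columns=["user_id", "event", "timestamp"])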


Inspecting Fragments

If you are debugging why a dataset is failing, you can look at individual Fragments. A Fragment is essentially a single file within the dataset. You can inspect the physical schema of each specific file to find the "offender."
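
A minimal sketch of walking the fragments to find the file whose physical schema disagrees with the rest (the path is hypothetical):

import pyarrow.dataset as ds

dataset = ds.dataset("data/events/", format="parquet")

for fragment in dataset.get_fragments():
    # physical_schema is what is actually stored in that one file,
    # which may differ from the dataset-level dataset.schema
    print(fragment.path)
    print(fragment.physical_schema)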

Summary of Schema Evolution Strategies

  • Default: Let Arrow merge columns (fills missing with null).

  • Strict: Provide an explicit pa.schema to force types (Casting).

  • Lazy: Only select the columns you need via columns=[...] to ignore noise.

