Dataset API
The Core Concept: Scanning vs. Loading
import pyarrow.dataset as ds
# 1. Define the dataset (discovers files and infers the schema; no row data is read yet)
dataset = ds.dataset("s3://my-bucket/logs/", format="parquet", partitioning="hive")
# 2. Inspect the logical schema
print(dataset.schema)
# 3. See the physical files Arrow found
print(dataset.files[:5])
Predicate Pushdown (The Speed Secret)
Dealing with Multiple Formats
The "Scanner": The Engine Under the Hood
Summary of Why the Dataset API Wins
When to use it?
How the Dataset API Handles Schema Evolution (Files with Different Columns)
Unified Reading
Handling "Incompatible" Types
Dealing with "Garbage" Columns
Inspecting Fragments
Summary of Schema Evolution Strategies