Dataset API

The Dataset API is the powerhouse of PyArrow. While a Table is just data sitting in your RAM, a Dataset is a logical view of data that might be spread across thousands of files on S3, GCS, or your local disk.

It allows you to treat a massive, messy directory of files as a single, clean table, and it uses Lazy Loading to ensure you don't crash your machine.


The Core Concept: Scanning vs. Loading

When you use pq.read_table(), you are loading everything into memory. When you use ds.dataset(), you are just inspecting the metadata. Arrow looks at the file headers and the folder names to understand the schema without reading the actual data rows yet.

import pyarrow.dataset as ds

# 1. Define the dataset (fast even for 10,000 files: no data rows are read yet)
dataset = ds.dataset("s3://my-bucket/logs/", format="parquet", partitioning="hive")

# 2. Inspect the logical schema
print(dataset.schema) 

# 3. See the physical files Arrow found
print(dataset.files[:5])

Predicate Pushdown (The Speed Secret)

This is the "killer feature" of the Dataset API. If your data is partitioned by year and month, and you filter for year == 2026, Arrow is smart enough to never even open the folders for 2024 or 2025. This is called Partition Pruning.


Dealing with Multiple Formats

The Dataset API is format-agnostic. You can even create a dataset that combines different formats (though it's rare in production).
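
As a sketch, one way to do this is to build one dataset per format and union them; the directory names here are hypothetical, and the two schemas must be unifiable:

import pyarrow.dataset as ds

# One part of the data lake stored as Parquet, another as CSV
parquet_part = ds.dataset("data/events_parquet/", format="parquet")
csv_part = ds.dataset("data/events_csv/", format="csv")

# Combine both into a single logical dataset (a UnionDataset)
combined = ds.dataset([parquet_part, csv_part])
print(combined.schema)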


The "Scanner": The Engine Under the Hood

If you want the most control, you create a Scanner. This is the object that actually orchestrates the multi-threaded reading of the data.
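
A minimal sketch of building a scanner by hand; the column names and the process() callback are hypothetical:

import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-bucket/logs/", format="parquet", partitioning="hive")

scanner = dataset.scanner(
    columns=["user_id", "event"],       # projection: read only these columns
    filter=ds.field("year") == 2026,    # predicate pushdown
    batch_size=64_000,                  # rows per RecordBatch
)

# Stream the data batch by batch instead of materializing one giant Table
for batch in scanner.to_batches():
    process(batch)  # hypothetical downstream function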


Summary of Why the Dataset API Wins

  1. Memory Efficiency: You only load exactly the columns and rows you need.

  2. Schema Unification: If file A has an extra column that file B doesn't, Arrow can "unify" them by filling the missing values with null.

  3. Discovery: It automatically finds files in nested subdirectories and interprets Hive-style partitions (/year=2026/) as actual data columns.

When to use it?

  • Use Table if you have a single file that fits comfortably in your RAM.

  • Use Dataset if you are working with a data lake (S3/GCS) or any directory where data is split across multiple files.


How the Dataset API Handles Schema Evolution (Files with Different Columns)

In a real-world data lake, your schema is rarely perfect. Over time, your data pipeline might add new columns or change data types. The Dataset API handles this through Schema Unification, allowing you to read files with different structures as one consistent view.

Unified Reading

By default, the Dataset API will attempt to reconcile differences. If file_1 has columns A and B, but file_2 has columns A and C, Arrow can expose a unified schema with columns A, B, and C, and it simply fills the missing values with null for the files that don't have them. Note that schema discovery may only inspect the first file it finds, so when columns vary between files it is safest to pass a unified schema explicitly, as sketched below.
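
A minimal sketch using hypothetical local files (the data/ directory is assumed to exist); the schemas are unified explicitly so the result does not depend on which file the factory happens to inspect:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two files with overlapping but different columns (hypothetical data)
pq.write_table(pa.table({"A": [1, 2], "B": ["x", "y"]}), "data/file_1.parquet")
pq.write_table(pa.table({"A": [3, 4], "C": [0.1, 0.2]}), "data/file_2.parquet")

# Build the union of the two physical schemas: A, B, and C
unified = pa.unify_schemas([
    pq.read_schema("data/file_1.parquet"),
    pq.read_schema("data/file_2.parquet"),
])

dataset = ds.dataset("data/", format="parquet", schema=unified)

# Files that lack a column return null for it
print(dataset.to_table())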


Handling "Incompatible" Types

Sometimes, unification isn't enough. If file_2024 stored user_id as an Integer, but file_2025 changed it to a String, Arrow might throw a "Type Mismatch" error.

To solve this, you can provide an Explicit Schema. This tells Arrow: "Regardless of what is on disk, I want you to cast the data into this format as you read it."
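
A minimal sketch of forcing a schema on read; the path and column names are assumptions, and the cast (integer to string here) must be one Arrow's compute layer can perform during the scan. If your PyArrow version refuses it, read the column as stored and cast after loading instead.

import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical target schema: always expose user_id as a string
explicit_schema = pa.schema([
    ("user_id", pa.string()),
    ("amount", pa.float64()),
])

dataset = ds.dataset("data/payments/", format="parquet", schema=explicit_schema)

# Fragments whose physical type differs are cast to this schema as they are read
table = dataset.to_table()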


Dealing with "Garbage" Columns

If your dataset contains 100 columns but you only care about three, you don't need to worry about the schema of the other 97. By specifying columns in your to_table or scanner call, you ignore any schema inconsistencies in the unused columns.
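
A minimal sketch of column projection; the path and column names are hypothetical:

import pyarrow.dataset as ds

dataset = ds.dataset("data/wide_table/", format="parquet")

# Only these three columns are read and reconciled; schema drift in the
# remaining columns is never touched.
table = dataset.to_table(columns=["user_id", "event", "timestamp"])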


Inspecting Fragments

If you are debugging why a dataset is failing, you can look at individual Fragments. A Fragment is essentially a single file within the dataset. You can inspect the physical schema of each specific file to find the "offender."
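
A minimal sketch of walking the fragments to find the file whose physical schema disagrees with the rest (the path is hypothetical):

import pyarrow.dataset as ds

dataset = ds.dataset("data/events/", format="parquet")

for fragment in dataset.get_fragments():
    # physical_schema is what is actually stored in that one file,
    # which may differ from the dataset-level dataset.schema
    print(fragment.path)
    print(fragment.physical_schema)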

Summary of Schema Evolution Strategies

  • Default: Let Arrow merge columns (fills missing with null).

  • Strict: Provide an explicit pa.schema to force types (Casting).

  • Lazy: Only select the columns you need via columns=[...] to ignore noise.

