Working with Filesystems, Cloud Storage, and the Dataset API


File Formats supported by PyArrow

PyArrow supports multiple file formats:

  • Parquet - Columnar format for analytics (compression, encoding, metadata)

  • Arrow IPC/Feather - Arrow's native format (fastest, zero-copy)

  • CSV - Plain-text format, read and written via pyarrow.csv

  • JSON - Text format (line-delimited JSON, read via pyarrow.json)

  • ORC - Another columnar format, common in the Hadoop/Hive ecosystem
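
As a minimal sketch (the file names and the tiny table are placeholders), here is how the same table round-trips through the two binary formats:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather

# A tiny stand-in table
table = pa.table({"id": [1, 2, 3], "color": ["red", "green", "blue"]})

# Parquet: compressed, columnar, rich metadata
pq.write_table(table, "example.parquet")

# Feather / Arrow IPC: Arrow's native on-disk layout
feather.write_feather(table, "example.feather")

# Both read back into an ordinary Arrow Table
round_trip = pq.read_table("example.parquet")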


PyArrow provides a unified FileSystem API (pyarrow.fs) that allows you to interact with local storage, S3, and GCS using the same methods. This abstraction is incredibly powerful because it means your data-loading code doesn't have to change if you move your data from a local folder to a cloud bucket.
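
To illustrate the idea (the bucket name and region below are placeholders), the same helper works unchanged against local disk or S3:

from pyarrow import fs

def list_files(filesystem, base_path):
    # The caller does not care whether this is local disk, S3, or GCS
    selector = fs.FileSelector(base_path, recursive=True)
    return [info.path for info in filesystem.get_file_info(selector)]

# Same code, different backends
local_paths = list_files(fs.LocalFileSystem(), "data/raw_taxis")
s3_paths = list_files(fs.S3FileSystem(region="us-east-1"), "my-bucket/raw_taxis")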


1. Local Filesystem

The LocalFileSystem is the simplest implementation. It interacts with your machine's native storage.

from pyarrow import fs

# Initialize
local = fs.LocalFileSystem()

# List files in a directory
file_selector = fs.FileSelector("data/raw_taxis", recursive=True)
files = local.get_file_info(file_selector)

for file in files:
    print(f"Path: {file.path}, Size: {file.size} bytes")

# Write a simple buffer to a file
with local.open_output_stream("data/example.txt") as stream:
    stream.write(b"Hello from PyArrow LocalFS")

2. Amazon S3

For S3, PyArrow relies on the AWS SDK for C++ under the hood. It can automatically pick up your credentials from ~/.aws/credentials or from environment variables.
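
A minimal sketch, assuming your credentials are already configured; the bucket name and region are placeholders:

from pyarrow import fs

# Credentials are resolved from the environment, ~/.aws/credentials,
# or instance metadata; only the region is given explicitly here
s3 = fs.S3FileSystem(region="us-east-1")

# List everything under a bucket prefix
selector = fs.FileSelector("my-bucket/raw_taxis", recursive=True)
for info in s3.get_file_info(selector):
    print(info.path, info.size)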


3. Google Cloud Storage (GCS)

The GcsFileSystem works similarly and typically looks for the GOOGLE_APPLICATION_CREDENTIALS environment variable.
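
A minimal sketch along the same lines; the bucket and object names are placeholders, and credentials are assumed to come from GOOGLE_APPLICATION_CREDENTIALS:

from pyarrow import fs

gcs = fs.GcsFileSystem()

# Stream a single object without downloading it to disk first
with gcs.open_input_file("my-bucket/data/example.parquet") as f:
    header = f.read(4)  # Parquet files start with the magic bytes b"PAR1"
    print(header)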


4. The "Shortcut" (Automatic URI Resolution)

One of the best features of PyArrow is that you don't always have to manually create the filesystem object. You can let Arrow infer it from the URI.
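
For example (the bucket and object paths are placeholders), you can resolve the filesystem explicitly or just hand a URI to a reader:

import pyarrow.parquet as pq
from pyarrow import fs

# Explicit: from_uri returns the filesystem plus the path inside it
filesystem, path = fs.FileSystem.from_uri("s3://my-bucket/raw_taxis/")

# Implicit: many readers and writers accept the URI directly
table = pq.read_table("s3://my-bucket/raw_taxis/trips.parquet")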


Comparison Table: Common Methods

  • get_file_info(path) - Returns metadata (size, type, mtime) for a path or list of paths.

  • create_dir(path) - Creates a directory (and parents if recursive=True).

  • delete_file(path) - Deletes a single file.

  • open_input_file(path) - Opens a file for random-access reading.

  • open_output_stream(path) - Opens a file for sequential writing.

  • move(src, dest) - Renames/moves a file within the same filesystem.

Pro-Tip: SubTreeFileSystem

If you find yourself constantly typing my-bucket/very/long/path/to/data/..., you can create a SubTreeFileSystem. This "locks" the filesystem object to a specific subdirectory so you can use shorter relative paths.
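
A minimal sketch, reusing the long path from above (the bucket and file names are placeholders):

from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# Every path below is now relative to this prefix
data_fs = fs.SubTreeFileSystem("my-bucket/very/long/path/to/data", s3)

# Resolves to my-bucket/very/long/path/to/data/2026/trips.parquet
info = data_fs.get_file_info("2026/trips.parquet")
print(info.size)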


Reading Thousands of Files (The Dataset API)

The most common next step after working with individual files is using these filesystems to read a massive directory of files as if it were one single table.
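
A minimal sketch, assuming a hive-partitioned directory of Parquet files; the path and the column names (year, month, fare_amount, trip_distance) are hypothetical:

import pyarrow.dataset as ds

# One logical dataset spanning every file under the directory
dataset = ds.dataset("data/raw_taxis", format="parquet", partitioning="hive")

# Partition folders (year=2026/month=2/) and column statistics let Arrow
# skip files and row groups that cannot match the filter
table = dataset.to_table(
    filter=(ds.field("year") == 2026) & (ds.field("month") == 2),
    columns=["fare_amount", "trip_distance"],
)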


Writing partitioned data to GCS/S3 buckets

Writing partitioned datasets back to cloud storage is one of the most common tasks in data engineering. Instead of writing one massive, unmanageable file, you break it down into a folder structure (e.g., year=2026/month=02/).

This allows tools like Athena, DuckDB, or even PyArrow itself to "skip" entire folders when querying, saving you time and money.


Writing Partitioned Datasets

To do this, we use pyarrow.dataset.write_dataset. It handles the filesystem connection, the directory creation, and the file naming for you.
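
A minimal sketch, assuming a GCS bucket you can write to (the bucket name and the in-memory table are placeholders); each option is explained in the table below:

from pyarrow import fs
import pyarrow as pa
import pyarrow.dataset as ds

gcs = fs.GcsFileSystem()

# Stand-in for real trip data
table = pa.table({
    "year": [2026, 2026, 2026],
    "month": [1, 2, 2],
    "fare_amount": [12.5, 8.0, 30.25],
})

ds.write_dataset(
    table,
    "my-bucket/taxis",                         # destination bucket/prefix (placeholder)
    filesystem=gcs,
    format="parquet",
    partitioning=["year", "month"],            # creates year=2026/month=2/ folders
    partitioning_flavor="hive",
    max_rows_per_file=1024 * 1024,             # cap each file at ~1M rows
    existing_data_behavior="overwrite_or_ignore",
    basename_template="data_{i}.parquet",
)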


Key Options for write_dataset

  • partitioning - A list of column names. The order matters (e.g., year then month).

  • partitioning_flavor - "hive" is the industry standard (e.g., /color=red/).

  • max_rows_per_file - Prevents files from getting too large (e.g., 1024 * 1024).

  • existing_data_behavior - Use "delete_matching" to refresh a partition or "overwrite_or_ignore" to add to it.

  • basename_template - Allows you to customize the filename (e.g., data_{i}.parquet).


The "Pro" Pattern: Reading + Filtering + Writing

Here is how you combine everything covered so far: streaming data from one location, filtering it, and writing it to another, all without loading the whole thing into RAM at once.

By using a Scanner, you put the record batch and zero-copy principles discussed earlier to work: the scanner moves batches of data through your CPU and back out to S3 while keeping your RAM usage low and constant, whether your dataset is 1 GB or 1 TB.
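
A minimal sketch of that pipeline; the bucket names, paths, and column names are placeholders:

import pyarrow.dataset as ds

# Source: a partitioned dataset living in S3
source = ds.dataset(
    "s3://my-bucket/raw_taxis/", format="parquet", partitioning="hive"
)

# Scanner: lazily yields record batches that match the filter
scanner = source.scanner(
    filter=ds.field("fare_amount") > 0,
    columns=["year", "month", "trip_distance", "fare_amount"],
)

# Sink: stream the filtered batches straight back out, repartitioned,
# without ever materializing the full table in memory
ds.write_dataset(
    scanner,
    "s3://my-bucket/clean_taxis/",
    format="parquet",
    partitioning=["year", "month"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",
)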


