Working with Filesystems, Cloud Storage, and the Dataset API


File Formats supported by PyArrow

PyArrow supports multiple file formats:

  • Parquet - Columnar format for analytics (compression, encoding, metadata)

  • Arrow IPC/Feather - Arrow's native format (fastest, zero-copy)

  • CSV - Plain-text format, read and written via pyarrow.csv

  • JSON - Text format (line-delimited JSON, read via pyarrow.json)

  • ORC - Another columnar format, common in the Hadoop/Hive ecosystem
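
As a minimal sketch (the file names and the tiny table are placeholders), here is how the same table round-trips through the two binary formats:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather

# A tiny stand-in table
table = pa.table({"id": [1, 2, 3], "color": ["red", "green", "blue"]})

# Parquet: compressed, columnar, rich metadata
pq.write_table(table, "example.parquet")

# Feather / Arrow IPC: Arrow's native on-disk layout
feather.write_feather(table, "example.feather")

# Both read back into an ordinary Arrow Table
round_trip = pq.read_table("example.parquet")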


PyArrow provides a unified FileSystem API (pyarrow.fs) that allows you to interact with local storage, S3, and GCS using the same methods. This abstraction is incredibly powerful because it means your data-loading code doesn't have to change if you move your data from a local folder to a cloud bucket.
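
To illustrate the idea (the bucket name and region below are placeholders), the same helper works unchanged against local disk or S3:

from pyarrow import fs

def list_files(filesystem, base_path):
    # The caller does not care whether this is local disk, S3, or GCS
    selector = fs.FileSelector(base_path, recursive=True)
    return [info.path for info in filesystem.get_file_info(selector)]

# Same code, different backends
local_paths = list_files(fs.LocalFileSystem(), "data/raw_taxis")
s3_paths = list_files(fs.S3FileSystem(region="us-east-1"), "my-bucket/raw_taxis")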


1. Local Filesystem

The LocalFileSystem is the simplest implementation. It interacts with your machine's native storage.

from pyarrow import fs

# Initialize
local = fs.LocalFileSystem()

# List files in a directory
file_selector = fs.FileSelector("data/raw_taxis", recursive=True)
files = local.get_file_info(file_selector)

for file in files:
    print(f"Path: {file.path}, Size: {file.size} bytes")

# Write a simple buffer to a file
with local.open_output_stream("data/example.txt") as stream:
    stream.write(b"Hello from PyArrow LocalFS")

2. Amazon S3

For S3, PyArrow relies on the AWS SDK for C++ under the hood. It can automatically pick up your credentials from ~/.aws/credentials or from environment variables.
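
A minimal sketch, assuming your credentials are already configured; the bucket name and region are placeholders:

from pyarrow import fs

# Credentials are resolved from the environment, ~/.aws/credentials,
# or instance metadata; only the region is given explicitly here
s3 = fs.S3FileSystem(region="us-east-1")

# List everything under a bucket prefix
selector = fs.FileSelector("my-bucket/raw_taxis", recursive=True)
for info in s3.get_file_info(selector):
    print(info.path, info.size)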


3. Google Cloud Storage (GCS)

The GcsFileSystem works similarly and typically looks for the GOOGLE_APPLICATION_CREDENTIALS environment variable.
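
A minimal sketch along the same lines; the bucket and object names are placeholders, and credentials are assumed to come from GOOGLE_APPLICATION_CREDENTIALS:

from pyarrow import fs

gcs = fs.GcsFileSystem()

# Stream a single object without downloading it to disk first
with gcs.open_input_file("my-bucket/data/example.parquet") as f:
    header = f.read(4)  # Parquet files start with the magic bytes b"PAR1"
    print(header)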


4. The "Shortcut" (Automatic URI Resolution)

One of the best features of PyArrow is that you don't always have to manually create the filesystem object. You can let Arrow infer it from the URI.
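
For example (the bucket and object paths are placeholders), you can resolve the filesystem explicitly or just hand a URI to a reader:

import pyarrow.parquet as pq
from pyarrow import fs

# Explicit: from_uri returns the filesystem plus the path inside it
filesystem, path = fs.FileSystem.from_uri("s3://my-bucket/raw_taxis/")

# Implicit: many readers and writers accept the URI directly
table = pq.read_table("s3://my-bucket/raw_taxis/trips.parquet")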


Comparison Table: Common Methods

  • get_file_info(path) - Returns metadata (size, type, mtime) for a path or list of paths.

  • create_dir(path) - Creates a directory (and parents if recursive=True).

  • delete_file(path) - Deletes a single file.

  • open_input_file(path) - Opens a file for random-access reading.

  • open_output_stream(path) - Opens a file for sequential writing.

  • move(src, dest) - Renames/moves a file within the same filesystem.

Pro-Tip: SubTreeFileSystem

If you find yourself constantly typing my-bucket/very/long/path/to/data/..., you can create a SubTreeFileSystem. This "locks" the filesystem object to a specific subdirectory so you can use shorter relative paths.
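
A minimal sketch, reusing the long path from above (the bucket and file names are placeholders):

from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# Every path below is now relative to this prefix
data_fs = fs.SubTreeFileSystem("my-bucket/very/long/path/to/data", s3)

# Resolves to my-bucket/very/long/path/to/data/2026/trips.parquet
info = data_fs.get_file_info("2026/trips.parquet")
print(info.size)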


Reading Thousands of Files (The Dataset API)

The most common next step after working with individual files is using these filesystems to read a massive directory of files as if it were one single table.
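
A minimal sketch, assuming a hive-partitioned directory of Parquet files; the path and the column names (year, month, fare_amount, trip_distance) are hypothetical:

import pyarrow.dataset as ds

# One logical dataset spanning every file under the directory
dataset = ds.dataset("data/raw_taxis", format="parquet", partitioning="hive")

# Partition folders (year=2026/month=2/) and column statistics let Arrow
# skip files and row groups that cannot match the filter
table = dataset.to_table(
    filter=(ds.field("year") == 2026) & (ds.field("month") == 2),
    columns=["fare_amount", "trip_distance"],
)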


Writing partitioned data to GCS/S3 buckets

Writing partitioned datasets back to cloud storage is one of the most common tasks in data engineering. Instead of writing one massive, unmanageable file, you break it down into a folder structure (e.g., year=2026/month=02/).

This allows tools like Athena, DuckDB, or even PyArrow itself to "skip" entire folders when querying, saving you time and money.


Writing Partitioned Datasets

To do this, we use pyarrow.dataset.write_dataset. It handles the filesystem connection, the directory creation, and the file naming for you.
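
A minimal sketch, assuming a GCS bucket you can write to (the bucket name and the in-memory table are placeholders); each option is explained in the table below:

from pyarrow import fs
import pyarrow as pa
import pyarrow.dataset as ds

gcs = fs.GcsFileSystem()

# Stand-in for real trip data
table = pa.table({
    "year": [2026, 2026, 2026],
    "month": [1, 2, 2],
    "fare_amount": [12.5, 8.0, 30.25],
})

ds.write_dataset(
    table,
    "my-bucket/taxis",                         # destination bucket/prefix (placeholder)
    filesystem=gcs,
    format="parquet",
    partitioning=["year", "month"],            # creates year=2026/month=2/ folders
    partitioning_flavor="hive",
    max_rows_per_file=1024 * 1024,             # cap each file at ~1M rows
    existing_data_behavior="overwrite_or_ignore",
    basename_template="data_{i}.parquet",
)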


Key Options for write_dataset

  • partitioning - A list of column names. The order matters (e.g., year then month).

  • partitioning_flavor - "hive" is the industry standard (e.g., /color=red/).

  • max_rows_per_file - Prevents files from getting too large (e.g., 1024 * 1024).

  • existing_data_behavior - Use "delete_matching" to refresh a partition or "overwrite_or_ignore" to add to it.

  • basename_template - Allows you to customize the filename (e.g., data_{i}.parquet).


The "Pro" Pattern: Reading + Filtering + Writing

Here is how you combine everything covered so far: streaming data from one location, filtering it, and writing it to another, all without loading the whole thing into RAM at once.

By using a Scanner, you put the record batch and zero-copy principles discussed earlier to work: the scanner moves batches of data through your CPU and back out to S3 while keeping your RAM usage low and constant, whether your dataset is 1 GB or 1 TB.
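
A minimal sketch of that pipeline; the bucket names, paths, and column names are placeholders:

import pyarrow.dataset as ds

# Source: a partitioned dataset living in S3
source = ds.dataset(
    "s3://my-bucket/raw_taxis/", format="parquet", partitioning="hive"
)

# Scanner: lazily yields record batches that match the filter
scanner = source.scanner(
    filter=ds.field("fare_amount") > 0,
    columns=["year", "month", "trip_distance", "fare_amount"],
)

# Sink: stream the filtered batches straight back out, repartitioned,
# without ever materializing the full table in memory
ds.write_dataset(
    scanner,
    "s3://my-bucket/clean_taxis/",
    format="parquet",
    partitioning=["year", "month"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",
)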


