Working with filesystems, cloud, dataset API
1. Local Filesystem
from pyarrow import fs
# Initialize
local = fs.LocalFileSystem()
# List files in a directory
file_selector = fs.FileSelector("data/raw_taxis", recursive=True)
files = local.get_file_info(file_selector)
for file in files:
    print(f"Path: {file.path}, Size: {file.size} bytes")
# Write a simple buffer to a file
with local.open_output_stream("data/example.txt") as stream:
    stream.write(b"Hello from PyArrow LocalFS")
2. Amazon S3
3. Google Cloud Storage (GCS)
4. The "Shortcut" (Automatic URI Resolution)
Comparison Table: Common Methods
Pro-Tip: SubTreeFileSystem
Reading Thousands of Files (The Dataset API)
Writing partitioned data to GCS/S3 buckets
Writing Partitioned Datasets
Key Options for write_dataset
The "Pro" Pattern: Reading + Filtering + Writing