About Arrow IPC/Feather format


Apache Feather

Arrow is the "bridge" between the storage world (Parquet/ORC) and the compute world (CPU/RAM).

Feather (specifically Feather V2) is essentially the Arrow IPC file format saved to disk (IPC stands for Inter-Process Communication). When you save a RecordBatch as an uncompressed Feather file, you are writing Arrow's in-memory buffers to disk byte for byte, wrapped in a little metadata and a small footer.

The "magic" of Arrow IPC is that its data buffers use the same layout as the Arrow in-memory format, so a reader can use them without a deserialization step.


Why use Feather instead of Parquet?

If Parquet is a "Suitcase" (packed tight for travel), Feather is a "Wardrobe" (everything is already on hangers).

| Feature | Parquet | Feather (Arrow IPC) |
|---|---|---|
| Speed | Fast (but needs decompression/decoding) | Blazing (zero-copy / memory-mapping) |
| CPU Usage | High (decompressing data) | Near zero (data is ready to use) |
| File Size | Small (highly compressed) | Larger (uncompressed or lightly compressed) |
| Portability | Universal (Spark, Presto, Snowflake) | Optimized for Python/R/Go/C++ |

The "Magic" of Memory Mapping

The coolest thing about Feather is Memory Mapping (mmap).

When you "read" an uncompressed Feather file with memory mapping enabled, the operating system doesn't actually load the whole file into your RAM. Instead, it creates a mapping. Your program thinks the data is in memory, and the OS only pulls the needed pages from disk at the moment your CPU touches them. A compressed file still has to be decompressed into RAM, so it loses this benefit.

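import pyarrow as pa
import pyarrow.feather as feather

# Example table (placeholder data)
table = pa.table({'id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})

# Writing is straightforward. Compression is turned off here because
# write_feather defaults to LZ4 when available, which prevents zero-copy reads.
feather.write_feather(table, 'data.feather', compression='uncompressed')

# Reading with memory mapping
# For an uncompressed file this is nearly instant, even for a 10GB file
read_table = feather.read_table('data.feather', memory_map=True)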

When to use which?

  • Use Parquet for long-term storage, data lakes (S3/GCS), and sharing data with other teams or tools like Spark and Hive. It saves you money on storage costs.

  • Use Feather/IPC for short-term "warm" data, passing data between microservices, or local caching. If you have a Python script that processes data and a Go service that needs to read it immediately after, Feather/IPC is the king of speed.

A Quick Note on "V1" vs "V2"

You might see old tutorials talking about Feather being limited. That was V1.

Feather V2 (the current version) is fully based on the Apache Arrow IPC file format. It supports all Arrow data types, including nested types such as Struct arrays, and even supports compression (LZ4 or ZSTD) if you want a middle ground between speed and size.


Where does this fit in "Streaming" knowledge?

Since Feather is just the IPC format in a file, you can actually convert a stream directly into a Feather file. They are two sides of the same coin:

  • IPC Stream: For sending batches over a network/pipe.

  • Feather File: For "freezing" those same batches onto a disk.

