Streaming
In Apache Arrow, Streaming is the process of moving data one RecordBatch at a time rather than as one giant blob. It is the concept that prevents your computer from running out of memory (OOM) when you are processing datasets that are larger than your available RAM.
Which concept does it relate to?
Streaming is the direct implementation of the RecordBatch concept.
While a Table is a collection of all your data (the whole movie), a Stream is a sequence of RecordBatches (the individual scenes) being sent through a pipe.
The Two Pillars of Arrow Streaming
There are two main ways "Streaming" shows up in the Arrow ecosystem:
1. The IPC (Inter-Process Communication) Stream
This is for moving data between different languages or processes (like sending data from a Python script to a Go service).
How it works: You use an
ArrowStreamWriter. It writes a schema followed by a sequence of RecordBatches.The Benefit: The receiver can start working as soon as the first batch arrives. They don't have to wait for the whole "file" to be finished.
2. The Dataset Scanner (Internal Streaming)
When you use the Dataset API (which we just talked about), you are using streaming under the hood.
The "Scanner": When you call
dataset.scanner().to_batches(), Arrow is essentially "streaming" data from your disk/S3 into your CPU.The Benefit: You can process a 1TB dataset on a laptop with 16GB of RAM because Arrow only ever keeps a few batches in memory at any given time.
Why Streaming is the "Killer App" for Data Engineering
Before Arrow, if you wanted to move data from Spark to Python, you usually had to:
Serialize the data to a temporary format (like CSV or JSON).
Wait for the whole file to write.
Wait for Python to read and decompress the whole file.
With Arrow Streaming, the moment Spark produces the first 10,000 rows (a RecordBatch), those bytes are immediately available for Python to read. Because of Zero-Copy, Python doesn't even have to "load" them—it just looks at the memory address Spark just filled.
Code Example: Manual Streaming in Python
Here is how you manually create a stream. This is useful if you want to push data over a network socket or to a named pipe.
Last updated