Spark Structured Streaming
Processing in micro-batches
This is the modern way to handle real-time data in Spark (and the recommended streaming API as of version 4.0).
The genius of Structured Streaming is that it treats a stream of data exactly like a static table, except that the table never stops growing.
1. The Core Concept: The Unbounded Table
In the old days (the legacy DStream API, "Spark Streaming"), you had to think directly in "micro-batches" (e.g., 5-second chunks). You wrote code differently for streams than for batch jobs.
In Structured Streaming, you write the exact same code as a standard batch query.
Batch: "Calculate the average salary from this static file."
Streaming: "Calculate the average salary from this data... and keep updating the answer as new data arrives." The sketch below shows how little the code changes.
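Here is a minimal PySpark sketch of that symmetry, assuming a hypothetical Parquet dataset at /data/salaries with department and salary columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: a one-shot answer over a static file.
batch_df = spark.read.parquet("/data/salaries")
batch_avg = batch_df.groupBy("department").agg(F.avg("salary"))

# Streaming: identical logic; only read -> readStream changes
# (streaming file sources also need an explicit schema).
stream_df = spark.readStream.schema(batch_df.schema).parquet("/data/salaries")
stream_avg = stream_df.groupBy("department").agg(F.avg("salary"))
```

The transformation line is character-for-character the same; Spark takes care of running it incrementally.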
2. The Programming Model
To write a streaming job, you just need to understand four new concepts that control how the data flows.
A. Input (Source)
Where is the data coming from? (Each source is sketched after this list.)
File Source: Reads files from a folder as they appear (JSON, CSV, Parquet).
Kafka Source: Reads messages from Apache Kafka (most common in production).
Socket Source: Reads text from a network port (good for testing).
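A sketch of each source, reusing the spark session from above; the paths, broker address, and topic name are hypothetical:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Streaming file sources require the schema declared up front.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

# File source: picks up new files as they land in the folder.
files = spark.readStream.schema(schema).json("/tmp/landing")

# Kafka source: needs the spark-sql-kafka connector on the classpath.
kafka = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
         .option("subscribe", "events")                     # hypothetical topic
         .load())

# Socket source: raw text lines from a port (testing only).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
```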
B. Output (Sink)
Where does the answer go? (Examples follow the list.)
File Sink: Writes results to storage (S3, HDFS).
Kafka Sink: Pushes results back to another Kafka topic.
Console Sink: Prints to the screen (debugging only).
Memory Sink: Stores in RAM (debugging only).
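Sketches of the three production-relevant sinks, assuming a hypothetical result_df with a device_id column; paths and topic names are made up. Note that durable sinks require a checkpointLocation:

```python
# Console sink: print each micro-batch to stdout (debugging only).
debug = result_df.writeStream.format("console").start()

# File sink: append Parquet files to storage; checkpointing is mandatory.
to_files = (result_df.writeStream
            .format("parquet")
            .option("path", "/tmp/output")
            .option("checkpointLocation", "/tmp/checkpoints/files")
            .start())

# Kafka sink: expects a string/binary 'value' column (and optionally 'key').
to_kafka = (result_df.selectExpr("CAST(device_id AS STRING) AS value")
            .writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("topic", "results")                     # hypothetical topic
            .option("checkpointLocation", "/tmp/checkpoints/kafka")
            .start())
```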
C. Output Modes (The "Update Strategy")
Since the data is infinite, how do you want the results delivered? (See the example after this list.)
Append Mode (Default): Only add new rows to the output. (Good for simple inserts).
Complete Mode: Rewrite the entire result table every time. (Good for aggregations, like "Total Count").
Update Mode: Only output the rows that changed since the last trigger.
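A minimal sketch, reusing the stream_df from the earlier example (device_id is a hypothetical column). Complete mode only makes sense when the result is an aggregation:

```python
# A running count per device -- an aggregation, so "complete" or "update" apply.
counts = stream_df.groupBy("device_id").count()

query = (counts.writeStream
         .outputMode("complete")  # or "update"; plain "append" needs a watermark here
         .format("console")
         .start())
```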
D. Triggers (The "Pacing")
Default (Micro-batch): Process the next batch as soon as the previous one finishes.
Fixed Interval: "Process data every 30 seconds."
AvailableNow (Spark 3.3+): "Process all available data right now, then stop." (Great for periodic batch jobs that use streaming logic).
Continuous (Experimental): Millisecond latency.
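Each trigger is a single option on writeStream. A sketch of the two most common ones, reusing the counts aggregate from above (the checkpoint path is made up):

```python
# Fixed interval: kick off a micro-batch every 30 seconds.
every_30s = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .trigger(processingTime="30 seconds")
             .start())

# AvailableNow: drain all currently available data, then stop (Spark 3.3+).
drain = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/drain")
         .trigger(availableNow=True)
         .start())
```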
3. Hands-On: A "Hello World" Stream
This example simulates a stream by reading files as you drop them into a folder.
Scenario: You are monitoring a folder /tmp/input for new CSV files containing device_id and temperature.
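A runnable sketch of that scenario; the temperature threshold and the header option are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("HelloStream").getOrCreate()

# Streaming file sources need the schema declared up front.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

# Source: pick up each new CSV file dropped into /tmp/input.
readings = (spark.readStream
            .schema(schema)
            .option("header", "true")   # assumes files have a header row
            .csv("/tmp/input"))

# Transformation: plain batch-style DataFrame code.
hot = readings.filter(readings.temperature > 30.0)

# Sink: print every micro-batch to the console.
query = (hot.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()
```

Drop a CSV file into /tmp/input while this runs and the filtered rows appear in the console on the next micro-batch.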
4. Critical Concept: Watermarking
The Problem: Data often arrives late. A sensor might lose connection and send data from 10:00 AM at 10:05 AM.
The Solution: You tell Spark, "Wait for late data, but only for 10 minutes."
Code:

```python
.withWatermark("timestamp", "10 minutes")
```

Result: Spark keeps the "state" (memory) for each time window open. Once the watermark (the latest event time seen, minus 10 minutes) passes a window's end, Spark finalizes that window, drops its state to free up RAM, and ignores any data for it that arrives even later.
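In context, a watermark is attached to an event-time aggregation. A sketch assuming the readings stream from the hands-on example also carries an event-time column named timestamp (not one of the scenario's two columns):

```python
from pyspark.sql import functions as F

# Average temperature per device over 5-minute windows,
# tolerating data up to 10 minutes late.
windowed = (readings
            .withWatermark("timestamp", "10 minutes")
            .groupBy(F.window("timestamp", "5 minutes"), "device_id")
            .agg(F.avg("temperature").alias("avg_temp")))

query = (windowed.writeStream
         .outputMode("update")   # emit only the windows that changed this trigger
         .format("console")
         .start())
```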
Summary: Structured Streaming is just Spark SQL running on an infinite loop.