Spark Structured Streaming

Processing in micro-batches


This is the modern way to handle real-time data in Spark, and the standard streaming API as of Spark 4.0.

The genius of Structured Streaming is that it treats a stream of data exactly like a static table, except that the table never stops growing.

1. The Core Concept: The Unbounded Table

In the legacy Spark Streaming (DStream) API, you had to think directly in "micro-batches" (e.g., 5-second chunks) and write different code for streams than for batch jobs.

In Structured Streaming, you write the exact same code as a standard batch query. The engine still processes data in micro-batches by default, but it manages the batching for you.

  • Batch: "Calculate the average salary from this static file."

  • Streaming: "Calculate the average salary from this data... and keep updating the answer as new data arrives." (The sketch below shows how similar the code is.)
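A minimal PySpark sketch of that parity; the file path and the `salary` column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("parity-demo").getOrCreate()

# Batch: compute the average salary from a static file, once.
batch_df = spark.read.parquet("/data/salaries")   # hypothetical path
batch_df.agg(avg("salary")).show()

# Streaming: identical logic, but the "table" never stops growing.
# Streaming file sources need an explicit schema, so reuse the batch one.
stream_df = spark.readStream.schema(batch_df.schema).parquet("/data/salaries")

query = (stream_df.agg(avg("salary"))
         .writeStream
         .outputMode("complete")   # re-emit the updated average every batch
         .format("console")
         .start())
```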

2. The Programming Model

To write a streaming job, you just need to understand four new building blocks that control how the data flows.

A. Input (Source)

Where is the data coming from?

  • File Source: Reads files from a folder as they appear (JSON, CSV, Parquet).

  • Kafka Source: Reads messages from Apache Kafka (the most common source in production; sketched after this list).

  • Socket Source: Reads text from a network port (good for testing).
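As a sketch of the Kafka source, reusing the `spark` session from the earlier example (the broker address and topic name are hypothetical):

```python
# Subscribe to a topic; Spark tracks the consumed offsets itself.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-events")
       .load())

# Kafka delivers key/value as binary columns; cast them before parsing.
events = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```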

B. Output (Sink)

Where does the answer go?

  • File Sink: Writes results to storage (S3, HDFS).

  • Kafka Sink: Pushes results back to another Kafka topic (sketched after this list).

  • Console Sink: Prints to the screen (debugging only).

  • Memory Sink: Stores in RAM (debugging only).
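The matching Kafka sink, assuming the `events` DataFrame from the source sketch above (the output topic and checkpoint path are hypothetical). Durable sinks need a checkpoint location so Spark can recover exactly where it left off after a failure:

```python
# The Kafka sink writes the `value` (and optional `key`) column.
query = (events.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "sensor-events-out")
         .option("checkpointLocation", "/tmp/checkpoints/sensor")
         .start())
```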

C. Output Modes (The "Update Strategy")

Since the data is infinite, how do you want the results?

  1. Append Mode (Default): Only add new rows to the output. (Good for queries whose rows never change once written, like filters and simple inserts.)

  2. Complete Mode: Rewrite the entire result table every time. (Good for aggregations, like "Total Count").

  3. Update Mode: Only output the rows that changed since the last trigger. (See the sketch after this list.)
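The choice matters: a plain streaming aggregation cannot use append mode (its rows keep changing), so a running count needs complete or update mode. A sketch, assuming the `events` DataFrame from the Kafka example:

```python
counts = events.groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")   # or "update" to emit only the changed rows
         .format("console")
         .start())
```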

D. Triggers (The "Pacing")

  • Default (Micro-batch): Process the next batch as soon as the previous one finishes.

  • Fixed Interval: "Process data every 30 seconds."

  • AvailableNow (Spark 3.3+): "Process all available data right now, then stop." (Great for periodic batch jobs that use streaming logic).

  • Continuous (Experimental): Millisecond-level latency, limited to simple map-like queries. (Trigger syntax is sketched below.)
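Triggers are configured on the writer. A sketch, assuming the `counts` DataFrame from the previous example:

```python
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="30 seconds")   # fixed-interval micro-batches
         .start())

# Other pacing options (use one trigger per query):
#   .trigger(availableNow=True)       # drain all available data, then stop (Spark 3.3+)
#   .trigger(continuous="1 second")   # experimental; map-only queries
```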

3. Hands-On: A "Hello World" Stream

This example simulates a stream by reading files as you drop them into a folder.

Scenario: You are monitoring a folder /tmp/input for new CSV files containing device_id and temperature.
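A minimal sketch of that scenario (the schema matches the columns named above; the header option assumes the CSV files include a header row):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("HelloStream").getOrCreate()

# Streaming file sources require an explicit schema.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

# Source: every new CSV dropped into /tmp/input becomes new rows.
readings = (spark.readStream
            .schema(schema)
            .option("header", "true")
            .csv("/tmp/input"))

# Query: running average temperature per device.
avg_temp = readings.groupBy("device_id").agg(avg("temperature").alias("avg_temp"))

# Sink: print the full result table after every micro-batch.
query = (avg_temp.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()  # block until the stream is stopped
```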

4. Critical Concept: Watermarking

  • The Problem: Data often arrives late. A sensor might lose connection and send data from 10:00 AM at 10:05 AM.

  • The Solution: You tell Spark, "Wait for late data, but only for 10 minutes."

  • Code: .withWatermark("timestamp", "10 minutes")

  • Result: Spark keeps the "state" (memory) for a window open while late data can still arrive. Once the watermark passes (10 minutes here), it finalizes the window, drops any later-arriving data for it, and clears the state to free up RAM. (Sketched below.)
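A sketch of a watermarked, windowed aggregation. It assumes the incoming rows carry an event-time `timestamp` column (not part of the CSV schema above, so hypothetical here):

```python
from pyspark.sql.functions import window, avg

windowed = (readings
            .withWatermark("timestamp", "10 minutes")   # tolerate 10 min of lateness
            .groupBy(window("timestamp", "5 minutes"), "device_id")
            .agg(avg("temperature").alias("avg_temp")))

# With a watermark, append mode can emit each window once it is final.
query = (windowed.writeStream
         .outputMode("append")
         .format("console")
         .start())
```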


Summary: Structured Streaming is just Spark SQL running in an infinite loop.

