Spark Structured Streaming

Processing in micro-batches


This is the modern way to handle real-time data in Spark, and the standard streaming API as of Spark 4.0.

The genius of Structured Streaming is that it treats a stream of data exactly like a static table, except that the table never stops growing.

1. The Core Concept: The Unbounded Table

In the legacy Spark Streaming (DStream) API, you had to think directly in "micro-batches" (e.g., 5-second chunks) and write different code for streams than for batch jobs.

In Structured Streaming, you write the exact same code as a standard batch query. The engine still processes data in micro-batches by default, but it manages the batching for you.

  • Batch: "Calculate the average salary from this static file."

  • Streaming: "Calculate the average salary from this data... and keep updating the answer as new data arrives." (The sketch below shows how similar the code is.)
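A minimal PySpark sketch of that parity; the file path and the `salary` column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("parity-demo").getOrCreate()

# Batch: compute the average salary from a static file, once.
batch_df = spark.read.parquet("/data/salaries")   # hypothetical path
batch_df.agg(avg("salary")).show()

# Streaming: identical logic, but the "table" never stops growing.
# Streaming file sources need an explicit schema, so reuse the batch one.
stream_df = spark.readStream.schema(batch_df.schema).parquet("/data/salaries")

query = (stream_df.agg(avg("salary"))
         .writeStream
         .outputMode("complete")   # re-emit the updated average every batch
         .format("console")
         .start())
```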

2. The Programming Model

To write a streaming job, you just need to understand four new building blocks that control how the data flows.

A. Input (Source)

Where is the data coming from?

  • File Source: Reads files from a folder as they appear (JSON, CSV, Parquet).

  • Kafka Source: Reads messages from Apache Kafka (the most common source in production; sketched after this list).

  • Socket Source: Reads text from a network port (good for testing).
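As a sketch of the Kafka source, reusing the `spark` session from the earlier example (the broker address and topic name are hypothetical):

```python
# Subscribe to a topic; Spark tracks the consumed offsets itself.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-events")
       .load())

# Kafka delivers key/value as binary columns; cast them before parsing.
events = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```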

B. Output (Sink)

Where does the answer go?

  • File Sink: Writes results to storage (S3, HDFS).

  • Kafka Sink: Pushes results back to another Kafka topic (sketched after this list).

  • Console Sink: Prints to the screen (debugging only).

  • Memory Sink: Stores in RAM (debugging only).
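The matching Kafka sink, assuming the `events` DataFrame from the source sketch above (the output topic and checkpoint path are hypothetical). Durable sinks need a checkpoint location so Spark can recover exactly where it left off after a failure:

```python
# The Kafka sink writes the `value` (and optional `key`) column.
query = (events.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "sensor-events-out")
         .option("checkpointLocation", "/tmp/checkpoints/sensor")
         .start())
```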

C. Output Modes (The "Update Strategy")

Since the data is infinite, how do you want the results?

  1. Append Mode (Default): Only add new rows to the output. (Good for queries whose rows never change once written, like filters and simple inserts.)

  2. Complete Mode: Rewrite the entire result table every time. (Good for aggregations, like "Total Count").

  3. Update Mode: Only output the rows that changed since the last trigger. (See the sketch after this list.)
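The choice matters: a plain streaming aggregation cannot use append mode (its rows keep changing), so a running count needs complete or update mode. A sketch, assuming the `events` DataFrame from the Kafka example:

```python
counts = events.groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")   # or "update" to emit only the changed rows
         .format("console")
         .start())
```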

D. Triggers (The "Pacing")

  • Default (Micro-batch): Process the next batch as soon as the previous one finishes.

  • Fixed Interval: "Process data every 30 seconds."

  • AvailableNow (Spark 3.3+): "Process all available data right now, then stop." (Great for periodic batch jobs that use streaming logic).

  • Continuous (Experimental): Millisecond-level latency, limited to simple map-like queries. (Trigger syntax is sketched below.)
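Triggers are configured on the writer. A sketch, assuming the `counts` DataFrame from the previous example:

```python
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="30 seconds")   # fixed-interval micro-batches
         .start())

# Other pacing options (use one trigger per query):
#   .trigger(availableNow=True)       # drain all available data, then stop (Spark 3.3+)
#   .trigger(continuous="1 second")   # experimental; map-only queries
```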

3. Hands-On: A "Hello World" Stream

This example simulates a stream by reading files as you drop them into a folder.

Scenario: You are monitoring a folder /tmp/input for new CSV files containing device_id and temperature.
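A minimal sketch of that scenario (the schema matches the columns named above; the header option assumes the CSV files include a header row):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("HelloStream").getOrCreate()

# Streaming file sources require an explicit schema.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

# Source: every new CSV dropped into /tmp/input becomes new rows.
readings = (spark.readStream
            .schema(schema)
            .option("header", "true")
            .csv("/tmp/input"))

# Query: running average temperature per device.
avg_temp = readings.groupBy("device_id").agg(avg("temperature").alias("avg_temp"))

# Sink: print the full result table after every micro-batch.
query = (avg_temp.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()  # block until the stream is stopped
```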

4. Critical Concept: Watermarking

  • The Problem: Data often arrives late. A sensor might lose connection and send data from 10:00 AM at 10:05 AM.

  • The Solution: You tell Spark, "Wait for late data, but only for 10 minutes."

  • Code: .withWatermark("timestamp", "10 minutes")

  • Result: Spark keeps the "state" (memory) for a window open while late data can still arrive. Once the watermark passes (10 minutes here), it finalizes the window, drops any later-arriving data for it, and clears the state to free up RAM. (Sketched below.)
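A sketch of a watermarked, windowed aggregation. It assumes the incoming rows carry an event-time `timestamp` column (not part of the CSV schema above, so hypothetical here):

```python
from pyspark.sql.functions import window, avg

windowed = (readings
            .withWatermark("timestamp", "10 minutes")   # tolerate 10 min of lateness
            .groupBy(window("timestamp", "5 minutes"), "device_id")
            .agg(avg("temperature").alias("avg_temp")))

# With a watermark, append mode can emit each window once it is final.
query = (windowed.writeStream
         .outputMode("append")
         .format("console")
         .start())
```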


Summary: Structured Streaming is just Spark SQL running in an infinite loop.

