Persistent state (The Memory)
Now that we know why we need to remember data (to turn Streams into Tables), this chapter explains how we actually store that memory in a distributed system that might crash at any moment.
The Core Problem: Streaming is stateful.
If you are calculating a 24-hour average, and your server crashes at Hour 23, you cannot afford to lose the previous 23 hours of data.
You need a way to store this intermediate data (State) so that it survives failures and can be moved around if you resize the cluster.
What is "State"?
In streaming, state is simply "info about the past that we need for the future."
Examples:
The current sum of a window.
The buffer of data waiting for a Watermark.
The deduplication map for Exactly-Once processing.
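As a minimal sketch of the first example above (a per-key running sum), here is what that "info about the past" looks like in plain Java, with no engine involved; the names are illustrative only. The map is the state, and losing it means losing every event seen so far.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: "state" as information about the past kept for future events.
public class RunningSum {
    // The state: one running sum per key. In a real engine this must survive crashes.
    private final Map<String, Double> sums = new HashMap<>();

    // Called once per incoming event; reads and updates the state.
    public double onEvent(String key, double value) {
        double updated = sums.getOrDefault(key, 0.0) + value;
        sums.put(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        RunningSum state = new RunningSum();
        state.onEvent("user-1", 10.0);
        state.onEvent("user-1", 5.0);
        System.out.println(state.onEvent("user-1", 2.5)); // 17.5
    }
}
```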
Implicit vs. Explicit State
The book categorizes state into two types based on who manages it.
A. Implicit State (Internal)
Definition: State created automatically by the streaming engine to do its job.
Examples: Grouping buffers, Watermark timestamps, Trigger timers.
Behavior: You (the developer) don't write code to manage this. The engine (Flink/Beam) handles it.
B. Explicit State (User-Defined)
Definition: State that you ask the system to store for your specific business logic.
Examples: "I want to remember the last 5 transactions for this specific user ID to detect fraud."
Mechanism: You use APIs (like ValueState or MapState) to read/write this data.
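A hedged sketch of what this looks like with Flink's keyed-state API. The input is simplified to just a user ID string, and the alert threshold is an arbitrary placeholder, not something from the text.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts transactions per user ID and emits an alert once a user exceeds a threshold.
// Use on a keyed stream, e.g. userIds.keyBy(id -> id).flatMap(new FraudCounter()).
public class FraudCounter extends RichFlatMapFunction<String, String> {

    // Explicit state: one counter per key (user ID), stored and restored by the engine.
    private transient ValueState<Long> txCount;

    @Override
    public void open(Configuration parameters) {
        txCount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("txCount", Long.class));
    }

    @Override
    public void flatMap(String userId, Collector<String> out) throws Exception {
        Long previous = txCount.value();                  // read state for the current key
        long updated = (previous == null ? 0L : previous) + 1;
        txCount.update(updated);                          // write it back

        if (updated > 5) {                                // arbitrary threshold for the sketch
            out.collect("possible fraud: " + userId + " has " + updated + " transactions");
        }
    }
}
```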
The "Hot" vs. "Cold" Architecture
Where does this state physically live? This is the most critical architectural decision in a streaming engine.
A. The "Cold" Approach (Remote DB)
Design: The worker is stateless. Every time an event arrives, it queries a remote database (e.g., Redis, Cassandra) to fetch the state, updates it, and saves it back.
Pros: Easy to scale the workers (they are dumb).
Cons: Terrible Performance. The network round trip on every event kills your throughput.
B. The "Hot" Approach (Local State - The Standard)
Design: The state lives locally on the worker machine (in RAM or on a fast local SSD, often using RocksDB).
Pros: Blazing Fast. No network calls for state access.
Cons: Hard to Scale. If you lose a worker, you lose the state unless you have a backup mechanism.
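In Flink, for instance, keeping state hot and local is a state-backend choice. A minimal sketch, assuming Flink 1.13+ with the RocksDB state backend dependency on the classpath:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HotStateSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // "Hot" approach: keyed state lives on the worker itself, in memory and
        // spilled to local disk by RocksDB, so no remote database call per event.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
    }
}
```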
Fault Tolerance: Checkpointing
Because we use the "Hot" approach (local state), we need a safety net.
Mechanism: Periodically (e.g., every 10 seconds), the system pauses* and takes a "Snapshot" of the local state.
Storage: This snapshot is uploaded to durable, remote storage (like S3 or HDFS).
Recovery: If a worker crashes, a new worker starts, downloads the last snapshot from S3, and resumes processing.
*Note: Modern systems do this asynchronously so they don't actually pause processing.
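A sketch of the corresponding safety net in Flink; the 10-second interval and the S3 path are placeholder values, not prescriptions.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the local state every 10 seconds (done asynchronously in modern Flink).
        env.enableCheckpointing(10_000);

        // Ship each snapshot to durable remote storage; the bucket/path is a placeholder.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // On failure, a new worker restores from the latest completed checkpoint
        // and resumes processing from there.
    }
}
```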
The "Re-Scaling" Problem
This is the hardest problem in state management.
Scenario: You have 10 servers holding 1TB of state. You want to add 10 more servers to handle higher load.
The Challenge: You cannot just add empty servers. You have to physically move 500GB of state from the old servers to the new ones while the pipeline is running.
The Solution: Key-Group Partitioning. The state is not stored as one giant blob; it is sharded into thousands of tiny "Key Groups." When scaling, the system simply re-assigns ownership of these groups to the new workers.
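A simplified sketch of the idea, modeled loosely on Flink's key-group scheme (not the engine's actual code): keys hash to a fixed number of key groups, each worker owns a contiguous range of groups, and re-scaling only changes which worker owns each group.

```java
// Simplified illustration of key-group partitioning; not production code.
public class KeyGroups {

    // Every key deterministically maps to one of maxParallelism key groups.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Each worker owns a contiguous range of key groups; re-scaling re-assigns
    // whole groups (and their state) to workers, never individual keys.
    static int workerFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;   // total number of key groups, fixed up front
        int group = keyGroupFor("user-42", maxParallelism);

        System.out.println("owner with 10 workers: " + workerFor(group, maxParallelism, 10));
        System.out.println("owner with 20 workers: " + workerFor(group, maxParallelism, 20));
    }
}
```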
Summary
State = Memory: Essential for aggregations.
Architecture: "Hot" state (Local) is preferred over "Cold" state (Remote) for performance.
Safety: Achieved via periodic Checkpoints to remote storage (S3/HDFS).
Scaling: Requires physically moving (re-sharding) state between workers.