Handling System Overload: Backpressure and DLQs
Reliability Patterns in Streaming Data
In streaming systems, we must design for the "Stressed Path," where failure is a statistical certainty at scale. Reliability is built by separating data-level errors (a single bad message) from system-level overload (more traffic than the consumer can handle).
Dead Letter Queues (The "Morgue" for Data)
A Dead Letter Queue (DLQ) is a secondary buffer where messages are routed when they cannot be processed after a defined number of retries.
The "Poison Pill" Problem: In a stream, a single malformed record (e.g., a string in an integer field) can crash a consumer. If the consumer simply retries forever, the entire pipeline stops.
The Mechanism:
Detection: The consumer fails to process a message (e.g., deserialization or validation error).
Retry Policy: The system attempts to re-process the message a fixed number of times to account for transient issues (like a brief network flicker).
Routing: If the final retry also fails, the message is "dead-lettered" to a separate topic or storage location.
Why it's essential: It isolates "bad data" without halting the flow of "good data." Engineers can inspect the DLQ later, fix the bug, and re-inject the corrected data into the stream, as sketched below.
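Here is a minimal sketch of that mechanism, assuming a Kafka-style setup with the confluent-kafka Python client. The topic names (orders, orders.dlq), the process() function, and the retry budget are illustrative assumptions, not a fixed recipe.

```python
# A minimal DLQ sketch: retry a few times, then route the poison pill to a
# dead-letter topic so the rest of the partition keeps flowing.
import json
from confluent_kafka import Consumer, Producer

MAX_RETRIES = 3  # retry budget for transient failures (illustrative)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-service",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])

def process(payload: dict) -> None:
    # Hypothetical business logic; raises on malformed data.
    if not isinstance(payload.get("amount"), int):
        raise ValueError("amount must be an integer")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(json.loads(msg.value()))
            break  # success: stop retrying
        except Exception as exc:
            if attempt == MAX_RETRIES:
                # Dead-letter the message, keeping the error for later inspection.
                producer.produce("orders.dlq", value=msg.value(),
                                 headers={"error": str(exc)})
                producer.flush()
    consumer.commit(msg)  # commit either way; the bad record now lives in the DLQ
```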
If a Dead Letter Queue (DLQ) is the "morgue" for messages that died after too many retries, Backpressure is the "braking system" that prevents the whole pipeline from crashing in the first place.
While a DLQ handles individual message failure, Backpressure handles system-level overload.
What is Backpressure?
Backpressure is a flow-control mechanism where a consumer (downstream) signals to a producer (upstream) to slow down because it can't keep up with the volume of data.
Imagine a coffee shop:
The Producer: A group of 50 tourists walks in and orders at once.
The Consumer: One barista making drinks.
Without Backpressure: The barista tries to take every order, gets confused, runs out of milk, and eventually quits (the system crashes with an out-of-memory error).
With Backpressure: The barista puts up a "One Minute Wait" sign or stops taking orders until they've cleared the current batch. They are pushing "pressure" back to the source.
How it works in common streaming tools:
1. RabbitMQ (TCP Backpressure)
RabbitMQ uses a very literal form of backpressure. If the broker's memory usage crosses its high watermark (40% of system RAM by default), it blocks every connection that is trying to publish: it literally stops reading from those TCP sockets. The producer's publish call simply "hangs" until the broker frees enough memory to resume.
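A rough sketch of what this looks like from the producer side, using the pika client; the queue name, message loop, and timeout value are assumptions for illustration.

```python
# A publisher that "feels" RabbitMQ's memory-based backpressure.
import pika

params = pika.ConnectionParameters(
    host="localhost",
    # Optional: raise an error instead of hanging forever while blocked.
    blocked_connection_timeout=300,
)
connection = pika.BlockingConnection(params)

# The broker notifies us when its memory alarm trips and clears.
connection.add_on_connection_blocked_callback(
    lambda conn, method: print("Broker memory alarm: publishes are paused")
)
connection.add_on_connection_unblocked_callback(
    lambda conn, method: print("Broker recovered: publishes resume")
)

channel = connection.channel()
channel.queue_declare(queue="events", durable=True)

for i in range(1_000_000):
    # If the memory high watermark is hit, this call stalls until the broker
    # starts reading from our TCP socket again.
    channel.basic_publish(exchange="", routing_key="events",
                          body=f"event-{i}".encode())

connection.close()
```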
2. Apache Kafka (The "Pull" Model)
Kafka has "built-in" backpressure because of its architecture. Since consumers pull data from Kafka at their own pace, they can never be "overwhelmed" by the broker. If the consumer is slow, it just lags behind.
The Twist: Backpressure in Kafka usually appears at the Producer → Broker level. If the brokers are overwhelmed, acknowledgments (ACKs) come back more slowly, the producer's local send buffer fills up, and further sends block or fail until space is freed.
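The consumer side needs nothing special (it just polls at its own pace), so here is a hedged sketch of the producer side with the confluent-kafka client: when the local buffer fills because ACKs are slow, produce() raises BufferError and the code waits before retrying. The topic name, buffer size, and retry loop are illustrative assumptions.

```python
# Producer-side backpressure: back off when the local send queue is full.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                           # wait for full broker acknowledgment
    "queue.buffering.max.messages": 10000,   # small buffer to make pressure visible
})

def send_with_backpressure(topic: str, value: bytes) -> None:
    while True:
        try:
            producer.produce(topic, value=value)
            producer.poll(0)   # serve delivery callbacks, drain the queue
            return
        except BufferError:
            # The local queue is full because the brokers aren't keeping up:
            # wait here until some in-flight messages are acknowledged.
            producer.poll(1.0)

for i in range(1_000_000):
    send_with_backpressure("events", f"event-{i}".encode())

producer.flush()
```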
3. Spark & Flink (The "Feedback Loop")
In stream processing, if a "sink" (like a database) is slow, the processing engine (Flink/Spark) will send a signal back up the chain to the "source" (like Kafka) to fetch fewer records per second.
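Flink applies this feedback automatically (credit-based flow control between tasks), so there is usually nothing to configure. In Spark Structured Streaming, the common user-facing lever is to cap how much the Kafka source reads per micro-batch. A minimal PySpark sketch follows; the topic, server, and checkpoint values are illustrative, and the Kafka source requires the spark-sql-kafka package on the classpath.

```python
# Cap source ingestion so a slow sink cannot be flooded.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("backpressure-demo").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    # Read at most 10,000 records per micro-batch, no matter how far behind we are.
    .option("maxOffsetsPerTrigger", 10000)
    .load()
)

query = (
    events.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("console")   # stand-in for a slow sink such as a database
    .option("checkpointLocation", "/tmp/checkpoints/backpressure-demo")
    .start()
)
query.awaitTermination()
```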
Backpressure Strategies:
When a system feels backpressure, it usually has four choices:
Buffering: store excess messages in a queue. A temporary fix; if the queue fills up, you still need another plan.
Dropping: throw away the newest (or oldest) data. Fast, but you lose data (acceptable for logs or sensor readings).
Control: tell the producer to slow down. The safest engineering approach (this is how TCP flow control works).
Scaling: automatically spin up more consumers. Expensive, but it keeps the system moving (e.g., the Kubernetes Horizontal Pod Autoscaler).
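To make the first three strategies concrete, here is a small standard-library Python sketch built around a bounded in-memory buffer. The queue size and sleep time are arbitrary, and Scaling is omitted because it lives at the infrastructure layer (e.g., an autoscaler) rather than in application code.

```python
# Buffering, Control, and Dropping around one bounded queue.
import queue
import time

buffer: queue.Queue = queue.Queue(maxsize=100)   # Buffering: bounded, never infinite

def produce_with_control(item) -> None:
    # Control: block the producer until the consumer frees space.
    buffer.put(item, block=True)

def produce_with_dropping(item) -> bool:
    # Dropping: if the buffer is full, discard the new item and move on.
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False   # data lost, but the producer never stalls

def consumer_loop() -> None:
    while True:
        item = buffer.get()
        time.sleep(0.01)   # simulate slow downstream work
        buffer.task_done()
```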