Handling data quality


Handling data quality in streaming systems is significantly harder than in batch systems. In a database, bad data just sits there. In a stream, bad data is a "Poison Pill"—it can crash your consumers, block your pipelines, and stop the entire factory line.

Here are the three standard defense mechanisms Data Engineers use to handle this.


1. The Guard: Schema Registry (Prevention)

The best way to handle bad data is to reject it before it enters the system.

In a "Schemaless" world (like sending raw JSON), a producer might accidentally change a field from integer to string. This will crash every consumer downstream.

To fix this, we use a Schema Registry (a separate server that stores the "Contract").

  • How it works:

    1. The Producer tries to send a message.

    2. It sends the message structure to the Schema Registry first.

    3. If the structure doesn't match the agreed-upon schema (defined in a format like Avro or Protobuf), the send is rejected.

  • Result: The bad data never hits the Event Bus. The Producer team gets an error immediately and has to fix their code.
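
A minimal sketch of this producer-side check, using the confluent-kafka Python client; the broker address, registry URL, topic name, and the example schema are assumptions for illustration:

```python
# A producer that validates events against a registered Avro schema before
# they ever reach the event bus (broker/registry URLs and topic are assumed).
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

# The agreed-upon contract: user_id is a string, age is an int.
user_schema = """
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(registry, user_schema),
})

# Serialization fails here if the payload breaks the contract
# (e.g. age sent as "forty-two"), so the bad event never hits the bus.
producer.produce("user-events", key="u-123",
                 value={"user_id": "u-123", "age": 42})
producer.flush()
```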


2. The Quarantine: Dead Letter Queue (DLQ)

Sometimes, data passes the schema check but is logically invalid (e.g., age: -5) or causes a processing error.

If your consumer crashes on Message #5, it will restart, read Message #5 again, and crash again. This is an infinite loop (a Poison Pill). You need a way to remove that message so the system can proceed to Message #6.

  • The Pattern (see the sketch after this list):

    1. Try: The Consumer attempts to process the event.

    2. Catch: If it fails, do not crash.

    3. Move: Publish that specific failed event to a separate topic called a Dead Letter Queue (DLQ).

    4. Proceed: Commit the offset and move to the next message.

  • What happens to the DLQ?

    • You set up an alert on it.

    • A human (or a separate script) inspects the bad messages later to decide if they should be fixed and replayed or deleted.
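
A minimal sketch of the try/catch/move/proceed pattern above, using the confluent-kafka Python client; the broker address, topic names, and the `process_order` helper are assumptions for illustration:

```python
# Consume from the main topic; divert failures to a DLQ topic instead of
# crashing (broker address, topics, and process_order() are assumed).
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-processor",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])

def process_order(event):
    # Hypothetical business logic; raises on logically invalid events.
    if event.get("age", 0) < 0:
        raise ValueError("age must be non-negative")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        # Try: attempt to process the event.
        process_order(json.loads(msg.value()))
    except Exception:
        # Catch + Move: publish the poison pill to the DLQ, do not crash.
        producer.produce("orders-dlq", key=msg.key(), value=msg.value())
        producer.flush()
    # Proceed: commit the offset either way so we advance to the next message.
    consumer.commit(message=msg)
```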


3. The Filter: Stream Validation (Sanitization)

If you are using a stream processor (like Spark Streaming or Flink), you often add a "Quality Layer" right at the start of your pipeline.

Instead of crashing, you split the stream into two:

  • Stream A (Good Data): Goes to the Data Warehouse.

  • Stream B (Bad Data): Goes to an S3 bucket for audit/inspection.

Example Logic:

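A minimal sketch of this split, using PySpark Structured Streaming; the source topic, schema, quality rule, and sink locations are assumptions for illustration:

```python
# Split the incoming stream into a "good" branch (warehouse) and a "bad"
# branch (S3 audit bucket); topic, schema, and paths are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quality-layer").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "user_id STRING, age INT").alias("e"))
    .select("e.*")
)

# The quality rule: age must be present and non-negative.
is_valid = F.col("age").isNotNull() & (F.col("age") >= 0)

# Stream A (good data) goes to the warehouse tables.
(events.filter(is_valid)
    .writeStream.format("parquet")
    .option("checkpointLocation", "/checkpoints/good")
    .start("/warehouse/user_events"))

# Stream B (bad data) goes to an S3 bucket for audit/inspection.
(events.filter(~is_valid)
    .writeStream.format("json")
    .option("checkpointLocation", "/checkpoints/bad")
    .start("s3a://audit-bucket/bad-events/"))

spark.streams.awaitAnyTermination()
```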

Summary of Defense Layers

Layer      | Strategy | Tooling
-----------|----------|--------------------------------
Ingestion  | Reject   | Schema Registry (Avro/Protobuf)
Processing | Divert   | Dead Letter Queues (DLQ)
Logic      | Filter   | Valid/Invalid Stream Splitting

