Note about handling transient data
What does it mean when data is transient?
In the context of data engineering and AWS services, transient means "just passing through." The specific technical stack doesn't matter; what matters is having a backup in place so the data isn't lost.
When data is transient, it is not stored on the disk of that specific service for long-term retrieval. The service holds the data only for the brief moment required to move it from Point A to Point B.
Here is the breakdown of what that implies for you as a Data Engineer:
1. No "Replayability" (The Key Difference)
This is the most critical distinction between Kinesis Data Streams (KDS) and Amazon Data Firehose.
Kinesis Data Streams (Not Transient): KDS is like a Voicemail Inbox. If you listen to a message (consume data), the message stays there. You can come back 4 hours later and listen to it again. The data is persisted (default 24 hours).
Amazon Data Firehose (Transient): Firehose is like a Live Phone Call. It takes the data, buffers it briefly (e.g., 60 seconds), and immediately pushes it to S3. Once it hands the data to S3, it deletes its copy. You cannot "rewind" Firehose to fix a mistake.
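To illustrate that difference, the retention window on a Kinesis Data Stream is an explicit, tunable setting, while a Firehose stream has no equivalent knob. A minimal Terraform sketch (the stream name is hypothetical):

```hcl
# Kinesis Data Streams persist records for a configurable window,
# so consumers can re-read ("replay") them within that window.
resource "aws_kinesis_stream" "clickstream" {
  name             = "clickstream-demo" # hypothetical name
  shard_count      = 1
  retention_period = 48 # hours; default is 24, can be raised up to 8760 (365 days)
}
```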
2. It Exists Only in Memory (Mostly)
Transient data usually lives in RAM (Random Access Memory) or a small temporary buffer. It is volatile.
If the service crashes while holding transient data, that data could be lost forever (unless the service has a specific "retry" mechanism or a backup queue).
3. The "Pipe vs. Bucket" Analogy
Persistent Data (S3, Database, KDS): This is a Bucket. You pour water in, and it stays there until you decide to take it out.
Transient Data (Firehose, API Calls): This is a Pipe. The pipe is full of water, but the pipe's job isn't to hold the water; its job is to move it. If you turn off the tap, the pipe eventually becomes empty.
Why does this matter to you?
If you are building a pipeline using a transient service (like Firehose), you must configure a Dead Letter Queue (DLQ) or an S3 backup.
Since the service doesn't keep a history, if the delivery fails (e.g., Redshift is offline), the data has nowhere to go. Without a backup configuration, that transient data vanishes into the ether.
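As a sketch of what such a safety net can look like, Firehose's extended S3 destination supports backing up the raw source records to a bucket of your choice. The role, bucket, and stream names below are assumptions, not part of the original note:

```hcl
resource "aws_kinesis_firehose_delivery_stream" "with_backup" {
  name        = "orders-stream" # hypothetical name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn # assumed to exist elsewhere
    bucket_arn = aws_s3_bucket.data_lake.arn    # assumed to exist elsewhere

    # Keep an untouched copy of every incoming record, so a failed
    # delivery never means the data vanishes into the ether.
    s3_backup_mode = "Enabled"
    s3_backup_configuration {
      role_arn   = aws_iam_role.firehose_role.arn
      bucket_arn = aws_s3_bucket.data_lake.arn
      prefix     = "backup/"
    }
  }
}
```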
Well, is it considered a good idea to create a dedicated S3 bucket only for handling a DLQ?
Short answer: Usually, no.
For most use cases, creating a dedicated S3 bucket just for a Dead Letter Queue (DLQ) is considered over-engineering.
The standard "Best Practice" in AWS is to use the Same Bucket, Different Prefix.
1. The "Prefix Strategy" (Recommended)
AWS services like Amazon Data Firehose are designed to handle this natively. You can configure them to dump successful data in one folder (prefix) and failed data in another, all within the same bucket.
Why this is better:
Simplicity: You manage one bucket policy, one encryption key, and one Terraform resource.
Cost/Limits: You avoid hitting the AWS soft limit on S3 buckets (100 per account) if you have many pipelines.
Atomic View: Your "Data Lake" is self-contained.
The Structure:
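A minimal sketch of that layout, assuming a bucket named my-data-lake (the name is illustrative):

```
s3://my-data-lake/data/year=2024/...    <- successful deliveries
s3://my-data-lake/errors/year=2024/...  <- failed records (the "DLQ" prefix)
```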
2. When SHOULD you use a Dedicated Bucket?
There are two specific scenarios where a separate bucket is actually the right choice:
Scenario A: The "Toxic" Data (Security)
If your pipeline handles Sensitive PII (e.g., credit cards, social security numbers), failed records are dangerous.
Why: A parsing error might happen because the encryption failed, leaving the PII exposed in plain text in the error log.
Solution: Send errors to a strictly locked-down bucket (s3://my-sensitive-dlq) that only Senior Admins can read. Developers can read the normal bucket, but not the error bucket.
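One way to express that lock-down, sketched in Terraform: deny object reads to every principal except the admin role. The account ID, role name, and bucket resource below are placeholders:

```hcl
resource "aws_s3_bucket_policy" "sensitive_dlq" {
  bucket = aws_s3_bucket.sensitive_dlq.id # assumed dedicated DLQ bucket

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyReadsExceptSeniorAdmins"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:GetObject"
      Resource  = "${aws_s3_bucket.sensitive_dlq.arn}/*"
      Condition = {
        StringNotLike = {
          # Placeholder account ID and role name
          "aws:PrincipalArn" = "arn:aws:iam::123456789012:role/SeniorAdmin"
        }
      }
    }]
  })
}
```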
Scenario B: Lifecycle Management
This applies when you have massive differences in retention needs.
Example: You need to keep successful data for 7 years (Compliance), but you only want to keep error logs for 3 days to debug them.
Note: You can do this with Prefix-based Lifecycle rules in a single bucket, but a separate bucket makes the policy "foolproof" so you don't accidentally delete production data.
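For the single-bucket variant, a prefix-scoped lifecycle configuration might look like the following sketch (the bucket reference and retention numbers mirror the example above and are assumptions):

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "retention" {
  bucket = aws_s3_bucket.data_lake.id # assumed to exist elsewhere

  # Error logs are only useful for debugging: expire them after 3 days.
  rule {
    id     = "expire-error-logs"
    status = "Enabled"
    filter {
      prefix = "errors/"
    }
    expiration {
      days = 3
    }
  }

  # Successful data moves to cheaper storage and is kept for ~7 years.
  rule {
    id     = "archive-data"
    status = "Enabled"
    filter {
      prefix = "data/"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    expiration {
      days = 2555 # roughly 7 years
    }
  }
}
```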
Summary
| Feature | Same Bucket (Prefix) | Dedicated Bucket |
| --- | --- | --- |
| Complexity | Low (Preferred) | Medium |
| Security | Standard | High (Good for PII) |
| IAM Policies | Shared | Isolated |
| Use Case | 95% of pipelines | Highly Sensitive Data |
Example: Implementing the "Same Bucket, Different Prefix" strategy
Here is the Terraform configuration to implement the "Same Bucket, Different Prefix" strategy.
This setup automatically sorts your data:
Success: Goes to /data/year=2024/...
Failure: Goes to /errors/year=2024/... (Result: easy debugging.)
Terraform Snippet
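A sketch of what that delivery stream can look like. The role, bucket, and stream names are placeholders, and the attribute names follow recent versions of the AWS provider:

```hcl
resource "aws_kinesis_firehose_delivery_stream" "data_lake_ingest" {
  name        = "data-lake-ingest" # hypothetical name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn # assumed to exist elsewhere
    bucket_arn = aws_s3_bucket.data_lake.arn    # assumed to exist elsewhere

    # Successful records: /data/, partitioned by arrival time.
    # Note the trailing slash at the end of the prefix.
    prefix = "data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"

    # Failed records: /errors/, with the failure type as a sub-folder.
    error_output_prefix = "errors/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/"

    buffering_interval = 60 # seconds
    buffering_size     = 64 # MB
  }
}
```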
Key Details to Notice:
!{timestamp:yyyy} syntax: These are Firehose prefix expressions. Firehose automatically creates folders based on the arrival time of the data, which is crucial for performance if you plan to query this data later using Athena or Glue.
!{firehose:error-output-type}: In the error prefix, I added this extra variable. It creates a sub-folder describing why the record failed (e.g., /errors/.../processing-failed/ or /errors/.../delivery-failed/). This organizes your debugging instantly.
Trailing Slash: Always ensure your prefixes end with /. If you forget it, Firehose will mash the filename into the folder name (e.g., .../day=05filename-abc instead of .../day=05/filename-abc).