Data Lifecycle & Architectural Patterns


In Data Engineering, the nature and lifecycle of the data dictate the architecture you build.

1. Ephemeral Data (The "Short-Lived" Data)

Concept: This is data that is useful for a very short period and then becomes "stale" or toxic (clutter). It is slightly different from transient data; transient data is moving to a destination, whereas ephemeral data lives in a temporary state (like a cache or session).

  • Examples: Redis caches, user session tokens, intermediate "shuffle" files in a Spark job.

Required Pattern: TTL (Time-To-Live)

  • The Rule: Never write ephemeral data without an expiration date.

  • Why: If you don't set a TTL, your Redis instance will fill up with session data from 2 years ago, crashing your app.

  • Implementation: In Redis, you run EXPIRE <key> 3600 (1 hour), or set the TTL at write time. In S3, you attach a lifecycle policy such as "Delete after 7 days" to your temp prefixes.
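For a concrete feel for the rule, here is a minimal sketch using the redis-py client; the host, key names, and 1-hour TTL are illustrative assumptions, not fixed values.

```python
import redis

# A minimal TTL sketch, assuming a local Redis instance and the redis-py client.
r = redis.Redis(host="localhost", port=6379)

# Write the session token and its 1-hour expiry in one atomic call.
r.set("session:user:42", "token-abc123", ex=3600)

# Equivalent two-step form: SET, then EXPIRE on the same key.
r.set("session:user:43", "token-def456")
r.expire("session:user:43", 3600)

# Redis deletes the keys automatically once the TTL elapses.
print(r.ttl("session:user:42"))  # seconds remaining, e.g. 3599
```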

2. Immutable Data (The "Write-Once" Data)

Concept: Data that, once written, is never modified. If the information changes, you write a new record rather than updating the old one.

  • Examples: Financial transactions, Web Server Logs, Sensor readings (IoT).

Required Pattern: Event Sourcing & Idempotency

  • Event Sourcing: Instead of storing "Current Balance: $50", you store the history: "Credit $10", "Debit $5", "Credit $45". You derive the state by replaying the events.

  • Idempotency: Since you can't update, if you accidentally run a pipeline twice, you might duplicate data. Your system must handle duplicates gracefully (e.g., "Insert if not exists" or de-duplication windows).
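Below is a minimal sketch of both ideas together: the balance is derived by replaying events, and a duplicate delivery is skipped by tracking event IDs. The event shape and IDs are made up for illustration.

```python
# Event-sourced balance with an idempotent replay: duplicates are ignored.
events = [
    {"event_id": "e1", "type": "credit", "amount": 10},
    {"event_id": "e2", "type": "debit",  "amount": 5},
    {"event_id": "e3", "type": "credit", "amount": 45},
    {"event_id": "e2", "type": "debit",  "amount": 5},  # accidental re-delivery
]

def current_balance(events):
    seen, balance = set(), 0
    for e in events:
        if e["event_id"] in seen:   # "insert if not exists" in miniature
            continue
        seen.add(e["event_id"])
        balance += e["amount"] if e["type"] == "credit" else -e["amount"]
    return balance

print(current_balance(events))  # 50, even though one event arrived twice
```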

3. Derived Data (The "Reconstructible" Data)

Concept: Data that is the result of processing other data. If you deleted it right now, you wouldn't lose information—you would just lose the compute time it took to create it.

  • Examples: Materialized Views, Aggregated Dashboards (Daily Sales), Search Indexes (Elasticsearch).

Required Pattern: Separation of Source & State

  • The Rule: Treat derived data as disposable.

  • The Strategy: Your architecture should allow you to "nuke and rebuild." If your dashboard is showing wrong numbers, you should be able to drop the table and re-run the pipeline from the Raw (Immutable) Source (see the sketch after this list).

  • Mistake to Avoid: Treating a derived table as the "Source of Truth."
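Here is a minimal "nuke and rebuild" sketch, using SQLite purely as a stand-in warehouse; the table names and data are illustrative. The derived table is dropped and fully recomputed from the immutable source on every run.

```python
import sqlite3

# raw_orders is the immutable source; daily_sales is derived and disposable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("2025-05-01", 20.0), ("2025-05-01", 15.0), ("2025-05-02", 9.5)],
)

def rebuild_daily_sales(conn):
    # Drop and recompute the derived table from the raw source.
    conn.execute("DROP TABLE IF EXISTS daily_sales")
    conn.execute("""
        CREATE TABLE daily_sales AS
        SELECT order_date, SUM(amount) AS total
        FROM raw_orders
        GROUP BY order_date
    """)

rebuild_daily_sales(conn)  # safe to re-run any time the numbers look wrong
print(conn.execute("SELECT * FROM daily_sales ORDER BY order_date").fetchall())
```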

4. Slowly Changing Data (SCD)

Concept: Data that is mostly static but changes occasionally and unpredictably.

  • Examples: Customer home addresses, Product categories.

Required Pattern: SCD Type 2 (History Preservation)

  • The Problem: If a user moves from "Astana" to "Almaty", and you just overwrite the record, all their past orders will essentially look like they were shipped to Almaty. You falsify history.

  • The Solution (Type 2): You add columns for start_date, end_date, and is_current.

    • Row 1: name: Nariman | city: Astana | start_date: 2024-01-01 | end_date: 2025-05-01 | is_current: False

    • Row 2: name: Nariman | city: Almaty | start_date: 2025-05-01 | end_date: NULL | is_current: True
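A minimal sketch of the Type 2 update over in-memory rows (in practice this would be a MERGE into a dimension table); the row shape mirrors the example above.

```python
from datetime import date

dim_customer = [
    {"name": "Nariman", "city": "Astana",
     "start_date": date(2024, 1, 1), "end_date": None, "is_current": True},
]

def apply_scd2(rows, name, new_city, change_date):
    for row in rows:
        if row["name"] == name and row["is_current"]:
            if row["city"] == new_city:
                return                      # nothing changed, keep the row open
            row["end_date"] = change_date   # close out the old version
            row["is_current"] = False
    # Open a new "current" version of the record.
    rows.append({"name": name, "city": new_city,
                 "start_date": change_date, "end_date": None, "is_current": True})

apply_scd2(dim_customer, "Nariman", "Almaty", date(2025, 5, 1))
# Old orders still join to the Astana row via their order date; new ones see Almaty.
```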

5. Cold vs. Hot Data (Access Patterns)

Concept: Not all data is equal. "Hot" data is accessed every second (current orders). "Cold" data is accessed once a year (audits from 2019).

Required Pattern: Data Tiering (Lifecycle Management)

  • The Rule: Don't pay "Hot" prices for "Cold" data.

  • Implementation: Move data automatically from high-performance storage (SSD/RDS) to cheap object storage (S3 Glacier) based on age.

  • The "Lakehouse" approach: Keep the hot data in the Warehouse (Snowflake/Redshift) for fast SQL, and the cold data in the Lake (S3 Parquet) for occasional querying.
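As a sketch of the tiering rule in code, here is an S3 lifecycle configuration set with boto3; the bucket name, prefix, and day counts are assumptions for illustration, and credentials are expected to be configured separately.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-events-to-glacier",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                # After 90 days objects move to Glacier; after 2 years they expire.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```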

6. Transient Data (The "In-Motion" Data)

Concept: Data that is currently moving from a Source to a Destination. It is not meant to be stored in the processing layer; it exists in memory buffers or network packets only while being transported.

  • Examples: HTTP Requests, WebSocket messages, Kinesis/Firehose buffers, RAM-based queues.

Required Pattern: Dead Letter Queues (DLQ) & Checkpointing

  • The Rule: Assume the destination will be offline at some point.

  • The Strategy: Since the data isn't saved on disk, if the transfer fails, the data is gone forever. You must configure a Dead Letter Queue (DLQ) to catch failed records, or use Checkpointing to mark progress so you can retry from the last successful packet (see the DLQ sketch after this list).

  • Mistake to Avoid: "Fire and Forget" protocols for critical data without a backup storage mechanism.
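Here is a minimal DLQ sketch: each record is delivered inside a try/except, and failures are parked in an SQS queue instead of vanishing. The queue URL and the deliver() function are illustrative placeholders.

```python
import json
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-pipeline-dlq"  # placeholder

def deliver(record):
    raise ConnectionError("destination is offline")  # stand-in for the real sink

def process(records):
    for record in records:
        try:
            deliver(record)
        except Exception as exc:
            # Don't let the in-flight record vanish: park it in the DLQ for later replay.
            sqs.send_message(
                QueueUrl=DLQ_URL,
                MessageBody=json.dumps({"record": record, "error": str(exc)}),
            )

process([{"order_id": 1, "amount": 42.0}])
```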

7. Unstructured / Semi-Structured Data

Concept: Data that does not have a pre-defined schema when it arrives. It’s messy (JSON logs, XML, raw text). You don't know what columns exist until you open the file.

  • Examples: Application logs, NoSQL dumps, Social Media feeds.

Required Pattern: Schema-on-Read (ELT)

  • The Old Way (ETL): You force the data to fit a table before loading it. If it doesn't fit, the load fails.

  • The Modern Pattern: Load the raw data as is into the Data Lake (S3). Apply the schema (definitions of columns/types) only when you read or query it. This prevents data loss when formats change unexpectedly.
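A minimal schema-on-read sketch over JSON lines: the load step writes the lines exactly as they arrived, and a column list is applied only when reading, so new or missing fields never break the load. File paths and field names are illustrative.

```python
import json

raw_lines = [
    '{"user": "a1", "event": "click", "ts": "2025-05-01T09:00:00Z"}',
    '{"user": "a2", "event": "purchase", "amount": 19.99, "coupon": "SPRING"}',  # new fields
]

# Load step (the "EL" in ELT): store the raw lines untouched.
with open("raw_events.jsonl", "w") as f:
    f.write("\n".join(raw_lines) + "\n")

# Read step: apply the schema only now, tolerating missing or extra columns.
SCHEMA = ["user", "event", "ts", "amount"]

def read_with_schema(path):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield {col: record.get(col) for col in SCHEMA}  # extras ignored, gaps become None

print(list(read_with_schema("raw_events.jsonl")))
```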

8. Late-Arriving Data

Concept: Data that has an "Event Time" (when it happened) significantly different from its "Processing Time" (when it arrived). This happens often in mobile apps where a user loses signal, performs actions offline, and uploads them hours later.

  • Examples: IoT sensors with bad connectivity, Mobile app events.

Required Pattern: Watermarking & Windowing

  • The Problem: If you are calculating "Sales per Hour," do you count a sale that happened at 9:00 AM but arrived at 11:00 AM?

  • The Solution: You define a Watermark (e.g., "I will wait 30 minutes for late data"). If data arrives within that window, you update the result. If it arrives after, you either discard it or send it to a side-output.
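Here is a sketch of that 30-minute watermark in Spark Structured Streaming; the source path, schema, and column names are assumptions, and a real pipeline would read from whatever stream you actually use.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

# Streaming source with an event-time column (illustrative path and schema).
events = (
    spark.readStream
    .schema("event_time TIMESTAMP, amount DOUBLE")
    .json("s3://my-bucket/sales-events/")
)

hourly_sales = (
    events
    .withWatermark("event_time", "30 minutes")           # wait up to 30 min for stragglers
    .groupBy(window(col("event_time"), "1 hour"))
    .sum("amount")
)

# Rows inside the watermark update their hour's total; later rows are dropped
# from this aggregation unless you route them to a side output yourself.
hourly_sales.writeStream.outputMode("update").format("console").start()
```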

9. Skewed Data

Concept: Data that is not distributed evenly. In a distributed system (like Spark or Redshift), keys are usually spread fairly evenly across nodes, but sometimes a single key holds 90% of the data.

  • Examples: "Null" values in a Join key, or a "Super User" (like a Justin Bieber account on Twitter) that has millions of interactions while everyone else has ten.

Required Pattern: Salting

  • The Problem: If you partition by user_id, the node processing "Justin Bieber" will crash (OOM) while other nodes sit idle.

  • The Solution: You "Salt" the key. You add a random suffix to the ID (e.g., bieber_1, bieber_2... bieber_10). This forces the data to spread across 10 nodes instead of 1. You process them in parallel, then aggregate the results at the end.
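Below is a salting sketch in PySpark: the hot key is split into salted sub-keys, pre-aggregated in parallel, then combined. Column names, the fake data, and the salt count of 10 are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, floor, rand, sum as sum_

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
df = spark.createDataFrame(
    [("bieber", 1.0)] * 1000 + [("alice", 1.0)] * 3,   # one key dominates
    ["user_id", "amount"],
)

NUM_SALTS = 10

# Stage 1: turn "bieber" into "bieber_0" ... "bieber_9" so the work spreads
# across many partitions, then pre-aggregate per salted key.
salted = df.withColumn(
    "salted_id",
    concat_ws("_", col("user_id"), floor(rand() * NUM_SALTS).cast("string")),
)
partial = salted.groupBy("user_id", "salted_id").agg(sum_("amount").alias("partial_sum"))

# Stage 2: strip the salt and combine the partial sums into the final totals.
totals = partial.groupBy("user_id").agg(sum_("partial_sum").alias("total"))
totals.show()
```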

10. Sensitive Data (PII)

Concept: Data that carries legal or ethical risk if exposed (GDPR, HIPAA). It requires special handling distinct from "normal" strings.

  • Examples: Emails, IP addresses, Credit Card numbers.

Required Pattern: Tokenization & Masking

  • The Rule: Security is not an afterthought; it is a pipeline step.

  • The Strategy:

    • Masking: Replace characters (nari***@gmail.com) for analysts who don't need the exact email.

    • Tokenization: Swap the sensitive value for a random token (User_998). Keep the mapping in a highly secure, separate "Vault" database. The data warehouse only holds the token.
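Here is a minimal sketch of both techniques; the in-memory vault dict, token format, and helper names are illustrative stand-ins for a real, separately secured token vault.

```python
import uuid

vault = {}  # stand-in for a separate, locked-down mapping store

def mask_email(email):
    # Keep the first few characters, hide the rest of the local part.
    local, domain = email.split("@", 1)
    return f"{local[:4]}{'*' * max(len(local) - 4, 0)}@{domain}"

def tokenize(value):
    # Hand out an opaque token; only the vault can map it back to the raw value.
    token = f"User_{uuid.uuid4().hex[:8]}"
    vault[token] = value
    return token

record = {"email": "nariman@gmail.com", "order_total": 42.0}

analyst_view  = {**record, "email": mask_email(record["email"])}  # nari***@gmail.com
warehouse_row = {**record, "email": tokenize(record["email"])}    # e.g. User_3f9a1c2b

print(analyst_view, warehouse_row)
```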


The Complete Summary Table (10 Patterns)

Here is the full summary table for the patterns above, pairing each kind of data with the design pattern that handles it.

| # | Concept | Nature of Data | Design Pattern to Use |
| --- | --- | --- | --- |
| 1 | Transient | "Just passing through" | Dead Letter Queue (DLQ) (Catch failures before they vanish) |
| 2 | Ephemeral | "Useful only for now" | TTL (Time-To-Live) (Auto-delete after set time) |
| 3 | Immutable | "The factual history" | Append-Only Logs (Never update, only insert) |
| 4 | Derived | "Calculated result" | Idempotency (Ability to rebuild/rerun without duplication) |
| 5 | SCD | "Changes rarely" | SCD Type 2 (Versioning rows with start/end dates) |
| 6 | Cold vs. Hot | "Access frequency" | Data Tiering (Move old data to cheaper storage) |
| 7 | Unstructured | "Unknown schema" | Schema-on-Read (Store raw, define structure later) |
| 8 | Late-Arriving | "Delayed delivery" | Watermarking (Define how long to wait for late events) |
| 9 | Skewed | "Uneven distribution" | Salting (Add random suffixes to spread load) |
| 10 | Sensitive | "High risk / PII" | Tokenization (Replace real data with secure tokens) |

