Batch architectures

Architectures built around batch data processing.


Traditional ETL Architecture in Modern Data Engineering

Batch data processing has evolved significantly, but traditional ETL (Extract → Transform → Load) remains one of the foundational patterns in data engineering. Even as cloud warehouses and ELT workflows grow in popularity, ETL is still widely used when data quality, structure, and governance are top priorities. Here’s a clear breakdown of what ETL really is, when it makes sense, and how it works today.


What ETL Is

ETL is a workflow where data is:

  1. Extracted from source systems

  2. Transformed before it reaches the warehouse

  3. Loaded into a curated, analytics-ready schema

Unlike ELT, all heavy lifting—cleaning, joining, validating, enriching—happens outside the warehouse.

Typical Technology Stack

  • On-premises: Informatica, Oracle Data Integrator (ODI), Pentaho

  • Cloud equivalents: AWS Glue, Google Dataflow (batch), Azure Data Factory Mapping Data Flows


When ETL Makes Sense

ETL is used when:

  • Schemas are stable and governed

  • You need a clean, curated data warehouse (star/snowflake)

  • Transformations are complex, business-logic-driven, or require strict validation

  • Downstream users expect consistent, high-quality BI datasets

Strengths

  • Produces a well-modeled, high-trust warehouse

  • Enforces data quality early in the pipeline

  • Optimized for reporting and analytics teams

Limitations

  • Rigid and slow to change (schema updates can be painful)

  • Hard to reprocess historical data if raw snapshots aren’t stored

  • Transformation logic can become tightly coupled to ETL tooling


How ETL Actually Runs

A traditional ETL flow looks like:

Source systems → Extract → Transform (external ETL/compute engine) → Load → Warehouse (curated, analytics-ready tables)

The key idea: the warehouse only receives processed, ready-to-use tables.

Transformations don’t happen “inside” the warehouse—they happen in external compute engines (ETL servers, Spark clusters, Glue jobs, etc.).


Does ETL Require Something Like S3?

Before the Cloud

Classic ETL tools operated without object storage. They typically used:

  • In-memory transformations

  • Temporary staging databases

  • ETL servers dedicated to data processing

The flow was simple:

Source systems → ETL server (in-memory or staging-database transforms) → Warehouse

In the Cloud Era

ETL often incorporates object storage like S3, ADLS, or GCS as a staging layer. A typical pattern today is:

Source systems → Extract to object storage (raw staging files) → Transform with Spark/Glue → Load curated tables into the warehouse

But the defining characteristic remains the same:

✔ Transformations happen before loading into the warehouse

✔ The warehouse holds only curated, processed data


How Transformations Occur in ETL Pipelines

1. In-Flight Transformations

Data is transformed as it moves through the pipeline.

Common tools: AWS Glue, Matillion, Spark batch jobs, Airbyte with transforms, Talend.

Example:
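
A minimal PySpark sketch of the in-flight pattern, assuming a hypothetical orders table in a source Postgres database and a warehouse reachable over JDBC (connection strings, credentials, and table names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_in_flight").getOrCreate()

# Extract: read raw orders straight from the source system (hypothetical Postgres instance)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/shop")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Transform: clean, validate, and enrich while the data is in flight,
# before anything touches the warehouse
curated = (orders
           .filter(F.col("status") != "cancelled")
           .withColumn("order_date", F.to_date("created_at"))
           .groupBy("customer_id", "order_date")
           .agg(F.sum("amount").alias("daily_spend")))

# Load: only the curated result lands in the warehouse (hypothetical JDBC endpoint)
(curated.write.format("jdbc")
        .option("url", "jdbc:postgresql://warehouse:5432/analytics")
        .option("dbtable", "analytics.daily_customer_spend")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("overwrite")
        .save())
```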

2. Staged Transformations

For heavy workloads, raw data may be landed first, transformed, then written back before loading.

Example:
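
A minimal sketch of the staged variant, assuming raw files are first landed in an object-storage bucket (bucket names, paths, and schemas are illustrative); note that the warehouse still only receives the processed output:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_staged").getOrCreate()

# Stage 1: the raw extract has already been landed as-is in object storage
raw = spark.read.json("s3a://staging-bucket/raw/events/2024-06-01/")

# Stage 2: heavy transformation over the staged files
cleaned = (raw
           .dropDuplicates(["event_id"])
           .filter(F.col("event_type").isNotNull())
           .withColumn("event_ts", F.to_timestamp("event_time")))

# Stage 3: write the processed data back to storage ...
cleaned.write.mode("overwrite").parquet("s3a://staging-bucket/processed/events/2024-06-01/")

# ... and load only the processed output into the warehouse (hypothetical JDBC endpoint)
(cleaned.write.format("jdbc")
        .option("url", "jdbc:postgresql://warehouse:5432/analytics")
        .option("dbtable", "analytics.events")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("append")
        .save())
```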

This still counts as ETL because the warehouse only receives data post-transformation.


ETL vs. ELT: The Key Difference

Although the patterns sound similar, the distinction is clear:

ETL

  • Compute happens outside the warehouse

  • Warehouse stores only transformed data

  • Requires external processing tools (Spark, Glue, Matillion, etc.)

ELT

  • Raw data is loaded first

  • Transformations run inside the warehouse

  • SQL-based transformations (typically via dbt)

A simple rule of thumb:

If the warehouse does the transformations, it’s ELT. If external compute does the transformations, it’s ETL.


ELT Architecture (Extract → Load → Transform)

If ETL is the traditional, rule-driven backbone of enterprise analytics, then ELT (Extract → Load → Transform) is its cloud-native successor — designed for agility, scale, and iteration. As cloud warehouses became powerful enough to handle massive data transformations internally, the industry began shifting from “transform first, load later” to “load first, transform inside.”

What ELT Really Means

In ELT, data pipelines follow a simple pattern:

  1. Extract → pull the data from source systems

  2. Load → store it raw, without major preprocessing

  3. Transform → reshape it directly inside the warehouse or lakehouse

This design flips the old ETL model: instead of doing all transformation before the warehouse, you bring data as-is into a storage or compute engine that can scale elastically — usually Snowflake, BigQuery, Databricks, Redshift, or a lakehouse table format.

The warehouse becomes both the storage and the transformation engine.

Why ELT Became the Default for Modern Data Teams

The rise of ELT coincides directly with two industry changes:

  1. Cloud warehouses became fast and cheap at compute

  2. Data teams wanted flexibility instead of rigid schemas

Agile analytics teams now push raw data into the warehouse as quickly as possible, then refine it collaboratively using SQL and tools like dbt or SQLMesh.

This approach supports:

  • Experimentation (model your data however you want, whenever you want)

  • Rapid iteration (change your dbt model, re-run it, done)

  • Re-processing (because you kept the raw layer)

  • Lineage and versioning (dbt, Delta Lake, Iceberg, etc.)

In ELT, the warehouse is no longer the final stop — it’s the heart of the transformation process.

The Typical ELT Stack

A “classic” ELT stack looks like:

  • Warehouse: Snowflake, BigQuery, Databricks SQL, Redshift

  • Transformation Layer: dbt Core or dbt Cloud

  • Orchestration: Airflow, Dagster, Prefect, or dbt Cloud scheduling

  • Ingestion: Fivetran, Airbyte, Meltano, Kafka Connect, or custom ingestion jobs

  • Storage (optional): S3 / ADLS / GCS as raw lake layers

Once raw data lands in the warehouse or lake, all structure and business logic are applied via SQL transformations in dbt.
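
To contrast with the ETL sketches above, here is a minimal ELT-style sketch in Python: the script only issues SQL, and all compute runs inside the warehouse. The DB-API driver, connection details, and SQL dialect are illustrative; in a real stack this statement would typically live in a dbt model:

```python
import psycopg2  # any warehouse DB-API driver works the same way; psycopg2 is only an example

conn = psycopg2.connect("host=warehouse dbname=analytics user=elt_user password=***")

# The raw table was loaded as-is by the ingestion tool (Fivetran, Airbyte, ...).
# The transformation itself is plain SQL executed inside the warehouse engine.
TRANSFORM_SQL = """
CREATE TABLE analytics.daily_customer_spend AS
SELECT
    customer_id,
    CAST(created_at AS DATE) AS order_date,
    SUM(amount)              AS daily_spend
FROM raw.orders
WHERE status <> 'cancelled'
GROUP BY customer_id, CAST(created_at AS DATE)
"""

with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS analytics.daily_customer_spend")
    cur.execute(TRANSFORM_SQL)
```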

ELT Limitations

ELT isn’t perfect — its tradeoffs reflect its design philosophy.

1. You Need a Strong Warehouse or Lakehouse

ELT assumes your warehouse can handle:

  • Large joins

  • Complex SQL transformations

  • Expensive compute workloads

Not all teams or budgets can support that.


2. Compute Cost Can Grow

Transformations move from external ETL servers → into your warehouse billing model.

Heavy dbt models, large fact tables, and multi-step DAGs can drive up costs if left unmanaged.

Best practices like clustering, partitioning, incremental models, and query optimization become essential.


Lambda Architecture: What It Was, Why It Existed, and Why It Faded

For years, Lambda Architecture was the go-to pattern for systems needing both real-time insights and accurate historical recomputation. It emerged when streaming systems weren’t powerful enough to guarantee correctness on their own. Today, its influence remains, but the architecture itself is mostly obsolete.

What Lambda Architecture Tried to Solve

Early big data stacks had a problem:

  • Batch systems (Hadoop, early Spark) were accurate but slow.

  • Streaming systems (Storm, early Kafka Streams) were fast but inaccurate and lacked exactly-once guarantees.

Lambda Architecture solved this with a two-path model:

Batch Layer

  • Stores all historical data (usually in HDFS or S3).

  • Periodically recomputes results from scratch.

  • Ensures accuracy, removes drift and errors.

Speed Layer

  • Consumes streaming data in real time (Kafka + Storm/Spark Streaming/Flink).

  • Produces low-latency, approximate results until the batch layer catches up.

Serving Layer

  • Merges:

    • Real-time output from the speed layer

    • Batch-computed output from the batch layer

  • Users always get the freshest possible view.


Where It Was Used

  • Fraud detection (fast decisions + later corrections)

  • Real-time dashboards with reconciliation

  • IoT systems before modern stream processors matured

Tech examples:

  • Batch: Spark batch on S3 or Hadoop

  • Speed: Kafka → Spark Streaming / Flink

  • Serving: Cassandra / HBase / precomputed tables

Why It Faded

Lambda Architecture is infamous for one thing:

You maintain two entire pipelines.

  • Two codebases

  • Two engines

  • Two data models

  • Two failure modes

  • Two sets of bugs

And you have to synchronize them.

Modern systems made this unnecessary:

  • Apache Flink, Kafka Streams, Spark Structured Streaming now provide:

    • Exactly-once semantics

    • Backfills

    • Stateful processing

    • Reprocessing and replay

  • Lakehouse architectures (Databricks, Snowflake Iceberg, BigQuery) enable:

    • Low-latency ingestion

    • Incremental processing

    • Reliable batch + streaming unification

This gave rise to the Kappa Architecture, where everything is streaming-first.

Most new architectures use:

  • Kappa Architecture

  • Delta Live Tables / Lakehouse streaming

  • Unified batch+streaming engines


Kappa Architecture: Streaming-First Data Systems Without the Complexity

Kappa Architecture emerged as a reaction to Lambda Architecture’s biggest flaw: two separate pipelines for the same data. Modern stream processors matured enough to make that duplication unnecessary. The idea behind Kappa is simple but powerful:

👉 Everything is a stream — even batch.

Instead of maintaining parallel batch and speed layers, Kappa uses a single streaming pipeline that can replay historical data whenever batch-like processing is needed.


The Core Idea

At the center of Kappa Architecture is the append-only event log — usually Kafka or a similar system. All data flows through it, and downstream processing engines treat the log as the single source of truth.

Here’s the simplified flow:

Source systems → Append-only event log (Kafka) → Streaming engine → Serving layer (lakehouse tables, materialized views, online stores)

Streaming Engine

  • Flink, Kafka Streams, Spark Structured Streaming

  • Applies transformations continuously

  • Maintains state with exactly-once guarantees

Reprocessing

  • No separate batch layer

  • To recompute historical results, simply rewind and replay the log (see the sketch after this list)

  • Same code, same pipeline → no drift, no duplication
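
A minimal sketch of the replay idea using Spark Structured Streaming against Kafka (topic name, brokers, and output paths are illustrative; the Delta sink assumes the delta-spark package is available). The only thing that changes for a full recomputation is where the job starts reading:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa_replay").getOrCreate()

# Reprocessing = pointing the same streaming job back at the start of the log.
# A normal run would use startingOffsets="latest" (or committed checkpoints);
# for a full recomputation you simply start from "earliest".
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")  # rewind: same code, full history
          .load())

parsed = events.selectExpr("CAST(key AS STRING) AS key",
                           "CAST(value AS STRING) AS value",
                           "timestamp")

# The same pipeline writes its results to the serving layer (here: a Delta table path)
query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation", "s3a://lake/checkpoints/events_replay/")
         .outputMode("append")
         .start("s3a://lake/tables/events/"))

query.awaitTermination()
```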

Storage / Serving

  • Lakehouse tables (Delta, Iceberg, Hudi)

  • Online stores, materialized views, analytic tables


When Kappa Architecture Works Best

  • Real-time systems where streaming is the default

  • Event-driven pipelines

  • IoT, finance, clickstreams, telemetry, fraud detection

  • Teams that want a single code path rather than two

It especially shines when:

  • You want to replay historical data easily

  • You do frequent schema or pipeline changes

  • Your system already depends on Kafka as a source of truth


Pros

One code path

No separate batch vs streaming logic. Less maintenance, fewer bugs, easier onboarding.

Reprocessing is trivial

Just reset offsets or re-read the log. Ideal for evolving ML features, metrics, and analytics.

Scales naturally

Streaming engines scale horizontally and handle high-throughput event data.

Fits lakehouses perfectly

Modern Delta/Iceberg/Hudi tables integrate cleanly with streaming ingestion.


Cons

Not great for massive one-off batch jobs

If you need to crunch 10 years of archives once, a big Spark batch job is still simpler.

Dependent on event log retention

If Kafka only retains 7 days of data, you simply cannot reprocess older history. A classic workaround is tiered storage or lakehouse-backed Kafka.

Requires event-first thinking

Systems not designed around events may need to be restructured.


Lakehouse Architecture: The Modern Unified Data Platform

Lakehouse architecture emerged to fix a long-standing problem: data warehouses are great for analytics but expensive and rigid, while data lakes are flexible and cheap but notoriously messy. The Lakehouse combines both worlds — the low-cost storage of a data lake with the management and performance features of a warehouse.

It’s now the dominant architecture behind modern data platforms, supporting both batch and streaming workloads.


Why Lakehouses Exist

Historically:

  • Data warehouses enforced schema, ACID guarantees, governance, and fast SQL — but storing raw data was expensive.

  • Data lakes allowed dumping raw files cheaply — but lacked schema, transactions, and reliability, making analytics unreliable.

A Lakehouse solves both by adding a table format on top of a data lake that brings ACID guarantees and schema management.


The Core Concept

A Lakehouse is built around open table formats running on cheap object storage (S3, ADLS, GCS):

🔑 Table Formats

  • Delta Lake (Databricks)

  • Apache Iceberg

  • Apache Hudi

These formats introduce:

  • ACID transactions

  • Schema enforcement + evolution

  • Time travel

  • Partitioning & indexing

  • Compaction / clustering

  • Streaming + batch unification

This turns a simple data lake into something that behaves like a warehouse, but at lake scale and cost.
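
As a rough illustration of those guarantees, here is a small PySpark sketch with Delta Lake (assuming delta-spark is configured; the table path and data are illustrative) showing an atomic append, schema enforcement, and time travel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse_table_formats").getOrCreate()

path = "s3a://lake/tables/customers/"

# ACID write: the commit either fully succeeds or is invisible to readers
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("append").save(path)

# Schema enforcement: a write with an incompatible schema is rejected
bad = spark.createDataFrame([(3, "carol", "extra")], ["id", "name", "unexpected_col"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"write rejected by schema enforcement: {err}")

# Time travel: read the table as it was at an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```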


How a Lakehouse Works

1. Raw Layer

  • Raw files stored in object storage

  • Often partitioned by date or source

  • No transformation

2. Bronze / Silver / Gold Layers (most common pattern)

  • Bronze: raw ingested data

  • Silver: cleaned, conformed tables

  • Gold: business-ready aggregates and marts

These layers live as Delta/Iceberg/Hudi tables, enabling incremental updates and safe merges.
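
A minimal sketch of the Bronze/Silver/Gold flow with PySpark and Delta tables (paths, column names, and the cleaning/aggregation rules are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: raw ingested data, stored as-is
bronze = spark.read.json("s3a://lake/raw/orders/")
bronze.write.format("delta").mode("append").save("s3a://lake/bronze/orders/")

# Silver: cleaned, conformed table
silver = (spark.read.format("delta").load("s3a://lake/bronze/orders/")
          .dropDuplicates(["order_id"])
          .filter(F.col("amount") > 0)
          .withColumn("order_date", F.to_date("created_at")))
silver.write.format("delta").mode("overwrite").save("s3a://lake/silver/orders/")

# Gold: business-ready aggregate for BI
gold = (silver.groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"),
             F.countDistinct("customer_id").alias("customers")))
gold.write.format("delta").mode("overwrite").save("s3a://lake/gold/daily_revenue/")
```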

3. Compute Engines

A Lakehouse supports multiple engines at the same time:

  • Spark

  • Flink

  • Presto/Trino

  • Snowflake (Iceberg support)

  • Data warehouse engines querying Lakehouse tables via connectors

This separation of storage and compute is key.


Benefits of a Lakehouse

✔ 1. Unified batch and streaming

Delta, Iceberg, and Hudi support incremental ingestion, allowing the same table to accept all of the following (a short sketch follows the list):

  • Streaming upserts from Kafka

  • Batch loads from files

  • Reprocessing / replay
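
As a sketch (reusing illustrative paths in the same spirit as above), the same Delta table can receive a continuous stream from Kafka and an occasional batch backfill, with no separate pipeline for either:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified_ingest").getOrCreate()
events_path = "s3a://lake/bronze/events/"

# Streaming ingestion: continuous appends from Kafka into the table
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

(stream.writeStream.format("delta")
       .option("checkpointLocation", "s3a://lake/checkpoints/events/")
       .outputMode("append")
       .start(events_path))

# Batch ingestion: an archived extract (assumed to share the table schema)
# is appended to the *same* table, with no second pipeline or data model
backfill = spark.read.parquet("s3a://lake/raw/events_archive/")
backfill.write.format("delta").mode("append").save(events_path)
```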

✔ 2. Open, vendor-neutral storage

Unlike warehouses, all data sits in:

  • S3, ADLS, or GCS

  • In open file formats (Parquet)

  • With open metadata (table formats)

✔ 3. ACID reliability

No more “dirty reads” or “partial files.” Transactions guarantee atomic writes.

✔ 4. Lower cost than warehouses

You only pay for compute when engines run, not for continuous warehouse storage.

✔ 5. Supports multiple use cases

  • BI and SQL analytics

  • Real-time pipelines

  • ML and feature stores

  • CDC ingestion

  • Data sharing

All from the same source of truth.


Drawbacks / Trade-offs

✖ More operational overhead than a SaaS warehouse

You manage ingestion, compaction, table maintenance.

✖ Compute engine choices can be overwhelming

Spark? Flink? Trino? Snowflake as query engine?

✖ Not “fully managed” (unless on Databricks)

Open Lakehouse deployments require DevOps/Platform Engineering.


When to Use a Lakehouse

Use it when you need:

  • Large-scale analytics (TB–PB)

  • Unified streaming + batch pipelines

  • Low-cost raw data storage

  • Machine learning workflows

  • Multi-engine flexibility

  • Open data that’s not tied to a single vendor

If your world is real-time or event-driven (Kafka, Flink), a Lakehouse complements Kappa architecture extremely well.


Reverse ETL

Reverse ETL refers to the process of taking processed, enriched, or modeled data from an OLAP system—typically a data warehouse or lakehouse—and loading it back into operational source systems such as CRMs, marketing platforms, SaaS tools, or internal applications.

Although the term became popular in the late 2010s, the practice itself has existed for much longer. Data engineers have long built ad-hoc pipelines that push analytical data back into operational systems. What changed is the recognition of this process as a formal category and the rise of dedicated tools that automate it.


Why Reverse ETL Matters

Reverse ETL is about data activation—getting insights into the systems where end users work, not just into dashboards.

Example:

  • CRM + warehouse → model generates lead scores → sales team needs those scores inside Salesforce/HubSpot, not in a BI dashboard or CSV.

Reverse ETL reduces user friction by embedding insights directly into operational workflows.

Because of this, reverse ETL is becoming a core responsibility of data teams, similar to building forward ETL pipelines.


How Reverse ETL Works

Reverse ETL pipelines extract prepared data from the warehouse, transform it if necessary, and load it into operational systems. Many modern tools (open-source or commercial) provide connectors, transformations, and scheduling out of the box.
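
A minimal reverse ETL sketch in Python, assuming lead scores already modeled in the warehouse and a hypothetical CRM REST endpoint (the driver, table, URL, and token are all illustrative; real tools add batching, retries, and rate limiting):

```python
import psycopg2   # any warehouse DB-API driver; illustrative choice
import requests

CRM_URL = "https://api.example-crm.com/v1/contacts/{contact_id}"  # hypothetical endpoint
API_TOKEN = "***"

# Extract the modeled data from the warehouse
conn = psycopg2.connect("host=warehouse dbname=analytics user=sync_user password=***")
with conn, conn.cursor() as cur:
    cur.execute("SELECT contact_id, lead_score FROM analytics.lead_scores")
    rows = cur.fetchall()

# Load it back into the operational system, record by record
for contact_id, lead_score in rows:
    resp = requests.patch(
        CRM_URL.format(contact_id=contact_id),
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"lead_score": lead_score},
        timeout=10,
    )
    resp.raise_for_status()
```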

However, the vendor landscape is evolving rapidly—mergers, acquisitions, and absorption into broader data platforms are expected.


Risks and Considerations

Reverse ETL creates feedback loops between analytics and operational systems. Example:

  • Pull Google Ads data → model bids → push new bids back into Google Ads → repeat.

An error in the model can cause runaway spending or harmful automated behavior.

To avoid these issues:

  • Implement monitoring

  • Set rate limits and guardrails

  • Use human-in-the-loop approvals for sensitive workflows

  • Track lineage and metadata

Reverse ETL pipelines must be treated as operational systems, not “just data workflows.”


Additional Insights for Modern Data Engineering

Reverse ETL vs. Operational Data Sync

Reverse ETL is distinct from:

  • Operational data replication (e.g., syncing source → source)

  • CDP activation (customer profiles pushed to channels)

  • API-based application integrations

Reverse ETL specifically activates analytical data.


Reverse ETL and Data Contracts

Because data is pushed into operational systems:

  • Schema stability becomes critical

  • Operational systems often require strict APIs

  • Failures can cause downstream breakage for business users

Reverse ETL pipelines benefit from contracts and strict SLAs.


Reverse ETL and Warehouse Performance

Reverse ETL often triggers:

  • Frequent warehouse queries

  • Large data extractions

  • Incremental sync logic

To avoid performance degradation:

  • Use CDC-style incremental syncs (a minimal sketch follows this list)

  • Use materialized views for extraction

  • Consider scheduler throttling to reduce warehouse strain
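
A minimal sketch of an incremental, high-watermark extraction, which avoids re-scanning the whole table on every sync (the watermark store, driver, and column names are illustrative; CDC-based tools achieve the same effect from the warehouse's change feed):

```python
import json
import psycopg2  # illustrative warehouse driver

WATERMARK_FILE = "last_sync.json"  # hypothetical watermark store (often a small state table)

def load_watermark() -> str:
    try:
        with open(WATERMARK_FILE) as f:
            return json.load(f)["updated_at"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00"

def save_watermark(value: str) -> None:
    with open(WATERMARK_FILE, "w") as f:
        json.dump({"updated_at": value}, f)

conn = psycopg2.connect("host=warehouse dbname=analytics user=sync_user password=***")
watermark = load_watermark()

# Only rows that changed since the last sync are pulled from the warehouse
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT contact_id, lead_score, updated_at "
        "FROM analytics.lead_scores WHERE updated_at > %s ORDER BY updated_at",
        (watermark,),
    )
    rows = cur.fetchall()

if rows:
    # ... push `rows` to the operational system as in the earlier sketch ...
    save_watermark(rows[-1][2].isoformat())
```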


Security and Governance

Reverse ETL can push sensitive PII or operational data into external SaaS systems.

Think about:

  • GDPR/CCPA compliance

  • Data minimization

  • Row-level filtering

  • Fine-grained access control in target systems

Embedding governance into reverse ETL is essential.


Reverse ETL in the Lakehouse Era

Modern table formats (Delta/Iceberg/Hudi) are making it easier to build reverse ETL pipelines from lakehouses (not just warehouses). This allows:

  • Streaming activation

  • Near-real-time customer 360 updates

  • Model predictions delivered within minutes

Reverse ETL is becoming a natural extension of the lakehouse architecture.


Reverse ETL—despite the misleading name—enables the operationalization of analytics by syncing enriched warehouse data back into the business systems where it creates real value. Although the practice is well-established, modern tools and lakehouse technologies have turned it into a first-class data engineering pattern. Teams must remain cautious about feedback loops, governance, and reliability—because reverse ETL pipelines now sit directly in the critical path of operational decision-making.

