Batch architectures

Architectures built around batch data processing.


Traditional ETL Architecture in Modern Data Engineering

Batch data processing has evolved significantly, but traditional ETL (Extract → Transform → Load) remains one of the foundational patterns in data engineering. Even as cloud warehouses and ELT workflows grow in popularity, ETL is still widely used when data quality, structure, and governance are top priorities. Here’s a clear breakdown of what ETL really is, when it makes sense, and how it works today.


What ETL Is

ETL is a workflow where data is:

  1. Extracted from source systems

  2. Transformed before it reaches the warehouse

  3. Loaded into a curated, analytics-ready schema

Unlike ELT, all heavy lifting—cleaning, joining, validating, enriching—happens outside the warehouse.

Typical Technology Stack

  • On-premises: Informatica, Oracle Data Integrator (ODI), Pentaho

  • Cloud equivalents: AWS Glue, Google Dataflow (batch), Azure Data Factory Mapping Data Flows


When ETL Makes Sense

ETL is used when:

  • Schemas are stable and governed

  • You need a clean, curated data warehouse (star/snowflake)

  • Transformations are complex, business-logic-driven, or require strict validation

  • Downstream users expect consistent, high-quality BI datasets

Strengths

  • Produces a well-modeled, high-trust warehouse

  • Enforces data quality early in the pipeline

  • Optimized for reporting and analytics teams

Limitations

  • Rigid and slow to change (schema updates can be painful)

  • Hard to reprocess historical data if raw snapshots aren’t stored

  • Transformation logic can become tightly coupled to ETL tooling


How ETL Actually Runs

A traditional ETL flow looks like:

Source systems → Extract → Transform (external ETL/compute engine) → Load → Warehouse (curated, analytics-ready tables)

The key idea: the warehouse only receives processed, ready-to-use tables.

Transformations don’t happen “inside” the warehouse—they happen in external compute engines (ETL servers, Spark clusters, Glue jobs, etc.).


Does ETL Require Something Like S3?

Before the Cloud

Classic ETL tools operated without object storage. They typically used:

  • In-memory transformations

  • Temporary staging databases

  • ETL servers dedicated to data processing

The flow was simple:

Source systems → ETL server (in-memory or staging-database transforms) → Warehouse

In the Cloud Era

ETL often incorporates object storage like S3, ADLS, or GCS as a staging layer. A typical pattern today is:

Source systems → Extract to object storage (raw staging files) → Transform with Spark/Glue → Load curated tables into the warehouse

But the defining characteristic remains the same:

✔ Transformations happen before loading into the warehouse

✔ The warehouse holds only curated, processed data


How Transformations Occur in ETL Pipelines

1. In-Flight Transformations

Data is transformed as it moves through the pipeline.

Common tools: AWS Glue, Matillion, Spark batch jobs, Airbyte with transforms, Talend.

Example:
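
A minimal PySpark sketch of the in-flight pattern, assuming a hypothetical orders table in a source Postgres database and a warehouse reachable over JDBC (connection strings, credentials, and table names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_in_flight").getOrCreate()

# Extract: read raw orders straight from the source system (hypothetical Postgres instance)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/shop")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Transform: clean, validate, and enrich while the data is in flight,
# before anything touches the warehouse
curated = (orders
           .filter(F.col("status") != "cancelled")
           .withColumn("order_date", F.to_date("created_at"))
           .groupBy("customer_id", "order_date")
           .agg(F.sum("amount").alias("daily_spend")))

# Load: only the curated result lands in the warehouse (hypothetical JDBC endpoint)
(curated.write.format("jdbc")
        .option("url", "jdbc:postgresql://warehouse:5432/analytics")
        .option("dbtable", "analytics.daily_customer_spend")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("overwrite")
        .save())
```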

2. Staged Transformations

For heavy workloads, raw data may be landed first, transformed, then written back before loading.

Example:
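
A minimal sketch of the staged variant, assuming raw files are first landed in an object-storage bucket (bucket names, paths, and schemas are illustrative); note that the warehouse still only receives the processed output:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_staged").getOrCreate()

# Stage 1: the raw extract has already been landed as-is in object storage
raw = spark.read.json("s3a://staging-bucket/raw/events/2024-06-01/")

# Stage 2: heavy transformation over the staged files
cleaned = (raw
           .dropDuplicates(["event_id"])
           .filter(F.col("event_type").isNotNull())
           .withColumn("event_ts", F.to_timestamp("event_time")))

# Stage 3: write the processed data back to storage ...
cleaned.write.mode("overwrite").parquet("s3a://staging-bucket/processed/events/2024-06-01/")

# ... and load only the processed output into the warehouse (hypothetical JDBC endpoint)
(cleaned.write.format("jdbc")
        .option("url", "jdbc:postgresql://warehouse:5432/analytics")
        .option("dbtable", "analytics.events")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("append")
        .save())
```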

This still counts as ETL because the warehouse only receives data post-transformation.


ETL vs. ELT: The Key Difference

Although the patterns sound similar, the distinction is clear:

ETL

  • Compute happens outside the warehouse

  • Warehouse stores only transformed data

  • Requires external processing tools (Spark, Glue, Matillion, etc.)

ELT

  • Raw data is loaded first

  • Transformations run inside the warehouse

  • SQL-based transformations (typically via dbt)

A simple rule of thumb:

If the warehouse does the transformations, it’s ELT. If external compute does the transformations, it’s ETL.


ELT Architecture (Extract → Load → Transform)

If ETL is the traditional, rule-driven backbone of enterprise analytics, then ELT (Extract → Load → Transform) is its cloud-native successor — designed for agility, scale, and iteration. As cloud warehouses became powerful enough to handle massive data transformations internally, the industry began shifting from “transform first, load later” to “load first, transform inside.”

What ELT Really Means

In ELT, data pipelines follow a simple pattern:

  1. Extract → pull the data from source systems

  2. Load → store it raw, without major preprocessing

  3. Transform → reshape it directly inside the warehouse or lakehouse

This design flips the old ETL model: instead of doing all transformation before the warehouse, you bring data as-is into a storage or compute engine that can scale elastically — usually Snowflake, BigQuery, Databricks, Redshift, or a lakehouse table format.

The warehouse becomes both the storage and the transformation engine.

Why ELT Became the Default for Modern Data Teams

The rise of ELT coincides directly with two industry changes:

  1. Cloud warehouses became fast and cheap at compute

  2. Data teams wanted flexibility instead of rigid schemas

Agile analytics teams now push raw data into the warehouse as quickly as possible, then refine it collaboratively using SQL and tools like dbt or SQLMesh.

This approach supports:

  • Experimentation (model your data however you want, whenever you want)

  • Rapid iteration (change your dbt model, re-run it, done)

  • Re-processing (because you kept the raw layer)

  • Lineage and versioning (dbt, Delta Lake, Iceberg, etc.)

In ELT, the warehouse is no longer the final stop — it’s the heart of the transformation process.

The Typical ELT Stack

A “classic” ELT stack looks like:

  • Warehouse: Snowflake, BigQuery, Databricks SQL, Redshift

  • Transformation Layer: dbt Core or dbt Cloud

  • Orchestration: Airflow, Dagster, Prefect, or dbt Cloud scheduling

  • Ingestion: Fivetran, Airbyte, Meltano, Kafka Connect, or custom ingestion jobs

  • Storage (optional): S3 / ADLS / GCS as raw lake layers

Once raw data lands in the warehouse or lake, all structure and business logic are applied via SQL transformations in dbt.
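
To contrast with the ETL sketches above, here is a minimal ELT-style sketch in Python: the script only issues SQL, and all compute runs inside the warehouse. The DB-API driver, connection details, and SQL dialect are illustrative; in a real stack this statement would typically live in a dbt model:

```python
import psycopg2  # any warehouse DB-API driver works the same way; psycopg2 is only an example

conn = psycopg2.connect("host=warehouse dbname=analytics user=elt_user password=***")

# The raw table was loaded as-is by the ingestion tool (Fivetran, Airbyte, ...).
# The transformation itself is plain SQL executed inside the warehouse engine.
TRANSFORM_SQL = """
CREATE TABLE analytics.daily_customer_spend AS
SELECT
    customer_id,
    CAST(created_at AS DATE) AS order_date,
    SUM(amount)              AS daily_spend
FROM raw.orders
WHERE status <> 'cancelled'
GROUP BY customer_id, CAST(created_at AS DATE)
"""

with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS analytics.daily_customer_spend")
    cur.execute(TRANSFORM_SQL)
```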

ELT Limitations

ELT isn’t perfect — its tradeoffs reflect its design philosophy.

1. You Need a Strong Warehouse or Lakehouse

ELT assumes your warehouse can handle:

  • Large joins

  • Complex SQL transformations

  • Expensive compute workloads

Not all teams or budgets can support that.


2. Compute Cost Can Grow

Transformations move from external ETL servers → into your warehouse billing model.

Heavy dbt models, large fact tables, and multi-step DAGs can drive up costs if left unmanaged.

Best practices like clustering, partitioning, incremental models, and query optimization become essential.


Lambda Architecture: What It Was, Why It Existed, and Why It Faded

For years, Lambda Architecture was the go-to pattern for systems needing both real-time insights and accurate historical recomputation. It emerged when streaming systems weren’t powerful enough to guarantee correctness on their own. Today, its influence remains, but the architecture itself is mostly obsolete.

What Lambda Architecture Tried to Solve

Early big data stacks had a problem:

  • Batch systems (Hadoop, early Spark) were accurate but slow.

  • Streaming systems (Storm, early Kafka Streams) were fast but inaccurate and lacked exactly-once guarantees.

Lambda Architecture solved this with a two-path model:

Batch Layer

  • Stores all historical data (usually in HDFS or S3).

  • Periodically recomputes results from scratch.

  • Ensures accuracy, removes drift and errors.

Speed Layer

  • Consumes streaming data in real time (Kafka + Storm/Spark Streaming/Flink).

  • Produces low-latency, approximate results until the batch layer catches up.

Serving Layer

  • Merges:

    • Real-time output from the speed layer

    • Batch-computed output from the batch layer

  • Users always get the freshest possible view.


Where It Was Used

  • Fraud detection (fast decisions + later corrections)

  • Real-time dashboards with reconciliation

  • IoT systems before modern stream processors matured

Tech examples:

  • Batch: Spark batch on S3 or Hadoop

  • Speed: Kafka → Spark Streaming / Flink

  • Serving: Cassandra / HBase / precomputed tables

Why It Faded

Lambda Architecture is infamous for one thing:

You maintain two entire pipelines.

  • Two codebases

  • Two engines

  • Two data models

  • Two failure modes

  • Two sets of bugs

And you have to synchronize them.

Modern systems made this unnecessary:

  • Apache Flink, Kafka Streams, Spark Structured Streaming now provide:

    • Exactly-once semantics

    • Backfills

    • Stateful processing

    • Reprocessing and replay

  • Lakehouse architectures (Databricks, Snowflake Iceberg, BigQuery) enable:

    • Low-latency ingestion

    • Incremental processing

    • Reliable batch + streaming unification

This gave rise to the Kappa Architecture, where everything is streaming-first.

Most new architectures use:

  • Kappa Architecture

  • Delta Live Tables / Lakehouse streaming

  • Unified batch+streaming engines


Kappa Architecture: Streaming-First Data Systems Without the Complexity

Kappa Architecture emerged as a reaction to Lambda Architecture’s biggest flaw: two separate pipelines for the same data. Modern stream processors matured enough to make that duplication unnecessary. The idea behind Kappa is simple but powerful:

👉 Everything is a stream — even batch.

Instead of maintaining parallel batch and speed layers, Kappa uses a single streaming pipeline that can replay historical data whenever batch-like processing is needed.


The Core Idea

At the center of Kappa Architecture is the append-only event log — usually Kafka or a similar system. All data flows through it, and downstream processing engines treat the log as the single source of truth.

Here’s the simplified flow:

Source systems → Append-only event log (Kafka) → Streaming engine → Serving layer (lakehouse tables, materialized views, online stores)

Streaming Engine

  • Flink, Kafka Streams, Spark Structured Streaming

  • Applies transformations continuously

  • Maintains state with exactly-once guarantees

Reprocessing

  • No separate batch layer

  • To recompute historical results, simply rewind and replay the log (see the sketch after this list)

  • Same code, same pipeline → no drift, no duplication
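
A minimal sketch of the replay idea using Spark Structured Streaming against Kafka (topic name, brokers, and output paths are illustrative; the Delta sink assumes the delta-spark package is available). The only thing that changes for a full recomputation is where the job starts reading:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa_replay").getOrCreate()

# Reprocessing = pointing the same streaming job back at the start of the log.
# A normal run would use startingOffsets="latest" (or committed checkpoints);
# for a full recomputation you simply start from "earliest".
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "earliest")  # rewind: same code, full history
          .load())

parsed = events.selectExpr("CAST(key AS STRING) AS key",
                           "CAST(value AS STRING) AS value",
                           "timestamp")

# The same pipeline writes its results to the serving layer (here: a Delta table path)
query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation", "s3a://lake/checkpoints/events_replay/")
         .outputMode("append")
         .start("s3a://lake/tables/events/"))

query.awaitTermination()
```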

Storage / Serving

  • Lakehouse tables (Delta, Iceberg, Hudi)

  • Online stores, materialized views, analytic tables


When Kappa Architecture Works Best

  • Real-time systems where streaming is the default

  • Event-driven pipelines

  • IoT, finance, clickstreams, telemetry, fraud detection

  • Teams that want a single code path rather than two

It especially shines when:

  • You want to replay historical data easily

  • You do frequent schema or pipeline changes

  • Your system already depends on Kafka as a source of truth


Pros

One code path

No separate batch vs streaming logic. Less maintenance, fewer bugs, easier onboarding.

Reprocessing is trivial

Just reset offsets or re-read the log. Ideal for evolving ML features, metrics, and analytics.

Scales naturally

Streaming engines scale horizontally and handle high-throughput event data.

Fits lakehouses perfectly

Modern Delta/Iceberg/Hudi tables integrate cleanly with streaming ingestion.


Cons

Not great for massive one-off batch jobs

If you need to crunch 10 years of archives once, a big Spark batch job is still simpler.

Dependent on event log retention

If Kafka only retains 7 days of data, you simply cannot reprocess older history. A classic workaround is tiered storage or lakehouse-backed Kafka.

Requires event-first thinking

Systems not designed around events may need to be restructured.


Lakehouse Architecture: The Modern Unified Data Platform

Lakehouse architecture emerged to fix a long-standing problem: data warehouses are great for analytics but expensive and rigid, while data lakes are flexible and cheap but notoriously messy. The Lakehouse combines both worlds — the low-cost storage of a data lake with the management and performance features of a warehouse.

It’s now the dominant architecture behind modern data platforms, supporting both batch and streaming workloads.


Why Lakehouses Exist

Historically:

  • Data warehouses enforced schema, ACID guarantees, governance, and fast SQL — but storing raw data was expensive.

  • Data lakes allowed dumping raw files cheaply — but lacked schema, transactions, and reliability, making analytics unreliable.

A Lakehouse solves both by adding a table format on top of a data lake that brings ACID guarantees and schema management.


The Core Concept

A Lakehouse is built around open table formats running on cheap object storage (S3, ADLS, GCS):

🔑 Table Formats

  • Delta Lake (Databricks)

  • Apache Iceberg

  • Apache Hudi

These formats introduce:

  • ACID transactions

  • Schema enforcement + evolution

  • Time travel

  • Partitioning & indexing

  • Compaction / clustering

  • Streaming + batch unification

This turns a simple data lake into something that behaves like a warehouse, but at lake scale and cost.
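
As a rough illustration of those guarantees, here is a small PySpark sketch with Delta Lake (assuming delta-spark is configured; the table path and data are illustrative) showing an atomic append, schema enforcement, and time travel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse_table_formats").getOrCreate()

path = "s3a://lake/tables/customers/"

# ACID write: the commit either fully succeeds or is invisible to readers
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("append").save(path)

# Schema enforcement: a write with an incompatible schema is rejected
bad = spark.createDataFrame([(3, "carol", "extra")], ["id", "name", "unexpected_col"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"write rejected by schema enforcement: {err}")

# Time travel: read the table as it was at an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```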


How a Lakehouse Works

1. Raw Layer

  • Raw files stored in object storage

  • Often partitioned by date or source

  • No transformation

2. Bronze / Silver / Gold Layers (most common pattern)

  • Bronze: raw ingested data

  • Silver: cleaned, conformed tables

  • Gold: business-ready aggregates and marts

These layers live as Delta/Iceberg/Hudi tables, enabling incremental updates and safe merges.
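
A minimal sketch of the Bronze/Silver/Gold flow with PySpark and Delta tables (paths, column names, and the cleaning/aggregation rules are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: raw ingested data, stored as-is
bronze = spark.read.json("s3a://lake/raw/orders/")
bronze.write.format("delta").mode("append").save("s3a://lake/bronze/orders/")

# Silver: cleaned, conformed table
silver = (spark.read.format("delta").load("s3a://lake/bronze/orders/")
          .dropDuplicates(["order_id"])
          .filter(F.col("amount") > 0)
          .withColumn("order_date", F.to_date("created_at")))
silver.write.format("delta").mode("overwrite").save("s3a://lake/silver/orders/")

# Gold: business-ready aggregate for BI
gold = (silver.groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"),
             F.countDistinct("customer_id").alias("customers")))
gold.write.format("delta").mode("overwrite").save("s3a://lake/gold/daily_revenue/")
```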

3. Compute Engines

A Lakehouse supports multiple engines at the same time:

  • Spark

  • Flink

  • Presto/Trino

  • Snowflake (Iceberg support)

  • Data warehouse engines querying Lakehouse tables via connectors

This separation of storage and compute is key.


Benefits of a Lakehouse

✔ 1. Unified batch and streaming

Delta, Iceberg, and Hudi support incremental ingestion, allowing the same table to accept all of the following (a short sketch follows the list):

  • Streaming upserts from Kafka

  • Batch loads from files

  • Reprocessing / replay
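
As a sketch (reusing illustrative paths in the same spirit as above), the same Delta table can receive a continuous stream from Kafka and an occasional batch backfill, with no separate pipeline for either:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified_ingest").getOrCreate()
events_path = "s3a://lake/bronze/events/"

# Streaming ingestion: continuous appends from Kafka into the table
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

(stream.writeStream.format("delta")
       .option("checkpointLocation", "s3a://lake/checkpoints/events/")
       .outputMode("append")
       .start(events_path))

# Batch ingestion: an archived extract (assumed to share the table schema)
# is appended to the *same* table, with no second pipeline or data model
backfill = spark.read.parquet("s3a://lake/raw/events_archive/")
backfill.write.format("delta").mode("append").save(events_path)
```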

✔ 2. Open, vendor-neutral storage

Unlike warehouses, all data sits in:

  • S3, ADLS, or GCS

  • In open file formats (Parquet)

  • With open metadata (table formats)

✔ 3. ACID reliability

No more “dirty reads” or “partial files.” Transactions guarantee atomic writes.

✔ 4. Lower cost than warehouses

You only pay for compute when engines run, not for continuous warehouse storage.

✔ 5. Supports multiple use cases

  • BI and SQL analytics

  • Real-time pipelines

  • ML and feature stores

  • CDC ingestion

  • Data sharing

All from the same source of truth.


Drawbacks / Trade-offs

✖ More operational overhead than a SaaS warehouse

You manage ingestion, compaction, table maintenance.

✖ Compute engine choices can be overwhelming

Spark? Flink? Trino? Snowflake as query engine?

✖ Not “fully managed” (unless on Databricks)

Open Lakehouse deployments require DevOps/Platform Engineering.


When to Use a Lakehouse

Use it when you need:

  • Large-scale analytics (TB–PB)

  • Unified streaming + batch pipelines

  • Low-cost raw data storage

  • Machine learning workflows

  • Multi-engine flexibility

  • Open data that’s not tied to a single vendor

If your world is real-time or event-driven (Kafka, Flink), a Lakehouse complements Kappa architecture extremely well.


Reverse ETL

Reverse ETL refers to the process of taking processed, enriched, or modeled data from an OLAP system—typically a data warehouse or lakehouse—and loading it back into operational source systems such as CRMs, marketing platforms, SaaS tools, or internal applications.

Although the term became popular in the late 2010s, the practice itself has existed for much longer. Data engineers have long built ad-hoc pipelines that push analytical data back into operational systems. What changed is the recognition of this process as a formal category and the rise of dedicated tools that automate it.


Why Reverse ETL Matters

Reverse ETL is about data activation—getting insights into the systems where end users work, not just into dashboards.

Example:

  • CRM + warehouse → model generates lead scores → sales team needs those scores inside Salesforce/HubSpot, not in a BI dashboard or CSV.

Reverse ETL reduces user friction by embedding insights directly into operational workflows.

Because of this, reverse ETL is becoming a core responsibility of data teams, similar to building forward ETL pipelines.


How Reverse ETL Works

Reverse ETL pipelines extract prepared data from the warehouse, transform it if necessary, and load it into operational systems. Many modern tools (open-source or commercial) provide connectors, transformations, and scheduling out of the box.
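
A minimal reverse ETL sketch in Python, assuming lead scores already modeled in the warehouse and a hypothetical CRM REST endpoint (the driver, table, URL, and token are all illustrative; real tools add batching, retries, and rate limiting):

```python
import psycopg2   # any warehouse DB-API driver; illustrative choice
import requests

CRM_URL = "https://api.example-crm.com/v1/contacts/{contact_id}"  # hypothetical endpoint
API_TOKEN = "***"

# Extract the modeled data from the warehouse
conn = psycopg2.connect("host=warehouse dbname=analytics user=sync_user password=***")
with conn, conn.cursor() as cur:
    cur.execute("SELECT contact_id, lead_score FROM analytics.lead_scores")
    rows = cur.fetchall()

# Load it back into the operational system, record by record
for contact_id, lead_score in rows:
    resp = requests.patch(
        CRM_URL.format(contact_id=contact_id),
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"lead_score": lead_score},
        timeout=10,
    )
    resp.raise_for_status()
```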

However, the vendor landscape is evolving rapidly—mergers, acquisitions, and absorption into broader data platforms are expected.


Risks and Considerations

Reverse ETL creates feedback loops between analytics and operational systems. Example:

  • Pull Google Ads data → model bids → push new bids back into Google Ads → repeat.

An error in the model can cause runaway spending or harmful automated behavior.

To avoid these issues:

  • Implement monitoring

  • Set rate limits and guardrails

  • Use human-in-the-loop approvals for sensitive workflows

  • Track lineage and metadata

Reverse ETL pipelines must be treated as operational systems, not “just data workflows.”


Additional Insights for Modern Data Engineering

Reverse ETL vs. Operational Data Sync

Reverse ETL is distinct from:

  • Operational data replication (e.g., syncing source → source)

  • CDP activation (customer profiles pushed to channels)

  • API-based application integrations

Reverse ETL specifically activates analytical data.


Reverse ETL and Data Contracts

Because data is pushed into operational systems:

  • Schema stability becomes critical

  • Operational systems often require strict APIs

  • Failures can cause downstream breakage for business users

Reverse ETL pipelines benefit from contracts and strict SLAs.


Reverse ETL and Warehouse Performance

Reverse ETL often triggers:

  • Frequent warehouse queries

  • Large data extractions

  • Incremental sync logic

To avoid performance degradation:

  • Use CDC-style incremental syncs (a minimal sketch follows this list)

  • Use materialized views for extraction

  • Consider scheduler throttling to reduce warehouse strain
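
A minimal sketch of an incremental, high-watermark extraction, which avoids re-scanning the whole table on every sync (the watermark store, driver, and column names are illustrative; CDC-based tools achieve the same effect from the warehouse's change feed):

```python
import json
import psycopg2  # illustrative warehouse driver

WATERMARK_FILE = "last_sync.json"  # hypothetical watermark store (often a small state table)

def load_watermark() -> str:
    try:
        with open(WATERMARK_FILE) as f:
            return json.load(f)["updated_at"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00"

def save_watermark(value: str) -> None:
    with open(WATERMARK_FILE, "w") as f:
        json.dump({"updated_at": value}, f)

conn = psycopg2.connect("host=warehouse dbname=analytics user=sync_user password=***")
watermark = load_watermark()

# Only rows that changed since the last sync are pulled from the warehouse
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT contact_id, lead_score, updated_at "
        "FROM analytics.lead_scores WHERE updated_at > %s ORDER BY updated_at",
        (watermark,),
    )
    rows = cur.fetchall()

if rows:
    # ... push `rows` to the operational system as in the earlier sketch ...
    save_watermark(rows[-1][2].isoformat())
```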


Security and Governance

Reverse ETL can push sensitive PII or operational data into external SaaS systems.

Think about:

  • GDPR/CCPA compliance

  • Data minimization

  • Row-level filtering

  • Fine-grained access control in target systems

Embedding governance into reverse ETL is essential.


Reverse ETL in the Lakehouse Era

Modern table formats (Delta/Iceberg/Hudi) are making it easier to build reverse ETL pipelines from lakehouses (not just warehouses). This allows:

  • Streaming activation

  • Near-real-time customer 360 updates

  • Model predictions delivered within minutes

Reverse ETL is becoming a natural extension of the lakehouse architecture.


Reverse ETL—despite the misleading name—enables the operationalization of analytics by syncing enriched warehouse data back into the business systems where it creates real value. Although the practice is well-established, modern tools and lakehouse technologies have turned it into a first-class data engineering pattern. Teams must remain cautious about feedback loops, governance, and reliability—because reverse ETL pipelines now sit directly in the critical path of operational decision-making.

