Batch architectures
Architectures that involve batch data processing
Traditional ETL Architecture in Modern Data Engineering
Batch data processing has evolved significantly, but traditional ETL (Extract → Transform → Load) remains one of the foundational patterns in data engineering. Even as cloud warehouses and ELT workflows grow in popularity, ETL is still widely used when data quality, structure, and governance are top priorities. Here’s a clear breakdown of what ETL really is, when it makes sense, and how it works today.
What ETL Is
ETL is a workflow where data is:
Extracted from source systems
Transformed before it reaches the warehouse
Loaded into a curated, analytics-ready schema
Unlike ELT, all heavy lifting—cleaning, joining, validating, enriching—happens outside the warehouse.
Typical Technology Stack
On-premises: Informatica, Oracle Data Integrator (ODI), Pentaho
Cloud equivalents: AWS Glue, Google Dataflow (batch), Azure Data Factory Mapping Data Flows
When ETL Makes Sense
ETL is used when:
Schemas are stable and governed
You need a clean, curated data warehouse (star/snowflake)
Transformations are complex, business-logic-driven, or require strict validation
Downstream users expect consistent, high-quality BI datasets
Strengths
Produces a well-modeled, high-trust warehouse
Enforces data quality early in the pipeline
Optimized for reporting and analytics teams
Limitations
Rigid and slow to change (schema updates can be painful)
Hard to reprocess historical data if raw snapshots aren’t stored
Transformation logic can become tightly coupled to ETL tooling
How ETL Actually Runs
A traditional ETL flow looks like: source systems → extract with an ETL tool or job → transform on external compute (ETL server, Spark cluster, Glue job) → load curated tables into the warehouse.
The key idea: The warehouse only receives processed, ready-to-use tables.
Transformations don’t happen “inside” the warehouse—they happen in external compute engines (ETL servers, Spark clusters, Glue jobs, etc.).
Does ETL Require Something Like S3?
Before the Cloud
Classic ETL tools operated without object storage. They typically used:
In-memory transformations
Temporary staging databases
ETL servers dedicated to data processing
The flow was simple: source systems → ETL server (in-memory transforms or a temporary staging database) → data warehouse.
In the Cloud Era
ETL often incorporates object storage like S3, ADLS, or GCS as a staging layer. A typical pattern today is: extract from sources → land raw files in object storage → transform with external compute (Spark, Glue, Data Factory) → load curated tables into the warehouse.
But the defining characteristic remains the same:
✔ Transformations happen before loading into the warehouse
✔ The warehouse holds only curated, processed data
How Transformations Occur in ETL Pipelines
1. In-Flight Transformations
Data is transformed as it moves through the pipeline.
Common tools: AWS Glue, Matillion, Spark batch jobs, Airbyte with transforms, Talend.
Example:
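A minimal sketch, assuming a Spark batch job with a JDBC source and a JDBC-reachable warehouse; the connection URLs, table names, and columns are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_in_flight").getOrCreate()

# Extract: read directly from the operational source (illustrative JDBC URL/table)
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Transform in flight: clean, filter, and aggregate outside the warehouse
daily_revenue = (
    orders.filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Load: only the curated result reaches the warehouse (illustrative JDBC target)
(
    daily_revenue.write.format("jdbc")
    .option("url", "jdbc:redshift://warehouse:5439/analytics")
    .option("dbtable", "curated.daily_revenue")
    .option("user", "loader")
    .option("password", "***")
    .mode("overwrite")
    .save()
)
```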
2. Staged Transformations
For heavy workloads, raw data may be landed first, transformed, then written back before loading.
Example: raw extracts land in S3 → a Spark or Glue job cleans and joins them and writes processed Parquet back to staging → the warehouse bulk-loads (e.g., COPY) only the curated output.
This still counts as ETL because the warehouse only receives data post-transformation.
ETL vs. ELT: The Key Difference
Although the patterns sound similar, the distinction is clear:
ETL
Compute happens outside the warehouse
Warehouse stores only transformed data
Requires external processing tools (Spark, Glue, Matillion, etc.)
ELT
Raw data is loaded first
Transformations run inside the warehouse
SQL-based transformations (typically via dbt)
A simple rule of thumb:
If the warehouse does the transformations, it’s ELT. If external compute does the transformations, it’s ETL.
ELT Architecture (Extract → Load → Transform)
If ETL is the traditional, rule-driven backbone of enterprise analytics, then ELT (Extract → Load → Transform) is its cloud-native successor — designed for agility, scale, and iteration. As cloud warehouses became powerful enough to handle massive data transformations internally, the industry began shifting from “transform first, load later” to “load first, transform inside.”
What ELT Really Means
In ELT, data pipelines follow a simple pattern:
Extract → pull the data from source systems
Load → store it raw, without major preprocessing
Transform → reshape it directly inside the warehouse or lakehouse
This design flips the old ETL model: instead of doing all transformation before the warehouse, you bring data as-is into a storage or compute engine that can scale elastically — usually Snowflake, BigQuery, Databricks, Redshift, or a lakehouse table format.
The warehouse becomes both the storage and the transformation engine.
Why ELT Became the Default for Modern Data Teams
The rise of ELT coincides directly with two industry changes:
Cloud warehouses became fast and cheap at compute
Data teams wanted flexibility instead of rigid schemas
Agile analytics teams now push raw data into the warehouse as quickly as possible, then refine it collaboratively using SQL and tools like dbt or SQLMesh.
This approach supports:
Experimentation (model your data however you want, whenever you want)
Rapid iteration (change your dbt model, re-run it, done)
Re-processing (because you kept the raw layer)
Lineage and versioning (dbt, Delta Lake, Iceberg, etc.)
In ELT, the warehouse is no longer the final stop — it’s the heart of the transformation process.
The Typical ELT Stack
A “classic” ELT stack looks like:
Warehouse: Snowflake, BigQuery, Databricks SQL, Redshift
Transformation Layer: dbt Core or dbt Cloud
Orchestration: Airflow, Dagster, Prefect, or dbt Cloud scheduling
Ingestion: Fivetran, Airbyte, Meltano, Kafka Connect, or custom ingestion jobs
Storage (optional): S3 / ADLS / GCS as raw lake layers
Once raw data lands in the warehouse or lake, all structure and business logic are applied via SQL transformations in dbt.
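A minimal sketch of the ELT pattern, assuming raw tables are already loaded into the warehouse and a SQLAlchemy engine is available (the Snowflake-style URL and table names are illustrative; any warehouse with a SQLAlchemy dialect works the same way):

```python
from sqlalchemy import create_engine, text

# Illustrative connection URL; the warehouse, not this Python process, runs the SQL
engine = create_engine("snowflake://user:pass@account/analytics/raw?warehouse=transforming_wh")

# The transformation is plain SQL executed *inside* the warehouse:
# the client only submits the statement, the warehouse does the compute.
transform_sql = text("""
    CREATE OR REPLACE TABLE analytics.marts.daily_revenue AS
    SELECT
        CAST(created_at AS DATE) AS order_date,
        SUM(amount)              AS revenue
    FROM analytics.raw.orders
    WHERE status = 'COMPLETED'
    GROUP BY 1
""")

with engine.begin() as conn:
    conn.execute(transform_sql)
```

In practice teams rarely hand-roll these statements; dbt models generate and run equivalent SQL with dependency ordering, testing, and documentation on top.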
ELT Limitations
ELT isn’t perfect — its tradeoffs reflect its design philosophy.
1. You Need a Strong Warehouse or Lakehouse
ELT assumes your warehouse can handle:
Large joins
Complex SQL transformations
Expensive compute workloads
Not all teams or budgets can support that.
2. Compute Cost Can Grow
Transformations move from external ETL servers → into your warehouse billing model.
Heavy dbt models, large fact tables, and multi-step DAGs can drive up costs if left unmanaged.
Best practices like clustering, partitioning, incremental models, and query optimization become essential.
Lambda Architecture: What It Was, Why It Existed, and Why It Faded
For years, Lambda Architecture was the go-to pattern for systems needing both real-time insights and accurate historical recomputation. It emerged when streaming systems weren’t powerful enough to guarantee correctness on their own. Today, its influence remains, but the architecture itself is mostly obsolete.
What Lambda Architecture Tried to Solve
Early big data stacks had a problem:
Batch systems (Hadoop, early Spark) were accurate but slow.
Streaming systems (Storm, early Kafka Streams) were fast but inaccurate and lacked exactly-once guarantees.
Lambda Architecture solved this with a two-path model:
Batch Layer
Stores all historical data (usually in HDFS or S3).
Periodically recomputes results from scratch.
Ensures accuracy, removes drift and errors.
Speed Layer
Consumes streaming data in real time (Kafka + Storm/Spark Streaming/Flink).
Produces low-latency, approximate results until the batch layer catches up.
Serving Layer
Merges:
Real-time output from the speed layer
Batch-computed output from the batch layer
Users always get the freshest possible view.
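A minimal sketch of how a serving layer might merge the two outputs, assuming both layers write tables with the same schema and the batch layer's latest processed hour acts as the cutoff; paths and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda_serving_merge").getOrCreate()

# Batch layer output: accurate, recomputed periodically, complete up to the cutoff
batch_view = spark.read.parquet("s3://analytics/batch_view/page_counts/")

# Speed layer output: low-latency, approximate, covers events after the cutoff
speed_view = spark.read.parquet("s3://analytics/speed_view/page_counts/")

# The batch layer's high-water mark: everything at or before it is authoritative
batch_cutoff = batch_view.agg(F.max("event_hour")).first()[0]

# Serve the batch results up to the cutoff, plus real-time results beyond it
merged_view = batch_view.unionByName(
    speed_view.filter(F.col("event_hour") > batch_cutoff)
)

merged_view.write.mode("overwrite").parquet("s3://analytics/serving/page_counts/")
```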
Where It Was Used
Fraud detection (fast decisions + later corrections)
Real-time dashboards with reconciliation
IoT systems before modern stream processors matured
Tech examples:
Batch: Spark batch on S3 or Hadoop
Speed: Kafka → Spark Streaming / Flink
Serving: Cassandra / HBase / precomputed tables
Why It Faded
Lambda Architecture is infamous for one thing:
You maintain two entire pipelines.
Two codebases
Two engines
Two data models
Two failure modes
Two sets of bugs
And you have to synchronize them.
Modern systems made this unnecessary:
Apache Flink, Kafka Streams, Spark Structured Streaming now provide:
Exactly-once semantics
Backfills
Stateful processing
Reprocessing and replay
Lakehouse architectures (Databricks, Snowflake with Iceberg, BigQuery) enable:
Low-latency ingestion
Incremental processing
Reliable batch + streaming unification
This gave rise to the Kappa Architecture, where everything is streaming-first.
Most new architectures use:
Kappa Architecture
Delta Live Tables / Lakehouse streaming
Unified batch+streaming engines
Kappa Architecture: Streaming-First Data Systems Without the Complexity
Kappa Architecture emerged as a reaction to Lambda Architecture’s biggest flaw: two separate pipelines for the same data. Modern stream processors matured enough to make that duplication unnecessary. The idea behind Kappa is simple but powerful:
👉 Everything is a stream — even batch.
Instead of maintaining parallel batch and speed layers, Kappa uses a single streaming pipeline that can replay historical data whenever batch-like processing is needed.
The Core Idea
At the center of Kappa Architecture is the append-only event log — usually Kafka or a similar system. All data flows through it, and downstream processing engines treat the log as the single source of truth.
Here’s the simplified flow (a code sketch follows after the list):
Streaming Engine
Flink, Kafka Streams, Spark Structured Streaming
Applies transformations continuously
Maintains state with exactly-once guarantees
Reprocessing
No separate batch layer
To recompute historical results, simply rewind and replay the log
Same code, same pipeline → no drift, no duplication
Storage / Serving
Lakehouse tables (Delta, Iceberg, Hudi)
Online stores, materialized views, analytic tables
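A minimal sketch of this single pipeline, assuming Spark Structured Streaming with a Kafka source and a Delta sink (the spark-sql-kafka and delta-spark packages must be configured); the topic, paths, and schema are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kappa_pipeline").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# The append-only log is the source of truth; "earliest" plus a fresh checkpoint
# replays the full retained history through the exact same code path
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payments")
    .option("startingOffsets", "earliest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Continuous, stateful transformation; the Delta sink plus checkpointing gives
# end-to-end exactly-once delivery of the aggregated results
hourly = (
    events.withWatermark("event_time", "1 hour")
    .groupBy(F.window("event_time", "1 hour"), "user_id")
    .agg(F.sum("amount").alias("total_amount"))
)

(
    hourly.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/checkpoints/payments_hourly/")
    .start("s3://lake/silver/payments_hourly/")
)
```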
When Kappa Architecture Works Best
Real-time systems where streaming is the default
Event-driven pipelines
IoT, finance, clickstreams, telemetry, fraud detection
Teams that want a single code path rather than two
It especially shines when:
You want to replay historical data easily
You do frequent schema or pipeline changes
Your system already depends on Kafka as a source of truth
Pros
✔ One code path
No separate batch vs streaming logic. Less maintenance, fewer bugs, easier onboarding.
✔ Reprocessing is trivial
Just reset offsets or re-read the log. Ideal for evolving ML features, metrics, and analytics.
✔ Scales naturally
Streaming engines scale horizontally and handle high-throughput event data.
✔ Fits lakehouses perfectly
Modern Delta/Iceberg/Hudi tables integrate cleanly with streaming ingestion.
Cons
✖ Not great for massive one-off batch jobs
If you need to crunch 10 years of archives once, a big Spark batch job is still simpler.
✖ Dependent on event log retention
If Kafka only retains 7 days of data, you simply cannot reprocess older history. A classic workaround is tiered storage or lakehouse-backed Kafka.
✖ Requires event-first thinking
Systems not designed around events may need to be restructured.
Lakehouse Architecture: The Modern Unified Data Platform
Lakehouse architecture emerged to fix a long-standing problem: data warehouses are great for analytics but expensive and rigid, while data lakes are flexible and cheap but notoriously messy. The Lakehouse combines both worlds — the low-cost storage of a data lake with the management and performance features of a warehouse.
It’s now the dominant architecture behind modern data platforms, serving both batch and streaming workloads.
Why Lakehouses Exist
Historically:
Data warehouses enforced schema, ACID guarantees, governance, and fast SQL — but storing raw data was expensive.
Data lakes allowed dumping raw files cheaply — but lacked schema, transactions, and reliability, making analytics unreliable.
A Lakehouse solves both by adding a table format on top of a data lake that brings ACID guarantees and schema management.
The Core Concept
A Lakehouse is built around open table formats running on cheap object storage (S3, ADLS, GCS):
🔑 Table Formats
Delta Lake (Databricks)
Apache Iceberg
Apache Hudi
These formats introduce:
ACID transactions
Schema enforcement + evolution
Time travel
Partitioning & indexing
Compaction / clustering
Streaming + batch unification
This turns a simple data lake into something that behaves like a warehouse, but at lake scale and cost.
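A minimal sketch of what a table format adds on top of plain files, using Delta Lake with PySpark as one example (a Delta-enabled SparkSession via delta-spark is assumed; the path and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse_table_format").getOrCreate()

path = "s3://lake/silver/customers/"

# First commit (version 0): creates the table with an enforced schema
initial = spark.createDataFrame([("c-1", "ACME")], ["customer_id", "name"])
initial.write.format("delta").mode("overwrite").save(path)

# Second commit (version 1): atomic append; readers never see partial files,
# and a write with an incompatible schema is rejected unless evolution is enabled
update = spark.createDataFrame([("c-2", "Globex")], ["customer_id", "name"])
update.write.format("delta").mode("append").save(path)

# Time travel: read the table exactly as it was at version 0 (before the append)
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Latest committed state reflects all transactions
spark.read.format("delta").load(path).show()
```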
How a Lakehouse Works
1. Raw Layer
Raw files stored in object storage
Often partitioned by date or source
No transformation
2. Bronze / Silver / Gold Layers (most common pattern)
Bronze: raw ingested data
Silver: cleaned, conformed tables
Gold: business-ready aggregates and marts
These layers live as Delta/Iceberg/Hudi tables, enabling incremental updates and safe merges.
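A minimal sketch of such an incremental merge from Bronze into Silver, using the Delta Lake Python API as one example (delta-spark assumed; table paths, columns, and keys are illustrative):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

# Bronze: raw ingested records, possibly containing nulls, duplicates, and late updates
bronze = spark.read.format("delta").load("s3://lake/bronze/orders/")

# Light cleaning and conforming on the way to Silver
updates = (
    bronze.filter(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("created_at"))
    .dropDuplicates(["order_id"])
)

# Silver: an atomic upsert (MERGE) keeps one clean row per key
silver = DeltaTable.forPath(spark, "s3://lake/silver/orders/")
(
    silver.alias("s")
    .merge(updates.alias("u"), "s.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```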
3. Compute Engines
A Lakehouse supports multiple engines at the same time:
Spark
Flink
Presto/Trino
Snowflake (Iceberg support)
Data warehouse engines querying Lakehouse tables via connectors
This separation of storage and compute is key.
Benefits of a Lakehouse
✔ 1. Unified batch and streaming
Delta, Iceberg, and Hudi support incremental ingestion, allowing the same table to accept:
Streaming upserts from Kafka
Batch loads from files
Reprocessing / replay
✔ 2. Open, vendor-neutral storage
Unlike warehouses, all data sits in:
S3, ADLS, or GCS
In open file formats (Parquet)
With open metadata (table formats)
✔ 3. ACID reliability
No more “dirty reads” or “partial files.” Transactions guarantee atomic writes.
✔ 4. Lower cost than warehouses
You pay for compute only when engines run, and object storage is far cheaper than warehouse-managed storage.
✔ 5. Supports multiple use cases
BI and SQL analytics
Real-time pipelines
ML and feature stores
CDC ingestion
Data sharing
All from the same source of truth.
Drawbacks / Trade-offs
✖ More operational overhead than a SaaS warehouse
You manage ingestion, compaction, table maintenance.
✖ Compute engine choices can be overwhelming
Spark? Flink? Trino? Snowflake as query engine?
✖ Not “fully managed” (unless on Databricks)
Open Lakehouse deployments require DevOps/Platform Engineering.
When to Use a Lakehouse
Use it when you need:
Large-scale analytics (TB–PB)
Unified streaming + batch pipelines
Low-cost raw data storage
Machine learning workflows
Multi-engine flexibility
Open data that’s not tied to a single vendor
If your world is real-time or event-driven (Kafka, Flink), a Lakehouse complements Kappa architecture extremely well.
Reverse ETL
Reverse ETL refers to the process of taking processed, enriched, or modeled data from an OLAP system—typically a data warehouse or lakehouse—and loading it back into operational source systems such as CRMs, marketing platforms, SaaS tools, or internal applications.
Although the term became popular in the late 2010s, the practice itself has existed for much longer. Data engineers have long built ad-hoc pipelines that push analytical data back into operational systems. What changed is the recognition of this process as a formal category and the rise of dedicated tools that automate it.
Why Reverse ETL Matters
Reverse ETL is about data activation—getting insights into the systems where end users work, not just into dashboards.
Example:
CRM data lands in the warehouse → a model generates lead scores → the sales team needs those scores inside Salesforce/HubSpot, not in a BI dashboard or a CSV export. Reverse ETL reduces this friction by embedding insights directly into operational workflows.
Because of this, reverse ETL is becoming a core responsibility of data teams, similar to building forward ETL pipelines.
How Reverse ETL Works
Reverse ETL pipelines extract prepared data from the warehouse, transform it if necessary, and load it into operational systems. Many modern tools (open-source or commercial) provide connectors, transformations, and scheduling out of the box.
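A minimal sketch of a hand-rolled reverse ETL sync, assuming a SQLAlchemy connection to the warehouse and a hypothetical CRM REST endpoint (the URL, auth header, and field names are illustrative and not a real vendor API):

```python
import requests
from sqlalchemy import create_engine, text

# Extract prepared data from the warehouse (illustrative connection URL and table)
engine = create_engine("snowflake://user:pass@account/analytics/marts")
with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT crm_contact_id, lead_score FROM marts.lead_scores")
    ).fetchall()

# Load into the operational system (hypothetical CRM endpoint and token)
CRM_URL = "https://crm.example.com/api/contacts"
HEADERS = {"Authorization": "Bearer <token>"}

for contact_id, lead_score in rows:
    resp = requests.patch(
        f"{CRM_URL}/{contact_id}",
        json={"lead_score": lead_score},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()  # fail loudly: this pipeline is operational, not "just analytics"
```

Dedicated reverse ETL tools replace this loop with managed connectors, retries, rate limiting, and field mapping.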
However, the vendor landscape is evolving rapidly—mergers, acquisitions, and absorption into broader data platforms are expected.
Risks and Considerations
Reverse ETL creates feedback loops between analytics and operational systems. Example:
Pull Google Ads data → model bids → push new bids back into Google Ads → repeat. An error in the model can cause runaway spending or harmful automated behavior.
To avoid these issues:
Implement monitoring
Set rate limits and guardrails
Use human-in-the-loop approvals for sensitive workflows
Track lineage and metadata
Reverse ETL pipelines must be treated as operational systems, not “just data workflows.”
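A minimal sketch of a pre-push guardrail for the bidding example above; the thresholds, field names, and review queue are illustrative, not a prescribed policy:

```python
MAX_BID_CHANGE_PCT = 0.20  # illustrative: never move a bid more than 20% in one run
MAX_BID = 15.00            # illustrative: absolute ceiling for any single bid


def validate_bid_updates(current_bids: dict, proposed_bids: dict):
    """Split proposed bid changes into auto-approved pushes and human-review items."""
    approved, needs_review = [], []
    for keyword, new_bid in proposed_bids.items():
        old_bid = current_bids.get(keyword)
        too_high = new_bid > MAX_BID
        too_big_a_jump = (
            old_bid is not None
            and old_bid > 0
            and abs(new_bid - old_bid) / old_bid > MAX_BID_CHANGE_PCT
        )
        if too_high or too_big_a_jump:
            needs_review.append((keyword, old_bid, new_bid))  # human-in-the-loop approval
        else:
            approved.append((keyword, new_bid))
    return approved, needs_review


# Only `approved` is pushed back to the ad platform; `needs_review`
# goes to an approval queue, breaking the automated feedback loop.
```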
Additional Insights for Modern Data Engineering
Reverse ETL vs. Operational Data Sync
Reverse ETL is distinct from:
Operational data replication (e.g., syncing source → source)
CDP activation (customer profiles pushed to channels)
API-based application integrations
Reverse ETL specifically activates analytical data.
Reverse ETL and Data Contracts
Because data is pushed into operational systems:
Schema stability becomes critical
Operational systems often require strict APIs
Failures can cause downstream breakage for business users
Reverse ETL pipelines benefit from contracts and strict SLAs.
Reverse ETL and Warehouse Performance
Reverse ETL often triggers:
Frequent warehouse queries
Large data extractions
Incremental sync logic
To avoid performance degradation:
Use CDC-style incremental syncs (sketched after this list)
Use materialized views for extraction
Consider scheduler throttling to reduce warehouse strain
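A minimal sketch of the incremental (high-watermark) sync mentioned above; the state file, connection URL, and column names are illustrative, and push_to_crm is a hypothetical helper standing in for the actual API push:

```python
import json
from pathlib import Path
from sqlalchemy import create_engine, text

STATE_FILE = Path("reverse_etl_state.json")  # illustrative watermark store
engine = create_engine("snowflake://user:pass@account/analytics/marts")


def load_watermark() -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_synced_at"]
    return "1970-01-01 00:00:00"


def sync_incrementally():
    watermark = load_watermark()
    with engine.connect() as conn:
        # Only extract rows changed since the last successful sync,
        # instead of re-reading the whole table on every run
        rows = conn.execute(
            text(
                "SELECT crm_contact_id, lead_score, updated_at "
                "FROM marts.lead_scores WHERE updated_at > :wm"
            ),
            {"wm": watermark},
        ).fetchall()

    if rows:
        push_to_crm(rows)  # hypothetical helper: the actual API push
        latest = max(str(row.updated_at) for row in rows)
        STATE_FILE.write_text(json.dumps({"last_synced_at": latest}))
```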
Security and Governance
Reverse ETL can push sensitive PII or operational data into external SaaS systems.
Think about:
GDPR/CCPA compliance
Data minimization
Row-level filtering
Fine-grained access control in target systems
Embedding governance into reverse ETL is essential.
Reverse ETL in the Lakehouse Era
Modern table formats (Delta/Iceberg/Hudi) are making it easier to build reverse ETL pipelines from lakehouses (not just warehouses). This allows:
Streaming activation
Near-real-time customer 360 updates
Model predictions delivered within minutes
Reverse ETL is becoming a natural extension of the lakehouse architecture.
Reverse ETL—despite the misleading name—enables the operationalization of analytics by syncing enriched warehouse data back into the business systems where it creates real value. Although the practice is well-established, modern tools and lakehouse technologies have turned it into a first-class data engineering pattern. Teams must remain cautious about feedback loops, governance, and reliability—because reverse ETL pipelines now sit directly in the critical path of operational decision-making.