Apache Beam - unified batch and stream processing
Learning resources: https://beam.apache.org/get-started/resources/learning-resources/
Getting started from Apache Spark: https://beam.apache.org/get-started/from-spark/
Tour of Apache Beam: https://tour.beam.apache.org/
Programming guide (how to create data processing pipelines with Beam SDK): https://beam.apache.org/documentation/programming-guide/
Basic primitives:
PCollections - potentially very large (bounded or unbounded) datasets on which parallel transformations can be performed
PTransforms - operations applied to one or more PCollections, producing new PCollections as output
Types of transformations: element-wise, grouping, composite
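A minimal sketch with the Beam Python SDK showing all three styles in one pipeline (the `CountWords` class, step labels, and sample input are purely illustrative):

```python
# Minimal Beam (Python SDK) pipeline illustrating the three transform styles.
import apache_beam as beam


class CountWords(beam.PTransform):
    """Composite transform: bundles an element-wise step and a grouping step."""

    def expand(self, lines):
        return (
            lines
            # Element-wise: each input line produces zero or more words.
            | "SplitIntoWords" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            # Grouping: aggregates all values that share a key.
            | "SumPerWord" >> beam.CombinePerKey(sum)
        )


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["beam unifies batch", "beam unifies streaming"])  # a PCollection
        | "CountWords" >> CountWords()  # a PTransform in, a new PCollection out
        | "Print" >> beam.Map(print)
    )
```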
Performance in Apache Beam
(the following notes cover two performance optimization techniques: graph optimization and Bloom filters)
To implement exactly-once shuffle delivery, a catalog of record IDs is stored for each receiver key. For every record that arrives, Dataflow looks up the catalog of IDs already seen to determine whether this record is a duplicate.
Every output from step to step is checkpointed to storage to ensure that the generated record IDs are stable.
However, unless implemented carefully, this process would significantly degrade pipeline performance for customers by creating a huge increase in reads and writes. Thus, for exactly-once processing to be viable for Dataflow users, that I/O has to be reduced, in particular by preventing I/O on every record. Dataflow achieves this goal via two key techniques: graph optimization and Bloom Filters.
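A toy sketch of the Bloom-filter idea behind that I/O reduction. This is not Dataflow's actual code: the in-memory `catalog` set stands in for the persistent per-key store of record IDs, and the hand-rolled `BloomFilter` is only there to show why a negative answer ("definitely not seen") lets the expensive catalog read be skipped for most records:

```python
# Toy illustration of Bloom-filter-based dedup; not Dataflow's implementation.
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, record_id):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{record_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, record_id):
        for pos in self._positions(record_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, record_id):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(record_id))


bloom = BloomFilter()
catalog = set()  # stand-in for the persistent record-ID catalog (expensive to read)


def is_duplicate(record_id):
    # Negative Bloom answer => record is definitely new: skip the catalog read.
    if not bloom.might_contain(record_id):
        bloom.add(record_id)
        catalog.add(record_id)
        return False
    # Possible hit => only now pay for the expensive catalog lookup.
    seen = record_id in catalog
    if not seen:
        bloom.add(record_id)
        catalog.add(record_id)
    return seen
```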
Using Beam for Lakehouse architecture
Beam is a programming model, not tied to Dataflow, and it can absolutely write to or integrate with Delta/Iceberg/Hudi.
But there are nuances.
👀 Do people commonly use Beam with Lakehouses?
Not often — and that’s for ecosystem reasons, not technical ones.
Why Beam isn’t the typical choice:
No native, official connectors for Delta/Iceberg/Hudi
Flink and Spark have first-class integrations
Beam usually requires custom sinks or intermediate formats
Most Beam deployments run on Google Dataflow, and Google’s default storage is BigQuery, not Lakehouse tables.
Lakehouse momentum is outside Google Cloud
Delta → Databricks, open-source ecosystem
Iceberg → Netflix, Apple, AWS, Snowflake, Flink community
Hudi → Uber, AWS EMR, Flink
Beam simply isn’t the primary engine in these ecosystems; it is most naturally integrated with Google BigQuery. But again, technically it works, as the sketch below shows.
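The "intermediate formats" route usually means writing plain Parquet from Beam and letting the table format's own tooling turn those files into Delta/Iceberg/Hudi table data. A minimal sketch with the Beam Python SDK; the bucket path, schema, and sample records are hypothetical:

```python
# Write Parquet files from Beam; a separate Delta/Iceberg/Hudi job or
# add-files step would then register them as table data. Sketch only.
import apache_beam as beam
import pyarrow as pa

schema = pa.schema([
    ("event_id", pa.string()),
    ("amount", pa.float64()),
])

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([
            {"event_id": "a1", "amount": 10.5},
            {"event_id": "b2", "amount": 3.0},
        ])
        | "WriteParquet" >> beam.io.WriteToParquet(
            file_path_prefix="gs://my-bucket/lakehouse/events/part",  # hypothetical bucket
            schema=schema,
            file_name_suffix=".parquet",
        )
    )
```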