Apache Beam - unified batch and stream processing
Learning resources: https://beam.apache.org/get-started/resources/learning-resources/
Getting started from Apache Spark: https://beam.apache.org/get-started/from-spark/
Tour of Apache Beam: https://tour.beam.apache.org/
Programming guide (how to create data processing pipelines with Beam SDK): https://beam.apache.org/documentation/programming-guide/
Basic primitives:
PCollections - potentially very large (bounded or unbounded) datasets on which parallel transformations can be performed
PTransforms - operations applied to one or more PCollections, producing new PCollections as output
Types of transformations: element-wise, grouping, composite
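A minimal sketch with the Beam Python SDK showing all three styles in one pipeline (the `CountWords` class, step labels, and sample input are purely illustrative):

```python
# Minimal Beam (Python SDK) pipeline illustrating the three transform styles.
import apache_beam as beam


class CountWords(beam.PTransform):
    """Composite transform: bundles an element-wise step and a grouping step."""

    def expand(self, lines):
        return (
            lines
            # Element-wise: each input line produces zero or more words.
            | "SplitIntoWords" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            # Grouping: aggregates all values that share a key.
            | "SumPerWord" >> beam.CombinePerKey(sum)
        )


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["beam unifies batch", "beam unifies streaming"])  # a PCollection
        | "CountWords" >> CountWords()  # a PTransform in, a new PCollection out
        | "Print" >> beam.Map(print)
    )
```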
Performance in Apache Beam
(the following notes cover two performance optimization techniques: graph optimization and Bloom filters)
To implement exactly-once shuffle delivery, a catalog of record IDs is stored for each receiver key. For every record that arrives, Dataflow looks up the catalog of IDs already seen to determine whether this record is a duplicate.
Every output from step to step is checkpointed to storage to ensure that the generated record IDs are stable.
However, unless implemented carefully, this process would significantly degrade pipeline performance for customers by creating a huge increase in reads and writes. Thus, for exactly-once processing to be viable for Dataflow users, that I/O has to be reduced, in particular by preventing I/O on every record. Dataflow achieves this goal via two key techniques: graph optimization and Bloom Filters.
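A toy sketch of the Bloom-filter idea behind that I/O reduction. This is not Dataflow's actual code: the in-memory `catalog` set stands in for the persistent per-key store of record IDs, and the hand-rolled `BloomFilter` is only there to show why a negative answer ("definitely not seen") lets the expensive catalog read be skipped for most records:

```python
# Toy illustration of Bloom-filter-based dedup; not Dataflow's implementation.
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, record_id):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{record_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, record_id):
        for pos in self._positions(record_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, record_id):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(record_id))


bloom = BloomFilter()
catalog = set()  # stand-in for the persistent record-ID catalog (expensive to read)


def is_duplicate(record_id):
    # Negative Bloom answer => record is definitely new: skip the catalog read.
    if not bloom.might_contain(record_id):
        bloom.add(record_id)
        catalog.add(record_id)
        return False
    # Possible hit => only now pay for the expensive catalog lookup.
    seen = record_id in catalog
    if not seen:
        bloom.add(record_id)
        catalog.add(record_id)
    return seen
```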
Using Beam for Lakehouse architecture
Beam is a programming model, not tied to Dataflow, and it can absolutely write to or integrate with Delta/Iceberg/Hudi.
But there are nuances.
👀 Do people commonly use Beam with Lakehouses?
Not often — and that’s for ecosystem reasons, not technical ones.
Why Beam isn’t the typical choice:
No native, official connectors for Delta/Iceberg/Hudi
Flink and Spark have first-class integrations
Beam usually requires custom sinks or intermediate formats
Most Beam deployments run on Google Dataflow, and Google’s default storage is BigQuery, not Lakehouse tables.
Lakehouse momentum is outside Google Cloud
Delta → Databricks, open-source ecosystem
Iceberg → Netflix, Apple, AWS, Snowflake, Flink community
Hudi → Uber, AWS EMR, Flink
Beam simply isn’t the primary engine in these ecosystems; it is most naturally integrated with Google BigQuery. But again, technically it works, as the sketch below shows.
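The "intermediate formats" route usually means writing plain Parquet from Beam and letting the table format's own tooling turn those files into Delta/Iceberg/Hudi table data. A minimal sketch with the Beam Python SDK; the bucket path, schema, and sample records are hypothetical:

```python
# Write Parquet files from Beam; a separate Delta/Iceberg/Hudi job or
# add-files step would then register them as table data. Sketch only.
import apache_beam as beam
import pyarrow as pa

schema = pa.schema([
    ("event_id", pa.string()),
    ("amount", pa.float64()),
])

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([
            {"event_id": "a1", "amount": 10.5},
            {"event_id": "b2", "amount": 3.0},
        ])
        | "WriteParquet" >> beam.io.WriteToParquet(
            file_path_prefix="gs://my-bucket/lakehouse/events/part",  # hypothetical bucket
            schema=schema,
            file_name_suffix=".parquet",
        )
    )
```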