# Apache Beam - unified batch and stream processing

Learning resources: <https://beam.apache.org/get-started/resources/learning-resources/>

Getting started from Apache Spark: <https://beam.apache.org/get-started/from-spark/>

Tour of Apache Beam: <https://tour.beam.apache.org/>

Programming guide (how to create data processing pipelines with Beam SDK): <https://beam.apache.org/documentation/programming-guide/>

***

Basic primitives:

* PCollections - potentially massive (though not necessarily) datasets on which parallel transformations can be performed
* PTransforms - operations applied to PCollections that produce new PCollections as output
  * Types of transformations: element-wise, grouping, composite (see the sketch below)
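
A minimal sketch of these primitives using the Python SDK (assumes `apache-beam` is installed; the data is illustrative):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # beam.Create materializes an in-memory PCollection
    lines = p | "Create" >> beam.Create(["apache beam", "batch and stream"])
    # element-wise transform: each line fans out into its words
    words = lines | "Split" >> beam.FlatMap(str.split)
    # grouping transform: count occurrences per distinct word
    counts = words | "Count" >> beam.combiners.Count.PerElement()
    counts | "Print" >> beam.Map(print)
```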

***

### Performance in Apache Beam

(notes on two performance-optimization techniques used in Dataflow: graph optimization and Bloom filters)

To implement exactly-once shuffle delivery, a catalog of record IDs is stored for each receiver key. For every record that arrives, Dataflow looks up the catalog of IDs already seen to determine whether this record is a duplicate.
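
A conceptual sketch of that per-key catalog lookup (plain Python, not Dataflow's actual implementation; it assumes record IDs are stable across retries, which the checkpointing described below guarantees):

```python
# Per-key catalog of record IDs already delivered (conceptual, in-memory;
# Dataflow persists this state, it is not a plain dict).
seen_ids_per_key: dict[str, set[str]] = {}

def deliver(key: str, record_id: str) -> bool:
    """Return True if the record is new and should be processed,
    False if it is a duplicate redelivery (e.g. after a retry)."""
    catalog = seen_ids_per_key.setdefault(key, set())
    if record_id in catalog:
        return False          # already seen: drop the duplicate
    catalog.add(record_id)    # remember the ID for future lookups
    return True
```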

Every output from step to step is checkpointed to storage to ensure that the generated record IDs are stable.

However, unless implemented carefully, this process would significantly degrade pipeline performance for customers by creating a huge increase in reads and writes. Thus, for exactly-once processing to be viable for Dataflow users, that I/O has to be reduced, in particular by preventing I/O on every record. Dataflow achieves this goal via two key techniques: graph optimization and Bloom filters.
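
A minimal, illustrative Bloom filter sketch (plain Python, not Dataflow's code) showing how the expensive catalog read can be skipped for records the filter has definitely never seen:

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter; real systems size it from the
    expected element count and target false-positive rate."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # derive several bit positions per item from independent hashes
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "possibly added"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter()
catalog: set[str] = set()  # stands in for the persisted record-ID catalog

def is_duplicate(record_id: str) -> bool:
    if bloom.might_contain(record_id):
        # possible duplicate: only now pay for the catalog lookup
        if record_id in catalog:
            return True
    # definitely new (or a Bloom false positive not found in the catalog)
    bloom.add(record_id)
    catalog.add(record_id)
    return False
```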

***

### Using Beam for Lakehouse architecture

Beam is a *programming model* with pluggable runners, not tied to Dataflow, and it can absolutely write to or integrate with Delta/Iceberg/Hudi.

But there are nuances.

#### 👀 Do people commonly use Beam with Lakehouses?

Not often — and that’s for *ecosystem reasons*, not technical ones.

#### Why Beam isn’t the typical choice:

1. **No native, official connectors** for Delta/Iceberg/Hudi
   * Flink and Spark have first-class integrations
   * Beam usually requires custom sinks or intermediate formats (a common workaround is sketched at the end of this section)
2. **Most Beam deployments run on Google Dataflow**,\
   and Google’s default storage is BigQuery, not Lakehouse tables.
3. **Lakehouse momentum is outside Google Cloud**
   * Delta → Databricks, open-source ecosystem
   * Iceberg → Netflix, Apple, AWS, Snowflake, Flink community
   * Hudi → Uber, AWS EMR, Flink

Beam simply isn’t the primary engine in these ecosystems, it's well integrated with Google Bigquery.\
But again — *technically it works.*
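
A hedged illustration of the "intermediate formats" route (not an official connector): Beam writes plain Parquet files, and a separate job then commits those files into the Delta/Iceberg/Hudi table using that format's own tooling. The bucket path and schema below are hypothetical:

```python
import apache_beam as beam
import pyarrow as pa

# Arrow schema for the Parquet files (illustrative columns)
schema = pa.schema([("user_id", pa.string()), ("amount", pa.float64())])

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([{"user_id": "u1", "amount": 9.99},
                                   {"user_id": "u2", "amount": 4.50}])
        | "WriteParquet" >> beam.io.WriteToParquet(
            file_path_prefix="gs://my-bucket/lake/sales/part",  # hypothetical bucket
            schema=schema,
        )
    )
# A follow-up step outside Beam (e.g. Spark or the table format's own API)
# registers the new files in the lakehouse table's metadata/commit log.
```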

***
