Dataflow


Dataflow (Apache Beam)

  • What it is: Google Cloud's managed service for running Apache Beam pipelines. In this architecture it is the processing engine that reads from Pub/Sub.

  • The Philosophy: "Unified Batch and Stream." You write your pipeline once with the Apache Beam SDK, then run that same code against an unbounded streaming source (Pub/Sub) or a bounded batch source (files in Cloud Storage).

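  • Data Engineer Note: This is Google's pride and joy. Its standout feature is how it handles "late data" (e.g., a mobile phone loses signal and sends data 1 hour late). Beam groups records by event time (when something happened) rather than arrival time, and uses windows, watermarks, and allowed lateness to decide whether a late record can still be counted in the window it belongs to.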

  • AWS Equivalent: Amazon Managed Service for Apache Flink (for streaming) or AWS Glue (for batch).
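
To make the late-data idea concrete, here is a minimal sketch in plain Python. It is not the Apache Beam API and not how Dataflow is implemented; it only models the concept of fixed event-time windows with a watermark and an allowed-lateness cutoff. The names `WINDOW_SECONDS`, `ALLOWED_LATENESS`, `window_start`, and `count_by_window` are hypothetical, and the watermark is simplified to "the largest event time seen so far", which is an assumption made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60        # fixed one-minute event-time windows (illustrative value)
ALLOWED_LATENESS = 3600    # accept records up to one hour past the window end


def window_start(event_time: int) -> int:
    """Return the start (in seconds) of the fixed window containing event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_by_window(event_times):
    """Count records per event-time window, consuming them in arrival order.

    `event_times` is any iterable of event timestamps in seconds. It can be
    a list (batch-like) or a generator (stream-like); the logic is the same.
    Simplification: the watermark is the largest event time seen so far.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = 0
    for event_time in event_times:
        start = window_start(event_time)
        window_end = start + WINDOW_SECONDS
        if watermark > window_end + ALLOWED_LATENESS:
            # Too late: the watermark has moved past the lateness cutoff.
            dropped.append(event_time)
        else:
            # On time, or late but within the allowed lateness.
            counts[start] += 1
        watermark = max(watermark, event_time)
    return dict(counts), dropped


# Event timestamps listed in the order they arrive.
arrivals = [
    10,    # on time
    70,    # on time
    3500,  # on time; watermark jumps to 3500
    30,    # phone was offline for about an hour: late, but still accepted
    9000,  # on time; watermark jumps to 9000
    40,    # far too late for window [0, 60): dropped
]

counts, dropped = count_by_window(arrivals)
print(counts)
print(dropped)
```

The record with event time 30 arrives after the record with event time 3500, yet it is still counted in the window starting at 0, because the watermark has not passed that window's end plus the allowed lateness. The record with event time 40 arrives after the watermark has reached 9000 and is dropped. Because `count_by_window` accepts any iterable, the same function works on a finite list or a generator, which loosely mirrors the "write once, run on batch or stream" idea.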

