Are workflow orchestrators used with streaming?


Workflow orchestrators (Airflow, Prefect, Dagster, Step Functions) are not usually used to run streaming jobs, but they are used to start, monitor, deploy, or manage them.

Here’s the breakdown.


Why Orchestrators Are NOT Usually Used to Run Streaming Jobs

Streaming jobs (Flink, Spark Structured Streaming, Kafka Streams) are:

  • long-running

  • continuously running (24/7)

  • event-driven, not schedule-driven

  • stateful (continuous checkpoints, watermarks)

  • not meant to be killed and restarted every few minutes

Workflow orchestrators are designed for:

  • Batch tasks

  • Finite jobs

  • Retry → success/failure → exit

  • DAG dependencies

These assumptions don’t match streaming workloads.

Example:

Airflow expects a task to start, do a finite amount of work, and exit with success or failure.

But a streaming job starts and then runs indefinitely; it never reaches a terminal state on its own.

So Airflow will time out, retry, fail the DAG, or corrupt its own state management.
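
To make the mismatch concrete, here is a minimal sketch (assuming Airflow 2.x; the DAG id and callables are hypothetical). The first callable fits Airflow's model; the second is shaped like a streaming job, and the timeout/retry settings that are sensible for batch actively work against it.

```python
# Minimal sketch, assuming Airflow 2.x; DAG id and callables are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def finite_batch_task():
    # Batch-shaped work: read input, transform, write output, return.
    print("load -> transform -> write")
    # Returning ends the task; Airflow marks it a success.


def never_ending_stream():
    # Streaming-shaped work: this never returns, so Airflow would hit
    # execution_timeout, kill the process, and retry -- destroying any
    # in-memory state the stream was holding.
    while True:
        pass  # consume events forever


with DAG(
    dag_id="batch_vs_streaming_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="finite_task",
        python_callable=finite_batch_task,
        execution_timeout=timedelta(minutes=30),  # reasonable for batch
        retries=2,                                # reasonable for batch
    )
```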


So How Do People Manage Streaming Jobs Instead?

They use streaming-native orchestration and cluster managers, such as:

Flink’s built-in JobManager

  • Submits jobs

  • Manages checkpoints

  • Restarts failed jobs from checkpoints with exactly-once state guarantees

Spark Structured Streaming on Databricks / EMR / Kubernetes

  • Jobs run as long-lived services

  • Checkpoint directories restore state (see the sketch below)

  • Cluster managers handle failures, scaling
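
For illustration, here is a minimal PySpark Structured Streaming sketch (the broker, topic, and S3 paths are hypothetical): the checkpointLocation option is what lets the cluster manager kill and restart the process without losing offsets or state.

```python
# Minimal sketch, assuming PySpark 3.x with the Kafka connector on the
# classpath; broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-stream").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

query = (
    events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream
    .format("parquet")
    .option("path", "s3a://lake/events/")
    # On restart, Spark recovers offsets and state from this directory,
    # so the job resumes where it left off instead of starting over.
    .option("checkpointLocation", "s3a://lake/checkpoints/events/")
    .start()
)

# Blocks forever: the process is a long-lived service, not a finite task.
query.awaitTermination()
```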

Kafka Streams

  • Self-managed inside the client library

  • Scaling = add/remove instances

  • Failover built-in

Cloud-native streaming services

  • Google Dataflow (Apache Beam)

  • AWS Kinesis Data Analytics

  • Azure Stream Analytics

These are managed streaming runtimes.

These systems handle:

  • auto-restarts

  • scaling

  • state recovery

  • continuous execution

Workflow orchestrators cannot replace these.


When Orchestrators ARE Used with Streaming Jobs

Even though orchestrators do not run streaming jobs, they do orchestrate around them.

Deploying a streaming job

  • Ship the new JAR/Python file to the cluster

  • Submit the job to Flink/Spark/Kubernetes (deployment sketch below)

  • Trigger rolling upgrade
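
A deployment DAG might look like the following sketch (assuming Airflow 2.x, an S3 artifact bucket, and a YARN cluster reachable via spark-submit; all names are hypothetical). Note that the task only submits the job and exits; the stream itself keeps running outside Airflow.

```python
# Minimal deployment sketch, assuming Airflow 2.x and spark-submit on PATH;
# bucket, file, and DAG names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="deploy_streaming_job",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # run on demand, not on a schedule
    catchup=False,
) as dag:
    ship_artifact = BashOperator(
        task_id="ship_artifact",
        bash_command="aws s3 cp ./stream_job.py s3://artifacts/stream_job.py",
    )

    # Submit and return immediately (waitAppCompletion=false), so the
    # Airflow task finishes while the stream runs on in the cluster.
    submit_job = BashOperator(
        task_id="submit_job",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--conf spark.yarn.submit.waitAppCompletion=false "
            "s3://artifacts/stream_job.py"
        ),
    )

    ship_artifact >> submit_job
```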

Monitoring

  • DAG tasks ping the job's health API (health-check sketch below)

  • Alert if the job is not running
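
A health-check task can be as simple as this sketch. Flink's /jobs/overview REST endpoint is real; the JobManager host and job name are hypothetical. Raising an exception fails the task, which is what fires the alert.

```python
# Minimal health check against Flink's REST API; the host and job name
# are hypothetical, /jobs/overview is a standard Flink REST endpoint.
import requests

FLINK_API = "http://jobmanager:8081"


def check_stream_health(job_name: str = "events-pipeline") -> None:
    resp = requests.get(f"{FLINK_API}/jobs/overview", timeout=10)
    resp.raise_for_status()
    jobs = resp.json().get("jobs", [])

    running = [
        j for j in jobs
        if j.get("name") == job_name and j.get("state") == "RUNNING"
    ]
    if not running:
        # Failing here trips whatever alerting the DAG has configured.
        raise RuntimeError(f"streaming job {job_name!r} is not RUNNING")
```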

Bootstrapping

  • Run a preparation step (bootstrap sketch after this list):

    • create topics

    • create checkpoint buckets

    • upload configs
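
A bootstrap step might look like this sketch (assuming the confluent-kafka and boto3 packages; broker, topic, and bucket names are hypothetical):

```python
# Minimal bootstrap sketch; assumes confluent-kafka and boto3 are installed.
import boto3
from confluent_kafka.admin import AdminClient, NewTopic

# 1. Create the topic the stream will consume.
admin = AdminClient({"bootstrap.servers": "broker:9092"})
futures = admin.create_topics(
    [NewTopic("events", num_partitions=12, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises if creation failed (e.g. topic already exists)

# 2. Create the bucket that will hold checkpoints (us-east-1 assumed;
# other regions need a CreateBucketConfiguration).
s3 = boto3.client("s3")
s3.create_bucket(Bucket="stream-checkpoints")
```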

Periodic side jobs

  • Checkpoint cleanup (cleanup sketch below)

  • Daily table optimizations

  • Partition compaction

  • Metric aggregation
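
These side jobs are ordinary finite batch tasks. For example, a daily checkpoint cleanup could be a sketch like this (bucket, prefix, and the 30-day retention are hypothetical; boto3 assumed):

```python
# Minimal daily-maintenance sketch: delete checkpoint files older than
# 30 days from S3; bucket and prefix are hypothetical.
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "stream-checkpoints"
PREFIX = "events/"
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    stale = [
        {"Key": obj["Key"]}
        for obj in page.get("Contents", [])
        if obj["LastModified"] < cutoff
    ]
    if stale:
        # delete_objects takes up to 1000 keys; one page at a time here.
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": stale})
```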

Reprocessing jobs

The orchestrator may trigger:

  • “run this job over historical data”

  • “backfill from S3 for last 30 days” (backfill sketch below)
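
Such a backfill is a finite batch job, so it fits the orchestrator's model perfectly. A sketch (assuming Airflow 2.x; the backfill.py script and its flags are hypothetical):

```python
# Minimal backfill-trigger sketch, assuming Airflow 2.x; the script,
# its flags, and the DAG id are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="backfill_events",
    start_date=datetime(2024, 1, 1),
    schedule=None,      # triggered manually when reprocessing is needed
    catchup=False,
) as dag:
    BashOperator(
        task_id="backfill_last_30_days",
        # Airflow's templated macros supply the date window; the batch job
        # reads historical data from S3 and writes to the stream's sink.
        bash_command=(
            "spark-submit backfill.py "
            "--start {{ macros.ds_add(ds, -30) }} --end {{ ds }}"
        ),
    )
```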

Cleanup / Stop / Resume

The orchestrator coordinates operations around the stream, but doesn't run the stream itself.


Real-World Example

Spark Structured Streaming pipeline

  • Streaming service: runs 24/7 as a Databricks job or on Kubernetes

  • Airflow orchestration:

    • Deploy new version of streaming code

    • Trigger restart with zero downtime

    • Run daily maintenance (optimize Delta tables)

    • Trigger a historical backfill job (batch)

Flink pipeline

  • Flink JobManager handles the actual stream

  • Airflow deploys/updates the job, monitors metrics

Kafka Streams application

  • Runs as microservice

  • Prefect monitors health and handles restarts

