Workflow orchestration


Workflow orchestrators are the control plane for data pipelines — they're systems that schedule, execute, monitor, and manage dependencies between data processing tasks.

What They Actually Do

At their core, orchestrators solve a deceptively simple problem: "Run step B only after step A succeeds, handle failures gracefully, and do this reliably at scale." They provide:

Dependency management — Ensuring tasks run in the correct order based on their dependencies, whether that's task-to-task or data-to-data dependencies.

Scheduling — Running workflows on time-based triggers (cron-like), event-based triggers (new file arrives), or on-demand.

Error handling and retries — Automatically retrying failed tasks, sending alerts, and managing partial failures without corrupting your data state.

Observability — Logging execution history, tracking task duration, visualizing pipeline structure, and monitoring data freshness.

Resource management — Distributing work across compute resources, managing concurrency, and preventing resource contention.
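To make this concrete, here is a minimal sketch of what those capabilities look like in code, using Airflow as one example (the DAG id, task names, and callables are illustrative placeholders):

```python
# Minimal Airflow sketch: a schedule, per-task retries, and explicit dependencies.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source system")


def transform():
    print("clean and reshape the extracted data")


def load():
    print("write the result to the warehouse")


with DAG(
    dag_id="example_daily_pipeline",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",                     # time-based trigger: every day at 2 AM
    catchup=False,
    default_args={
        "retries": 3,                         # automatic retries on failure
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependency management: run strictly in this order
    extract_task >> transform_task >> load_task
```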

Why DataOps Needs Orchestration

In modern data operations, you're typically dealing with:

  • Complex dependencies: Your marketing dashboard depends on cleaned customer data, which depends on raw database extracts, which depend on overnight batch jobs

  • Heterogeneous systems: You're moving data between databases, data warehouses, APIs, object storage, and ML platforms

  • Scale and reliability requirements: Pipelines need to handle terabytes of data and run reliably every day at 2 AM

  • Multiple stakeholders: Data engineers, analysts, and scientists all need their workflows coordinated

Without orchestration, you end up with brittle cron jobs, manual interventions, unclear dependencies, and "works on my machine" syndrome at the infrastructure level.


Common Patterns

ELT pipelines — Extract from sources, load into warehouse, transform with dbt (often orchestrated together)

ML pipelines — Feature engineering → model training → validation → deployment, with different cadences for each stage

Event-driven workflows — New data arrives in S3, triggers processing pipeline, updates downstream dashboards

Backfilling — Re-running historical data through updated pipeline logic without breaking production


The Orchestrator Landscape

The choice of orchestrator often depends on your use case:

  • Airflow: The dominant player, highly flexible, massive community, but can be operationally heavy

  • Dagster: Modern asset-centric approach, great for data-aware workflows, strong typing and testing

  • Kestra: Declarative, YAML-defined workflows with event-driven triggers

  • Mage.ai: Notebook-style pipeline development that mixes Python and SQL blocks

  • Prefect: Developer-friendly, dynamic workflows, good Python-native experience

  • Temporal: For complex stateful workflows, microservices orchestration

  • dbt Cloud: Specialized for transformation workflows in the analytics engineering space

  • Cloud-native options (AWS Step Functions, GCP Workflows, Azure Data Factory): Integrated with cloud ecosystems, serverless


What Makes a Good Orchestrator in DataOps

Declarative definitions — Pipeline logic in version-controlled code, not clicking through UIs

Strong observability — You need to know not just if tasks ran, but whether the data is correct

Flexible execution — Run locally for development, in containers for production, across different environments

Data awareness — Understanding data lineage, freshness, and quality, not just task success/failure

Reasonable operational overhead — It shouldn't take a team of platform engineers just to keep the orchestrator running


Task-based and Asset-based orchestrators

The distinction between task-based and asset-based orchestrators is fundamental to how you model and think about data workflows.

The Core Concept: Orchestration and DAGs

Data orchestration serves as the central nervous system of modern data platforms. It manages dependencies to ensure data moves reliably from ingestion to analytics.

At the heart of orchestration is the Directed Acyclic Graph (DAG).

  • Directed: The workflow moves in one direction (upstream to downstream).

  • Acyclic: The workflow cannot loop back on itself (no infinite cycles).

  • Graph: A mathematical structure consisting of nodes (steps or assets) and edges (dependencies).

Two Primary Paradigms

Within the DAG model, there are two primary ways to define a workflow: around the tasks that run, or around the data assets those tasks produce.

A. Task-Centric Orchestration (The "Imperative" Approach)

  • Focus: The Verbs (Actions).

  • Mental Model: You are choreographing a sequence of operations. The pipeline logic is the star, and data assets are merely side effects of task execution.

  • Philosophy: "Do this, then do that." You explicitly define the control flow.

  • Examples: Apache Airflow, Prefect, Luigi.

How it works:

You define a DAG of tasks where dependencies are execution-based (e.g., "Task B runs after Task A completes"). The system focuses on the machinery of execution; in the baking analogy, the pipeline is just an ordered sequence of steps (see the sketch after this list):

  1. Gather Ingredients

  2. Mix Ingredients

  3. Bake
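Expressed task-centrically, the analogy might look like the following Airflow sketch (task names and callables are placeholders). Note that the orchestrator only knows the order of operations, not what each step is supposed to produce:

```python
# Task-centric sketch of the baking analogy: "do this, then do that".
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def gather_ingredients():
    print("gathering ingredients")


def mix_ingredients():
    print("mixing ingredients")


def bake():
    print("baking")


with DAG(dag_id="bake_cookies_tasks", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    gather = PythonOperator(task_id="gather_ingredients", python_callable=gather_ingredients)
    mix = PythonOperator(task_id="mix_ingredients", python_callable=mix_ingredients)
    bake_step = PythonOperator(task_id="bake", python_callable=bake)

    # The DAG encodes execution order only -- it has no notion of the dough or cookies.
    gather >> mix >> bake_step
```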

Pros & Cons:

This approach is highly flexible and can orchestrate non-data tasks (e.g., sending emails, spinning up servers). However, the orchestrator often lacks visibility into what was actually produced. If the "Mix" task succeeds but produces an empty bowl, the "Bake" task will still run, potentially wasting resources or corrupting downstream data.

B. Asset-Centric Orchestration (The "Declarative" Approach)

  • Focus: The Nouns (Data Assets).

  • Mental Model: You are declaring a graph of data products. The assets (tables, files, ML models) are the first-class citizens, and the tasks are just the machinery that materializes them.

  • Philosophy: "I want this specific table to exist." The system figures out the necessary steps to create or update that asset based on its lineage.

  • Examples: Dagster, dbt (partially).

How it works:

You specify the desired end state, and dependencies are defined between assets (e.g., "Table B depends on Table A existing"). In the baking analogy, you declare the assets rather than the steps (see the sketch after this list):

  1. Asset: Cookie Dough (Defined by mixing wet & dry ingredients).

  2. Asset: Chocolate Chip Dough (Defined by adding chips to Cookie Dough).

  3. Asset: Fresh Cookies (Defined by baking the dough).
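The same analogy expressed asset-centrically, as a minimal Dagster sketch (asset names are illustrative). Each function declares which upstream assets it is built from, and the execution order follows from that lineage:

```python
# Asset-centric sketch: the nouns (dough, cookies) are first-class citizens.
from dagster import Definitions, asset


@asset
def cookie_dough():
    # Defined by mixing wet & dry ingredients
    return "wet + dry ingredients, mixed"


@asset
def chocolate_chip_dough(cookie_dough):
    # Depends on cookie_dough; Dagster infers the edge from the parameter name
    return cookie_dough + " + chocolate chips"


@asset
def fresh_cookies(chocolate_chip_dough):
    # Defined by baking the dough
    return "baked: " + chocolate_chip_dough


defs = Definitions(assets=[cookie_dough, chocolate_chip_dough, fresh_cookies])
```

Because the graph is defined over assets, rebuilding only chocolate_chip_dough and everything downstream of it is just a matter of selecting those assets at materialization time.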

Pros & Cons:

This approach offers superior observability (tracking data freshness and lineage) and reusability (intermediate assets like Cookie Dough can be used for Peanut Cookies without rewriting logic). It also allows for selective materialization: if you need to fix one table, you can rebuild just that asset and its downstream dependencies, whereas task-centric approaches often require running the full pipeline.

Technical Nuances & Practical Impact

To provide a complete architectural view, it is important to understand the practical implications of these models:

  • Imperative vs. Declarative: Task-centric is imperative (you say how to run it); Asset-centric is declarative (you say what you want).

  • Root Cause Analysis: Asset-centric orchestration automatically generates lineage graphs. If a CEO's dashboard is incorrect, you can trace the graph backward to find exactly which "ingredient" (upstream table) was corrupted.

  • Evolution of Tools: The lines are blurring. Modern task-centric tools have added features to bridge the gap, such as Airflow's Datasets (2.4+) for data-aware scheduling and XComs for passing metadata between tasks. However, their fundamental design still prioritizes execution order over data state.

Other Architectural Patterns

Beyond the Task vs. Asset dichotomy, modern engineering employs several other orchestration strategies:

  • Event-Driven Orchestration:

    Instead of running on a schedule (e.g., "Daily at 9 AM"), workflows are triggered by specific events, such as a file landing in an S3 bucket triggering a Lambda function. This is essential for low-latency or streaming data pipelines.

  • Serverless Orchestration:

    Utilizes managed cloud services (like AWS Step Functions or Google Cloud Workflows) to abstract the infrastructure entirely. You define states and transitions using JSON/YAML, and the cloud provider handles the compute resources.

  • Hybrid Orchestration:

    The industry standard often involves mixing methods. For example, a "Task-centric" orchestrator (like Airflow) might trigger a "Declarative" transformation job (like dbt) and wait for the assets to update. This combines the operational flexibility of tasks with the data quality benefits of assets.


In the asset-centric view of this example, the Base Cookie Dough acts as a central node that branches out. This highlights the reusability of assets: by creating the base dough once, we can drive two different downstream workflows (Chocolate Chip and Peanut) simultaneously.


Example: Airflow + dbt

In this hybrid model, Apache Airflow acts as the orchestrator (the "General Contractor"), managing the timeline and infrastructure, while dbt acts as the transformation engine (the "Specialized Architect"), handling the internal logic of the data models.

The Division of Labor

To implement this effectively, you must strictly define the responsibilities of each tool to avoid overlap.

| Feature | Apache Airflow (Task-Centric) | dbt (Asset-Centric) |
| --- | --- | --- |
| Primary Role | Orchestration & Ingestion. Moving data from Point A to Point B. | Transformation. Turning raw data into business logic inside the warehouse. |
| The "Verbs" | Extract, Load, Trigger, Alert, Retry. | Select, Group By, Join, Test, Document. |
| Scope | External systems (APIs, SFTP, AWS S3) → Data Warehouse. | Strictly inside the Data Warehouse (Snowflake, BigQuery, Redshift). |
| Language | Python | SQL (with Jinja templating) |

The Hybrid Workflow: Step-by-Step

Here is how a typical pipeline runs in a hybrid setup:

  1. Ingestion (Airflow):

    Airflow wakes up on a schedule (e.g., 2:00 AM). It runs a Python operator to fetch raw JSON data from a CRM API and dumps it into the Data Warehouse as a "Raw" table.

    • Status: The data is loaded but messy.

  2. The Hand-off (Airflow triggers dbt):

    Once the ingestion task succeeds, Airflow triggers a task to run dbt.

    • Command: Usually dbt run or dbt build.

  3. Transformation (dbt):

    dbt takes over. It reads the "Raw" table (defined as a Source) and runs a dependency graph of SQL models:

    • Staging: Cleans column names, casts data types.

    • Intermediate: Joins the CRM data with Marketing data.

    • Marts: Aggregates the data for the CFO’s dashboard.

  4. Validation (dbt):

    dbt runs tests (e.g., unique, not_null) to ensure the data is accurate.

  5. Completion (Airflow):

    Airflow detects that dbt finished successfully. It might then run a final task, such as sending a Slack notification to the team: "Daily Data Refresh Complete."
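A minimal sketch of this hand-off in Airflow, assuming a placeholder path for the dbt project and using a plain Python task as a stand-in for the Slack notification:

```python
# Hybrid DAG sketch: Airflow ingests and alerts, dbt transforms.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def ingest_crm_to_warehouse():
    print("fetch raw JSON from the CRM API and load it into a raw table")


def notify_team():
    print("Daily Data Refresh Complete")  # stand-in for a Slack notification


with DAG(
    dag_id="daily_crm_refresh",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",   # 2:00 AM, as in the walkthrough above
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_crm", python_callable=ingest_crm_to_warehouse)

    # The hand-off: run the whole dbt project once ingestion has succeeded
    run_dbt = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/project && dbt build",  # placeholder project path
    )

    notify = PythonOperator(task_id="notify_team", python_callable=notify_team)

    ingest >> run_dbt >> notify
```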

Integration Patterns: Solving the "Black Box" Problem

The biggest challenge in this hybrid approach is visibility.

The Naive Approach (The Black Box):

You use Airflow's BashOperator to run the command dbt run.

  • Problem: Airflow sees this as one single task. If you have 500 dbt models and one fails, Airflow just says "Task Failed." You have to dig into logs to find out which table broke.

The Advanced Approach (DAG Decomposition):

You use advanced libraries (like Astronomer Cosmos or custom Python parsers) that read dbt's manifest.json file.

  • Solution: This automatically converts every single dbt model into its own Task in the Airflow DAG.

  • Result: You get the best of both worlds. You see the asset lineage (from dbt) visualized as execution tasks (in Airflow).
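A simplified sketch of the custom-parser variant of this idea: read manifest.json, create one Airflow task per dbt model, and wire the dependencies from dbt's own lineage. The manifest path is a placeholder, and a production version would also need to handle seeds, snapshots, and tests:

```python
# DAG decomposition sketch: one Airflow task per dbt model, edges from the manifest.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

MANIFEST_PATH = "/opt/dbt/project/target/manifest.json"  # placeholder

with open(MANIFEST_PATH) as f:
    manifest = json.load(f)

with DAG(dag_id="dbt_decomposed", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    tasks = {}

    # One task per dbt model
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        tasks[node_id] = BashOperator(
            task_id=node["name"],
            bash_command=f"cd /opt/dbt/project && dbt run --select {node['name']}",
        )

    # Recreate dbt's lineage as Airflow dependencies
    for node_id, node in manifest["nodes"].items():
        if node_id not in tasks:
            continue
        for upstream_id in node["depends_on"]["nodes"]:
            if upstream_id in tasks:
                tasks[upstream_id] >> tasks[node_id]
```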

Putting the two tools together, the pieces fit like this:

  1. Airflow: Handles the "Extract API" and "Load" steps, because dbt cannot move data from an API to a database.

  2. dbt: Takes over once the data is in the warehouse. It handles the complex logic of joining Users and Orders to create Sales.

  3. Lineage: Even though Airflow triggered it, the dependencies inside the dbt portion are defined by the data assets (asset-centric), while the outer Airflow flow is defined by the schedule (task-centric).


Triggers in workflow orchestrators

In workflow orchestrators (like Apache Airflow, Prefect, Dagster, or Temporal), a trigger is the mechanism that initiates the execution of a workflow (often called a DAG or Flow).

Broadly speaking, triggers fall into a handful of categories: time-based, event-based (including sensor/polling variants), data-aware, and manual.

Here is a breakdown of the types of triggers you will encounter in data engineering.


1. Time-Based Triggers (Scheduled)

This is the traditional "batch" approach. The orchestrator is configured to run a workflow at a specific recurring time, regardless of whether there is new data or not.

  • Cron Schedules: The most common method. You define a Cron expression (e.g., 0 9 * * * for "every day at 9 AM").

  • Fixed Intervals: Running a job at a fixed cadence of minutes or hours (e.g., @hourly or timedelta(minutes=30)).

  • Backfill Capability: A unique feature of time-based triggers in tools like Airflow. If you define a start date in the past, the orchestrator can "catch up" and trigger runs for all the missed intervals between then and now.

Best for: Daily reporting, nightly ETL jobs, and regulatory backups where consistency is more important than latency.
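As a minimal sketch (the DAG id and task are illustrative), a cron schedule plus catchup=True in Airflow gives you exactly the backfill behaviour described above:

```python
# Time-based trigger sketch: cron schedule with catch-up enabled.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_report_9am",
    start_date=datetime(2024, 1, 1),  # a start date in the past...
    schedule="0 9 * * *",             # ...plus "every day at 9 AM"...
    catchup=True,                     # ...means Airflow creates runs for all missed intervals
) as dag:
    EmptyOperator(task_id="build_report")
```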

2. Event-Based Triggers (Reactive)

Event-based triggers initiate a workflow immediately in response to an external action. This moves away from "batch" thinking toward "real-time" or "near real-time" automation.

  • Webhooks / REST API: An external system sends an HTTP POST request to the orchestrator’s API to trigger a specific DAG.

    • Example: A user fills out a form on a website, which hits the API to trigger a "Send Welcome Email" workflow.

  • Message Queues (Pub/Sub): The orchestrator subscribes to a topic (like Kafka or RabbitMQ). When a specific message arrives, a worker picks it up and executes the workflow.

  • File Arrival: Triggering a pipeline the moment a file lands in an S3 bucket or FTP server. (Note: In some tools, this is handled via Sensors, described below).

Best for: CI/CD pipelines, user-facing applications, and responding to unpredictable external inputs.
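As a sketch of the API-driven pattern, an external system can trigger a DAG through Airflow's stable REST API (the host, credentials, and DAG id below are placeholders):

```python
# Event-based trigger sketch: an external system starts a DAG run via the REST API.
import requests

AIRFLOW_URL = "https://airflow.example.com/api/v1"  # placeholder host
DAG_ID = "send_welcome_email"                        # placeholder DAG

response = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    json={"conf": {"user_email": "new.user@example.com"}},  # payload passed to the run
    auth=("api_user", "api_password"),                      # placeholder credentials
    timeout=10,
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```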

3. Sensor / Polling Triggers

Technically a subset of event-based triggers, but they work differently under the hood: instead of receiving a push (like a webhook), the orchestrator actively asks (polls) whether a condition is met.

  • File Sensors: The workflow starts (usually on a schedule), but the first task is a "Sensor" that checks "Is the file in S3 yet?" It loops and sleeps until the file appears, then lets the rest of the DAG proceed.

  • Database/SQL Sensors: Checks if a specific row exists or a table has been updated before proceeding.

  • Cross-DAG Sensors: DAG B waits to run until it sees that DAG A has completed successfully.

Note: Sensors can be resource-intensive because they occupy a worker slot just to "wait." Modern orchestrators (Airflow 2.2+, for example) use Deferrable Operators (async) to release the worker slot while waiting, saving costs.
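A minimal sensor sketch in Airflow (bucket, key, and timings are placeholders): the DAG starts on a schedule, but its first task simply waits for the file to appear before anything else runs:

```python
# Sensor sketch: poll S3 until today's export lands, then process it.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="wait_for_daily_export",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="example-bucket",            # placeholder bucket
        bucket_key="exports/{{ ds }}/data.csv",  # templated key for the run date
        poke_interval=300,                       # check every 5 minutes
        timeout=6 * 60 * 60,                     # give up after 6 hours
        # deferrable=True,  # on recent provider versions: frees the worker slot while waiting
    )

    process = EmptyOperator(task_id="process_file")

    wait_for_file >> process
```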

4. Data-Aware / Asset-Based Triggers

This is the modern standard for data platforms (popularized by Dagster and Airflow 2.4+ "Datasets"). Instead of relying on time or tight coupling, workflows trigger based on the state of the data itself.

  • Producer-Consumer Logic:

    • Workflow A updates a table called clean_sales_data.

    • Workflow B is configured to run whenever clean_sales_data is updated.

  • Decoupling: Workflow A doesn't need to know Workflow B exists. It just updates the asset. This prevents the "fragile chain" problem where you try to time DAG B to start 10 minutes after DAG A, hoping DAG A finished in time.

Best for: Complex data meshes and modern data lakes where dependencies between tables are more important than the time of day.
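A minimal sketch of the producer-consumer pattern using Airflow Datasets (the dataset URI and DAG ids are illustrative):

```python
# Data-aware trigger sketch: Workflow B runs whenever clean_sales_data is updated.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

clean_sales_data = Dataset("warehouse://analytics/clean_sales_data")  # arbitrary URI

# Workflow A (producer): its task declares the dataset as an outlet
with DAG(dag_id="workflow_a_producer", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    EmptyOperator(task_id="update_clean_sales_data", outlets=[clean_sales_data])

# Workflow B (consumer): scheduled on the dataset, not on a clock,
# and with no knowledge of Workflow A
with DAG(dag_id="workflow_b_consumer", start_date=datetime(2024, 1, 1),
         schedule=[clean_sales_data], catchup=False):
    EmptyOperator(task_id="refresh_sales_dashboard")
```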

5. Manual Triggers (Ad-Hoc)

Sometimes automation isn't the goal.

  • UI Trigger: A user logs into the dashboard and clicks the "Play" button.

  • CLI Trigger: An engineer runs a command line script to force a run (e.g., airflow dags trigger example_dag).

Best for: Testing, debugging, re-running failed jobs, or irregular tasks like "End of Year" processing.


Summary Table

| Trigger Type | Mechanism | Latency | Use Case |
| --- | --- | --- | --- |
| Scheduled | Cron / Time | High (waits for slot) | Nightly Reports, Backups |
| Event-Based | Webhook / API | Low (Real-time) | CI/CD, User Actions |
| Sensor | Polling Loop | Variable | Waiting for S3 files / SQL rows |
| Data-Aware | Dataset Update | Medium (Near Real-time) | Modern ETL, Data Mesh |

