# Time-based scheduling

***

## Time-based scheduling

In Airflow 3, scheduling has become much more explicit: the shift is from simple preset strings (like `@daily`) to **Timetables**, the internal engines that calculate exactly when the next "Data Interval" should occur.

***

### Trigger-based approaches

#### The "Frequency" Approach (`DeltaTriggerTimetable`)

This is used when you want a fixed rhythm that **ignores the calendar grid**.

```python
schedule=DeltaTriggerTimetable(pendulum.duration(days=2))
```

* **How it works:** It says "Run exactly 48 hours after the previous run fired," rather than at a fixed clock time.
* **The Key Difference:** Unlike a cron job that runs every other day at midnight, a delta trigger is relative: if a run is delayed, the next one is calculated from that later point.
* **Mental Model:** Think of it as a **countdown timer**. Once it hits zero, it runs, then flips the hourglass over for another 2 days.
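The countdown-timer behaviour can be sketched in plain Python (an illustration of the arithmetic, not Airflow's internals): each trigger is computed relative to the previous one, so a delayed run shifts every later run.

```python
from datetime import datetime, timedelta

# The fixed rhythm: 2 days between runs, regardless of the calendar grid.
DELTA = timedelta(days=2)

def next_trigger(previous_fire_time: datetime) -> datetime:
    """Next run fires exactly one delta after the previous run fired."""
    return previous_fire_time + DELTA

on_time = next_trigger(datetime(2024, 1, 1))     # fires 2024-01-03 00:00
delayed = next_trigger(datetime(2024, 1, 1, 6))  # a 6h-late run shifts the
                                                 # next one to 2024-01-03 06:00
```

Note how the delay propagates: nothing snaps back to midnight, which is exactly what distinguishes this from the cron-style approach below.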

***

#### The "Calendar" Approach (`CronTriggerTimetable`)

This is the traditional way to schedule. It follows the **clock and calendar**.

```python
schedule=CronTriggerTimetable("@daily", timezone="UTC")
```

* **How it works:** It looks at the wall clock. `@daily` means "When the clock hits 00:00:00."
* **Timezone Awareness:** This is the big upgrade in the Airflow 3 syntax. You can explicitly attach a timezone to the schedule, which prevents Daylight Saving Time transitions from shifting your pipeline runs.
* **Mental Model:** Think of this as an **alarm clock**. It doesn't care how long the last job took; it only cares what time the clock says right now.
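The alarm-clock behaviour can be sketched the same way (illustration only, using the standard library rather than Airflow): the next trigger is the next midnight on the wall clock, no matter when, or for how long, the previous run happened.

```python
from datetime import datetime, timedelta

def next_daily_trigger(now: datetime) -> datetime:
    """'@daily' as an alarm clock: the next midnight after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)

# Whether the last run took 5 minutes or 5 hours, the answer is the same:
nxt = next_daily_trigger(datetime(2024, 3, 1, 15, 30))  # 2024-03-02 00:00
```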

***

#### The "Execution Window" (`start_date` and `end_date`)

When you combine these with `catchup=True`, you define a specific "slice of time" that Airflow is responsible for.

```python
schedule=DeltaTriggerTimetable(pendulum.duration(days=2)),  # the 2-day rhythm the table below assumes
start_date=pendulum.datetime(2024, 1, 1),
end_date=pendulum.datetime(2024, 1, 5),
catchup=True,
```

For example, here is what Airflow will do:

| **Execution Attempt** | **Logical Date (Start of Interval)** | **Does it run?**     |
| --------------------- | ------------------------------------ | -------------------- |
| **Run 1**             | 2024-01-01                           | Yes                  |
| **Run 2**             | 2024-01-03                           | Yes (2 days later)   |
| **Run 3**             | 2024-01-05                           | Yes (Final run)      |
| **Run 4**             | 2024-01-07                           | No (Past `end_date`) |

* `end_date`: This is your "**Off Switch**." It’s incredibly useful for one-time migrations (e.g., "I need to move all data from the old database for the month of January, then stop").
* **The Catchup Effect:** Because `catchup=True`, the moment you unpause this DAG, Airflow will realize it is 2026 and it "missed" those 2024 dates. It will hammer your system by running all 3 instances immediately.

#### Summary Table: Which one to use?

| **Scenario**                                                               | **Best Choice**                                   |
| -------------------------------------------------------------------------- | ------------------------------------------------- |
| "I want to run this at 2:00 AM every Tuesday."                             | `CronTriggerTimetable`                            |
| "I just want this to run every 30 minutes, I don't care about the clock."  | `DeltaTriggerTimetable`                           |
| "I need to process exactly one year of old data and then never run again." | `CronTriggerTimetable` + `start_date`/`end_date` + `catchup=True` |

#### A Note on the "Airflow Gap"

Always remember: **Airflow runs at the END of the interval.**

If your schedule is `@daily` and the logical date is **Feb 1st**, the task actually starts on **Feb 2nd**. It waits for the day to "complete" so it can process all the data that was generated *during* Feb 1st. (This applies to interval-based schedules; a trigger-based `CronTriggerTimetable` fires at the trigger time itself, with no such gap.)
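The gap is one line of arithmetic: for an interval-based schedule, the run fires at the *end* of the interval it is responsible for.

```python
from datetime import datetime, timedelta

logical_date = datetime(2026, 2, 1)              # start of the @daily interval
actual_start = logical_date + timedelta(days=1)  # the run fires on Feb 2nd,
                                                 # once the day has "completed"
```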

***

### Incremental processing with data intervals (interval-based approaches)

This highlights the fundamental shift from "running a script on a schedule" to "processing a specific window of time."

In Airflow, this is the difference between an **Interval-based** approach (where the orchestrator tells the code what to do) and a **Trigger-based** approach (where the code has to figure it out itself).

#### Interval-Based Scheduling (The "Airflow Way")

In this model, time is treated like a series of **fixed buckets**. Airflow manages these buckets and hands them to your task one by one.

* **The Mechanism:** When a DAG is scheduled for `@daily`, Airflow creates a "Data Interval." For February 4th, the interval is `[2026-02-04 00:00, 2026-02-05 00:00)`; the end is exclusive.
* **The "Exact Information":** Airflow passes variables such as `data_interval_start` and `data_interval_end` into your code (the legacy `execution_date` alias was removed in Airflow 3; `logical_date` marks the start of the interval).
* **The Benefit:** Your code doesn't need to know what time it is *now*. It only needs to say: "Give me all data between `start` and `end`."
* **Result:** This makes **backfilling** possible. You can run the Feb 4th bucket in mid-March, and it will still process exactly the same data.
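An interval-aware task can be sketched like this (the function and table names are illustrative): the window arrives from the orchestrator, so the code never asks "what time is it now?". In Airflow, the two datetimes would come from the `data_interval_start` / `data_interval_end` context variables.

```python
from datetime import datetime

def build_query(data_interval_start: datetime, data_interval_end: datetime) -> str:
    """Build an extraction query bounded by the interval Airflow hands us."""
    return (
        "SELECT * FROM events "
        f"WHERE ts >= '{data_interval_start:%Y-%m-%d %H:%M}' "
        f"AND ts < '{data_interval_end:%Y-%m-%d %H:%M}'"
    )

# Running the Feb 4th bucket today or in mid-March yields the identical query:
query = build_query(datetime(2026, 2, 4), datetime(2026, 2, 5))
```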

***

#### Trigger-Based Scheduling (The "Cron Way")

This is how traditional systems (or basic Python scripts) work. The system just says "It is 2:00 AM, wake up and run."

* **The Mechanism:** The task is triggered at a specific point in time.
* **The Task's Burden:** The code has to ask the system: "What time is it now?" and then calculate: "Okay, I guess I should look for data from the last 24 hours."
* **The Risk:** If the trigger fails and you run it 6 hours late, a simple "look back 24 hours" logic will **skip 6 hours of data** or double-count data from the next day.
* **Result:** This is much harder to audit or replay because the logic is "relative" to the moment of execution, not "absolute" to a data bucket.
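The fragile "look back 24 hours" pattern is easy to demonstrate: because the window is relative to the moment of execution, a late run silently drops data.

```python
from datetime import datetime, timedelta

def lookback_window(now: datetime) -> tuple:
    """The 'cron way': derive the window from the current time."""
    return now - timedelta(hours=24), now

on_time = lookback_window(datetime(2026, 2, 5, 0, 0))  # covers all of Feb 4th
late = lookback_window(datetime(2026, 2, 5, 6, 0))     # starts at Feb 4th 06:00,
                                                       # so 6 hours of Feb 4th
                                                       # are never processed
```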

***

#### Comparison: Incremental Processing

| **Feature**                 | **Interval-Based (Incremental)**                 | **Trigger-Based (Point-in-time)**                    |
| --------------------------- | ------------------------------------------------ | ---------------------------------------------------- |
| **Who defines the window?** | The Orchestrator (Airflow)                       | The Task (Your Python code)                          |
| **Variables provided**      | `data_interval_start` / `end`                    | Usually just "Now"                                   |
| **Backfill reliability**    | High. You can re-run any specific day perfectly. | Low. Hard to re-run "yesterday" if it's now "today." |
| **Gaps in data**            | Virtually impossible; intervals are contiguous.  | Likely if a job is delayed or fails.                 |

***
