How Airflow interacts with external databases
Templating External Files (`template_searchpath`)
Usually, we write SQL queries directly inside our Python code. However, for long queries, it is better to keep them in separate `.sql` files.
The "Extension" Rule: Airflow operators have a property called
template_ext. For theSQLExecuteQueryOperator, this is['.sql']. If you pass a string ending in.sqlto thesqlargument, Airflow won't treat it as a query; it will look for a file with that name.The "Searchpath" Problem: By default, Airflow only looks in the folder where your DAG file lives. If you store your SQL files in a different folder (like
/scriptsor/tmp), you must tell the DAG where to look by settingtemplate_searchpath=['/path/to/files'].
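
A minimal sketch of that check, assuming the `apache-airflow-providers-common-sql` package is installed:

```python
# Inspect which arguments get Jinja-templated and which extensions trigger
# file loading (assumes apache-airflow-providers-common-sql is installed).
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

print(SQLExecuteQueryOperator.template_fields)  # arguments rendered with Jinja
print(SQLExecuteQueryOperator.template_ext)     # extensions treated as files, e.g. ('.sql',)
```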
The Setup: How to query SQL files
Create a file named `insert_taxi.sql` in your `/sql` folder. Inside that file, write your query:

```sql
INSERT INTO taxi_data VALUES ('{{ ds }}', 'Yellow Cab', 15.50);
```

In your DAG, simply point to that file:
```python
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="taxi_ingestion",
    template_searchpath=["/usr/local/airflow/sql"],  # Tell Airflow where to look
    ...
) as dag:
    insert_task = SQLExecuteQueryOperator(
        task_id="insert_from_file",
        conn_id="my_postgres",
        sql="insert_taxi.sql",  # Airflow sees the .sql extension and loads the file
    )
```

Why the file approach is better
- Syntax Highlighting: Your code editor (VS Code, etc.) will give you proper SQL coloring and error checking inside a `.sql` file, which it won't do inside a Python string.
- Cleanliness: Your DAG file stays short and easy to read.
- Safety: It's harder to accidentally mess up the Python quotes (`"` or `'`) when the SQL lives in its own file.
How it works "Inside" the Task
Regardless of which way you choose, here is the sequence of events inside that single task instance:
- If Inline: Airflow takes the string.
- If File: Airflow goes to `template_searchpath`, finds the file, and reads its contents into a string.
- Template Rendering: Airflow looks for `{{ ds }}` and swaps it for the actual date from the Context Dictionary (a simplified illustration of this step follows after this list).
- Hook Execution: The Operator hands that final "rendered" string to the Hook to run on the database.
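
The rendering step is essentially Jinja substitution. Here is a simplified sketch of the idea using plain Jinja2 (which ships with Airflow), not Airflow's actual internals:

```python
from jinja2 import Template

sql_text = "INSERT INTO taxi_data VALUES ('{{ ds }}', 'Yellow Cab', 15.50);"

# Airflow pulls values such as ds from the task's context; here we pass one by hand.
print(Template(sql_text).render(ds="2024-01-01"))
# -> INSERT INTO taxi_data VALUES ('2024-01-01', 'Yellow Cab', 15.50);
```

In the Airflow UI, the task instance's Rendered Template view shows the final string that was handed to the Hook for a given run.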
Operators vs. Hooks: The "What" vs. the "How"
This is a fundamental architectural concept in Airflow.
The Operator (The "What"): This is what you see in your DAG. It defines a high-level goal, like "Execute this SQL query." It acts as the user interface.
The Hook (The "How"): This is a hidden worker inside the operator. You usually don't write code for the Hook yourself. The Hook handles the "dirty work": opening the connection to Postgres, managing the network handshake, sending the data, and closing the connection.
Key Takeaway: You use the Operator to define your pipeline logic; the Operator calls the Hook to talk to the actual database.
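
To make the split concrete, the same Hook can also be used directly when you need finer control than the operator gives you, for example from a `PythonOperator` callable. A minimal sketch, assuming the `my_postgres` connection and the Postgres provider are already set up:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_row_count():
    # The Hook does the "how": it looks up the my_postgres credentials,
    # opens the connection, runs the query, and cleans up afterwards.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    rows = hook.get_records("SELECT COUNT(*) FROM taxi_data;")
    return rows[0][0]
```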
External Dependencies (Providers)
Airflow doesn't come "pre-installed" with the ability to talk to every database. It uses a modular system. To talk to PostgreSQL, you need to install a specific "Provider" package. This keeps the core of Airflow lightweight.
- Package needed: `apache-airflow-providers-common-sql` (and often the specific Postgres provider, `apache-airflow-providers-postgres`). A quick check is sketched after this list.
- Analogy: Think of Airflow as a smartphone and the Providers as apps. To take a "Postgres photo," you need to download the "Postgres app."
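
A minimal sketch of that check; the imports raise `ImportError` if a provider package is missing from your environment:

```python
# These imports only succeed if the corresponding provider packages are installed.
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
```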
Putting it Together: The Data Flow
In your specific Wikipedia example, the process looks like this:
1. Schedule: Airflow triggers every hour.
2. Extraction: A task downloads and unzips the Wikipedia pageview data.
3. Transformation: Python logic extracts the specific counts you need.
4. Loading: The `SQLExecuteQueryOperator` reads a `.sql` file, fills in the `{{ ds }}` templates, and uses a Hook to insert those counts into your Postgres database (sketched end-to-end below).
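
Put together as a DAG, the flow could look roughly like the sketch below. The URL, file paths, parsing logic, and file/table names are illustrative assumptions, not the exact code from the example:

```python
from datetime import datetime
from urllib import request

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

def _download_pageviews(output_path, **context):
    # Extraction: fetch the gzipped pageview dump for this hour (placeholder URL).
    url = "https://dumps.wikimedia.org/other/pageviews/..."
    request.urlretrieve(url, output_path)

def _extract_counts(input_path, **context):
    # Transformation: parse the dump and keep only the page counts you need.
    ...

with DAG(
    dag_id="wikipedia_pageviews",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",          # schedule_interval on older Airflow versions
    catchup=False,
    template_searchpath=["/usr/local/airflow/sql"],
) as dag:
    download = PythonOperator(
        task_id="download_pageviews",
        python_callable=_download_pageviews,
        op_kwargs={"output_path": "/tmp/pageviews.gz"},
    )
    extract = PythonOperator(
        task_id="extract_counts",
        python_callable=_extract_counts,
        op_kwargs={"input_path": "/tmp/pageviews.gz"},
    )
    load = SQLExecuteQueryOperator(
        task_id="load_counts",
        conn_id="my_postgres",
        sql="insert_pageviews.sql",  # rendered with {{ ds }} before the Hook runs it
    )
    download >> extract >> load
```

Only the load task needs the SQL provider and the `my_postgres` connection; the first two tasks are plain Python.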
Summary Checklist for your Code
| Component | What you need to do |
| --- | --- |
| DAG Setting | Set `template_searchpath` if your SQL files are in a separate folder. |
| Operator | Use `SQLExecuteQueryOperator(sql="my_query.sql", ...)`. |
| Dependencies | Ensure `apache-airflow-providers-postgres` is installed in your environment (`pip install apache-airflow-providers-postgres`). |
| Connection | Define the connection (e.g. `my_postgres`) in the Airflow UI so the Hook knows the host and password; this is the `conn_id` you pass to the operator. |