DLT - Data Load Tool
Data ingestion tool
Code examples with various scenarios
Use the workspace dashboard (web app) to see the state of a pipeline you've run locally
Adjusting the schema after running the pipeline
Transforming data after loading it
From local to production: deploying a pipeline with Airflow
How dlt uses Apache Arrow in its pipelines
What is dlt?
It is an open-source Python library that you install with a package manager such as pip, poetry, or uv. Unlike platforms like Airbyte or Fivetran, which run as separate services, dlt runs inside your Python code. This means you can run it in a Jupyter notebook, a Lambda function, or as a task within an Airflow DAG.
The "Killer Features" for Data Engineers
If you are building pipelines, dlt solves three specific headaches that usually require writing a lot of custom code:
1. Automated Schema Inference & Evolution
This is arguably its strongest feature. If you pull a JSON object from an API that has a new field user_rank that wasn't there yesterday (see the first sketch after this list):
Traditional way: The pipeline fails because the target table doesn't have that column.
With dlt: It detects the new field, alters the table in your Data Warehouse (Snowflake, BigQuery, DuckDB, etc.) to add the column, and then loads the data.
2. Automatic Normalization (Unnesting)
dlt automatically handles nested JSON. If an API returns a list of dictionaries nested inside a key, it breaks that list out into a child table and generates the foreign keys to link it back to the parent record (also shown in the first sketch below).
3. Declarative Incremental Loading
Instead of writing complex SQL logic to check MAX(updated_at), you flag a field in your Python resource as the cursor and dlt manages the state for you (see the second sketch below).
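A minimal sketch of the first two behaviours, using DuckDB as a local destination; the table, field, and pipeline names here are made up for illustration:

```python
import dlt

# First run: records with a nested list of dicts.
users_v1 = [
    {"id": 1, "name": "alice", "orders": [{"sku": "A-1", "qty": 2}]},
    {"id": 2, "name": "bob", "orders": [{"sku": "B-7", "qty": 1}]},
]

pipeline = dlt.pipeline(
    pipeline_name="users_demo",
    destination="duckdb",
    dataset_name="raw_users",
)

# dlt infers the schema, creates a "users" table, and breaks the nested
# "orders" list out into a "users__orders" child table with keys linking
# each row back to its parent record.
pipeline.run(users_v1, table_name="users")

# Second run: a new field ("user_rank") appears in the payload.
users_v2 = [
    {"id": 3, "name": "carol", "user_rank": 42, "orders": []},
]

# Instead of failing, dlt adds a user_rank column to the existing table
# and then loads the new records.
pipeline.run(users_v2, table_name="users")
```

Because both runs use the same pipeline name, dlt reuses the stored schema and state, which is how it notices that user_rank is new.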
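And a sketch of declarative incremental loading: the cursor field is declared as a default argument on the resource, and dlt stores the last seen value in pipeline state between runs. The endpoint and query parameter below are placeholders, not a real API:

```python
import dlt
import requests


@dlt.resource(primary_key="id", write_disposition="merge")
def tickets(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z"),
):
    # updated_at.last_value is whatever dlt remembered from the previous run.
    response = requests.get(
        "https://api.example.com/tickets",  # placeholder endpoint
        params={"updated_since": updated_at.last_value},
    )
    response.raise_for_status()
    yield response.json()


pipeline = dlt.pipeline(
    pipeline_name="tickets_demo",
    destination="duckdb",
    dataset_name="support",
)
pipeline.run(tickets)
```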
Where it fits in a Stack
Because it is just a library, it fits into the "Extract" and "Load" steps of an ELT pipeline.
You write: A Python script that yields data (dictionaries or lists).
dlt handles: Buffering, normalizing, creating tables, and inserting data into the destination.
Orchestration: You still use Airflow, Dagster, or Prefect to schedule the script.
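For the orchestration part, the pipeline call can simply live inside a scheduled task. A sketch with Airflow's TaskFlow API (Airflow 2.x assumed; dlt also ships Airflow deployment helpers, but a plain task is enough to show the idea, and the DAG and function names are illustrative):

```python
import dlt
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def load_users():
    @task
    def run_pipeline():
        pipeline = dlt.pipeline(
            pipeline_name="users_demo",
            destination="duckdb",
            dataset_name="raw_users",
        )
        # In a real DAG this would call your extraction logic instead of a literal list.
        pipeline.run([{"id": 1, "name": "alice"}], table_name="users")

    run_pipeline()


load_users()
```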
Quick Example Mental Model
You define a "resource" (a source of data) using a decorator, then "run" a pipeline that loads it into a destination:
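A minimal sketch of that mental model, loading into local DuckDB (resource and pipeline names are arbitrary; install locally with a command like pip install "dlt[duckdb]"):

```python
import dlt


# A "resource" is just a Python generator that yields dicts (or lists of dicts).
@dlt.resource(name="players")
def players():
    yield [
        {"id": 1, "name": "magnus", "rating": 2830},
        {"id": 2, "name": "hikaru", "rating": 2790},
    ]


# A pipeline ties resources to a destination and a dataset (schema).
pipeline = dlt.pipeline(
    pipeline_name="chess_demo",
    destination="duckdb",
    dataset_name="players_data",
)

# run() extracts, normalizes, creates the table, and loads the rows.
load_info = pipeline.run(players)
print(load_info)
```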
Why engineers like it
It bridges the gap between "I'll just write a quick Python script" (fast but fragile) and "I need to set up a heavy enterprise ETL tool" (robust but expensive and complex). You still write custom Python logic for extraction, but you get the robustness of an enterprise tool for the loading part.