DuckDB
Lecture from Andy Pavlo about DuckDB internals
Connecting to DuckDB: https://duckdb.org/docs/stable/
DuckDB is essentially the "SQLite of Analytics." While SQLite is designed for transactional (OLTP) workloads (like a mobile app's local storage), DuckDB is built for analytical (OLAP) workloads—think massive aggregations, joins, and filtering on millions or billions of rows.
Core Capabilities
- Columnar Storage & Vectorized Execution: Instead of reading row-by-row, it processes data in batches (vectors). This makes aggregations like `SUM()` or `AVG()` incredibly fast.
- Out-of-Process Analysis: You can query Parquet, CSV, and JSON files directly without importing them into the database.
- Zero-Dependency: It's a single binary. No server to install, no `systemd` to manage.
- Larger-than-Memory Processing: If a dataset doesn't fit in RAM, DuckDB "spills" to disk automatically, allowing you to process 100GB of data on a 16GB laptop.
- DuckLake Support: As of late 2025/2026, DuckDB has matured its DuckLake format, providing lakehouse-style capabilities (ACID, versioning) locally.
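A quick sketch of the "query files in place" idea (file paths are hypothetical):

```sql
-- Query a Parquet file directly; no CREATE TABLE or import step required
SELECT * FROM 'exports/events.parquet' LIMIT 5;

-- The same replacement scan works for CSV and JSON files
SELECT count(*) FROM 'exports/events.csv';
```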
The CLI & Dot Commands
The CLI is the fastest way to explore data. Start it by typing duckdb in your terminal.
🦆 Essential DuckDB CLI Commands
These commands manage the CLI environment and are prefixed with a dot (`.`).
| Command | Purpose | Example / Usage |
| --- | --- | --- |
| `.open` | Closes the current session and opens/creates a specific database file. | `.open production_metrics.db` |
| `.read` | Executes SQL statements from an external file. Perfect for running saved ETL scripts. | `.read transform_logs.sql` |
| `.tables` | Lists all tables, views, and sequences in the current database. | `.tables` |
| `.timer` | Toggles the execution timer to see how long a query takes. | `.timer on` |
| `.mode` | Changes the output format (e.g., duckbox, markdown, csv, json). | `.mode markdown` |
| `.maxrows` | Limits the number of rows displayed in the CLI (prevents flooding your screen). | `.maxrows 20` |
| `.excel` | Opens the result of your next query in your system's spreadsheet viewer. | `.excel` then `SELECT * FROM large_parquet;` |
| `.exit` | Safely closes the database and exits the CLI. | `.exit` (or Ctrl+D) |
| `.import [f] [t]` | Directly imports a CSV/TSV file into a new or existing table. | `.import data.csv raw_table` |
| `.schema [t]` | Shows the CREATE TABLE statement for a specific table. | `.schema gold_users` |
| `.once [file]` | Redirects only the very next query result to a file. | `.once sample.csv` |
- Table-based formats (best for few columns): duckbox, box, csv, ascii, table, list, and column. These display data in traditional row-and-column layouts.
- Line-based formats (best for many columns): json, jsonlines, and line. These show each record vertically or on separate lines, making wide datasets easier to read.
You can always check your current output mode by running .mode without arguments, and switch back to the default duckbox mode with .mode duckbox.
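A quick sketch of switching modes inside the CLI (the table name is hypothetical):

```
.mode json
SELECT * FROM wide_table LIMIT 2;   -- each row prints as a JSON object
.mode duckbox
```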
Running from Terminal (Non-interactive)
This non-interactive capability is exactly what makes DuckDB a "Data Engineering Swiss Army Knife." Because it can pipe data in and out via stdin and stdout, you can chain it together with other CLI tools like curl, grep, and jq. You can run SQL directly without ever entering the shell.
The Basics: -c and Piping
The -c flag allows you to pass a SQL string directly. By default, DuckDB will print the results to your terminal, but you can redirect that output.
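A minimal sketch (the queries and file names are hypothetical):

```bash
# Run a one-off query and print the result to the terminal
duckdb -c "SELECT 42 AS answer;"

# Redirect the output of a query into a file
duckdb -c "SELECT * FROM 'events.parquet' LIMIT 10;" > sample.txt
```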
Piping Data INTO DuckDB
You can pipe the output of another command (like cat or curl) directly into DuckDB using the /dev/stdin path:
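For example (the URL is a placeholder; it assumes the source emits CSV):

```bash
# Stream CSV from curl straight into a query via /dev/stdin
curl -s https://example.com/export.csv \
  | duckdb -c "SELECT count(*) AS rows FROM read_csv_auto('/dev/stdin');"
```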
"Friendly SQL" Extensions
DuckDB has modified standard SQL to make it less verbose for engineers:
- `SELECT * EXCLUDE (col_name)`: Select everything except specific sensitive or redundant columns.
- `FROM table_name`: You don't even need `SELECT *`. Just starting with `FROM` works.
- `GROUP BY ALL`: Automatically groups by every column that isn't an aggregate. No more listing 15 columns in your `GROUP BY` clause.
- `COLUMNS(...)`: Allows regex matching on column names, e.g., `SELECT COLUMNS('.*_id') FROM table`.
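A quick sketch combining a few of these shorthands (the `orders` table and its columns are hypothetical):

```sql
-- FROM-first shorthand: same as SELECT * FROM orders
FROM orders;

-- Everything except a sensitive column
SELECT * EXCLUDE (customer_email) FROM orders;

-- Group by every non-aggregate column automatically
SELECT region, product, sum(amount) AS revenue
FROM orders
GROUP BY ALL;

-- Regex-select every *_id column
SELECT COLUMNS('.*_id') FROM orders;
```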
Best Practices for Data Engineers
Avoid the "Small Files Trap"
Many tiny Parquet files kill performance.
Practice: Consolidate data into files of 100MB–500MB.
Command: Use `COPY ... TO ... (FORMAT PARQUET, OVERWRITE_OR_IGNORE)` to compact them.
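A compaction sketch, assuming the small files live under a raw/ prefix (paths are hypothetical):

```sql
-- Read every small file with a glob and rewrite them as one larger Parquet file
COPY (SELECT * FROM 'raw/events/*.parquet')
TO 'compacted/events.parquet' (FORMAT PARQUET);
```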
Use read_csv_auto
Don't define schemas manually unless you have to. DuckDB's sniffing logic is top-tier.
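For instance (the file path is hypothetical):

```sql
-- DuckDB sniffs the delimiter, header, and column types automatically
SELECT * FROM read_csv_auto('exports/users_2024.csv') LIMIT 5;

-- Inspect what it inferred
DESCRIBE SELECT * FROM read_csv_auto('exports/users_2024.csv');
```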
Memory Management
On shared systems or large ETL jobs, explicitly set your memory limit to prevent the OS from killing the process:
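A sketch of the relevant settings (the values are examples, not recommendations):

```sql
-- Cap DuckDB's RAM usage so the OS never kills the process
SET memory_limit = '8GB';

-- Control where larger-than-memory operators spill to disk
SET temp_directory = '/tmp/duckdb_spill';
```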
Direct Parquet Querying
When building pipelines, don't "load" data into a DuckDB table unless you need to run multiple queries against it. Querying the Parquet file directly is often just as fast and saves disk space.
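A sketch of querying Parquet in place instead of loading it first (paths and columns are hypothetical):

```sql
-- No CREATE TABLE, no INSERT: aggregate straight off the files
SELECT customer_id, sum(amount) AS total_spend
FROM 'sales/2024/*.parquet'
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 10;
```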
In-memory mode: :memory:
While not a flag starting with a dash, the first argument to duckdb is the database file. Using :memory: explicitly tells DuckDB not to persist anything to disk.
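For example:

```bash
# Explicit in-memory session: nothing is persisted between runs
duckdb :memory:

# Running duckdb with no argument also starts an in-memory database
duckdb
```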
Demo
To showcase the "Zero-Persistence" and "Transformation" concepts, let's build a pipeline that takes a nested JSON file, "shreds" it into a flat structure, and saves it as an optimized Parquet file—all without ever creating a permanent table in a database.
The "Shred & Convert" Pipeline
Imagine you have a JSON file (raw_data.json) where each row contains a nested object:
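Something like this newline-delimited JSON, with a nested user object and a list of tags (the exact fields are assumed for illustration):

```json
{"id": 1, "user": {"name": "Alice", "email": "alice@example.com"}, "tags": ["beta", "newsletter"]}
{"id": 2, "user": {"name": "Bob", "email": "bob@example.com"}, "tags": ["beta"]}
```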
You can run this single command to transform and export it:
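A sketch of that one-shot command, using the field names assumed above (`"user"` is quoted in case it collides with a keyword):

```bash
# Shred nested JSON and write Parquet in a single, stateless command
duckdb -c "
COPY (
    SELECT
        id,
        \"user\".name  AS user_name,
        \"user\".email AS user_email,
        unnest(tags)   AS tag
    FROM read_json_auto('raw_data.json')
) TO 'clean_data.parquet' (FORMAT PARQUET);
"
```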
Concepts covered:
- Zero Persistence: We used `read_json_auto` directly in the `FROM` clause. DuckDB didn't "import" the data; it streamed it from the file, processed it in memory, and wrote it back out.
- Shredding (JSON Flattening): The `user.name` syntax "shreds" the `user` struct into individual columns. The `unnest(tags)` function takes a list and creates a new row for every item (turning one user row into multiple rows if they have multiple tags).
- Translation (JSON to Parquet): In one step, you've converted a row-based, semi-structured format (JSON) into a high-performance, columnar format (Parquet).
Level Up: The Cloud Pipeline
If you're interested in S3, you can even do this across the internet. This command reads a remote JSON and uploads the shredded result to your S3 bucket:
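A sketch, run inside the DuckDB CLI, assuming the httpfs extension and S3 credentials are available (the URL and bucket are placeholders):

```sql
INSTALL httpfs;
LOAD httpfs;
-- Credentials can come from the environment or a CREATE SECRET statement
SET s3_region = 'us-east-1';

COPY (
    SELECT
        id,
        "user".name AS user_name,
        unnest(tags) AS tag
    FROM read_json_auto('https://example.com/raw_data.json')
) TO 's3://my-bucket/clean/clean_data.parquet' (FORMAT PARQUET);
```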
Key Takeaway
This approach is often called "One-Shot ETL." It’s incredibly useful for Data Engineers because it’s stateless—you don't have to manage a database server or a cleanup process. You just run the command and get your "Silver" data.