DuckDB


Lecture from Andy Pavlo about DuckDB internals

Connecting to DuckDB: https://duckdb.org/docs/stable/


DuckDB is essentially the "SQLite of Analytics." While SQLite is designed for transactional (OLTP) workloads (like a mobile app's local storage), DuckDB is built for analytical (OLAP) workloads—think massive aggregations, joins, and filtering on millions or billions of rows.


Core Capabilities

  • Columnar Storage & Vectorized Execution: Instead of reading row-by-row, it processes data in batches (vectors). This makes aggregations like SUM() or AVG() incredibly fast.

  • Direct File Querying: You can query Parquet, CSV, and JSON files directly without importing them into the database (see the sketch after this list).

  • Zero-Dependency: It’s a single binary. No server to install, no systemd to manage.

  • Larger-than-Memory Processing: If a dataset doesn't fit in RAM, DuckDB "spills" to disk automatically, allowing you to process 100GB of data on a 16GB laptop.

  • DuckLake Support: As of late 2025, DuckDB's DuckLake format has matured, providing lakehouse-style capabilities (ACID transactions, versioning) locally.
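
As referenced in the list above, here is a minimal sketch of querying files in place; the file names and columns are hypothetical:

```sql
-- No import step: the file path goes straight into FROM
SELECT count(*) FROM 'raw/events.csv';
SELECT avg(latency_ms) FROM 'metrics.parquet';
```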



The CLI & Dot Commands

The CLI is the fastest way to explore data. Start it by typing duckdb in your terminal.

🦆 Essential DuckDB CLI Commands

These commands manage the CLI environment and are prefixed with a dot (.).

| Command | Purpose | Example / Usage |
| --- | --- | --- |
| .open | Closes the current session and opens/creates a specific database file. | .open production_metrics.db |
| .read | Executes SQL statements from an external file. Perfect for running saved ETL scripts. | .read transform_logs.sql |
| .tables | Lists all tables, views, and sequences in the current database. | .tables |
| .timer | Toggles the execution timer to see how long a query takes. | .timer on |
| .mode | Changes the output format (e.g., duckbox, markdown, csv, json). | .mode markdown |
| .maxrows | Limits the number of rows displayed in the CLI (prevents flooding your screen). | .maxrows 20 |
| .excel | Opens the result of your next query in your system's spreadsheet viewer. | .excel then SELECT * FROM large_parquet; |
| .exit | Safely closes the database and exits the CLI. | .exit (or Ctrl+D) |
| .import [f] [t] | Directly imports a CSV/TSV file into a new or existing table. | .import data.csv raw_table |
| .schema [t] | Shows the CREATE TABLE statement for a specific table. | .schema gold_users |
| .once [file] | Redirects only the very next query result to a file. | .once sample.csv |

Table-based formats (best for few columns): duckbox, box, csv, ascii, table, list, and column. These display data in traditional row-and-column layouts.

Line-based formats (best for many columns): json, jsonline, and line. These show each record vertically or in separate lines, making wide datasets easier to read.

You can always check your current output mode by running .mode without arguments, and switch back to the default duckbox mode with .mode duckbox.


Running from Terminal (Non-interactive)

This non-interactive capability is exactly what makes DuckDB a "Data Engineering Swiss Army Knife." Because it can pipe data in and out via stdin and stdout, you can chain it together with other CLI tools like curl, grep, and jq.

You can run SQL directly without entering the shell.

The Basics: -c and Piping

The -c flag allows you to pass a SQL string directly. By default, DuckDB will print the results to your terminal, but you can redirect that output.
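
For example (the database file and table are hypothetical, and the -csv flag simply switches the output mode):

```bash
# Run one query against a database file and print the result to the terminal
duckdb production_metrics.db -c "SELECT count(*) AS n FROM raw_table;"

# Force CSV output and redirect it into a file instead of the screen
duckdb -csv -c "SELECT 42 AS answer;" > answer.csv
```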

Piping Data INTO DuckDB

You can pipe the output of another command (like cat or curl) directly into DuckDB using the /dev/stdin path:
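
For example (the local file and URL are placeholders):

```bash
# Pipe a local CSV into DuckDB and aggregate it on the fly
cat events.csv | duckdb -c "SELECT count(*) AS n FROM read_csv_auto('/dev/stdin');"

# Stream a remote CSV from curl straight into a query
curl -s https://example.com/data.csv | duckdb -c "SELECT * FROM read_csv_auto('/dev/stdin') LIMIT 5;"
```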


"Friendly SQL" Extensions

DuckDB extends standard SQL to make it less verbose for engineers (see the examples after this list):

  • SELECT * EXCLUDE (col_name): Select everything except specific sensitive or redundant columns.

  • FROM table_name: You don't even need SELECT *. Just starting with FROM works.

  • GROUP BY ALL: Automatically groups by every column that isn't an aggregate. No more listing 15 columns in your GROUP BY clause.

  • COLUMNS(...): Allows regex matching on column names, e.g., SELECT COLUMNS('.*_id') FROM table.
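
A few illustrative queries against a hypothetical orders table with columns order_id, customer_id, amount, and notes:

```sql
-- Everything except a noisy column
SELECT * EXCLUDE (notes) FROM orders;

-- FROM-first syntax: equivalent to SELECT * FROM orders
FROM orders;

-- Group by every non-aggregated column without listing them
SELECT customer_id, sum(amount) AS total
FROM orders
GROUP BY ALL;

-- Regex over column names: every column ending in _id
SELECT COLUMNS('.*_id') FROM orders;
```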


Best Practices for Data Engineers

Avoid the "Small Files Trap"

Many tiny Parquet files kill performance.

  • Practice: Consolidate data into files of 100MB–500MB.

  • Command: Use COPY ... TO ... (FORMAT PARQUET, OVERWRITE_OR_IGNORE) to compact them.
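
A hedged sketch of the compaction pattern above; the paths and partition column are hypothetical:

```sql
-- Rewrite a folder of tiny Parquet files as one larger file
COPY (SELECT * FROM read_parquet('raw/*.parquet'))
TO 'events_compacted.parquet' (FORMAT PARQUET);

-- Or rewrite with partitioning; OVERWRITE_OR_IGNORE (see the bullet above)
-- allows writing into a directory that already contains files
COPY (SELECT * FROM read_parquet('raw/*.parquet'))
TO 'compacted' (FORMAT PARQUET, PARTITION_BY (event_date), OVERWRITE_OR_IGNORE);
```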

Use read_csv_auto

Don't define schemas manually unless you have to. DuckDB’s sniffing logic is top-tier.
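
For instance (the file name is hypothetical):

```sql
-- Let DuckDB sniff the delimiter, header, and column types
SELECT * FROM read_csv_auto('raw/users.csv') LIMIT 10;

-- Materialize it once the inferred schema looks right
CREATE TABLE users AS SELECT * FROM read_csv_auto('raw/users.csv');
```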

Memory Management

On shared systems or large ETL jobs, explicitly set your memory limit to prevent the OS from killing the process:
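
For example (the values are illustrative; tune them to your host):

```sql
-- Cap DuckDB's memory usage and choose where spill files go
SET memory_limit = '8GB';
SET temp_directory = '/tmp/duckdb_spill';
-- Optionally limit parallelism on shared machines
SET threads = 4;
```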

Direct Parquet Querying

When building pipelines, don't "load" data into a DuckDB table unless you need to run multiple queries against it. Querying the Parquet file directly is often just as fast and saves disk space.
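
A minimal sketch of querying Parquet directly inside a pipeline step (paths and columns are hypothetical):

```sql
-- No CREATE TABLE, no load step: the Parquet files are the table
SELECT customer_id, sum(amount) AS total_spend
FROM read_parquet('exports/orders/*.parquet')
WHERE order_date >= DATE '2025-01-01'
GROUP BY ALL;
```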


In-memory mode: :memory:

While not a dash-prefixed flag, the first positional argument to duckdb is the database file. Passing :memory: explicitly tells DuckDB to keep the database entirely in memory instead of writing a database file to disk.
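
For example:

```bash
# Explicit in-memory database: no database file is persisted to disk
duckdb :memory: -c "SELECT 1 + 1 AS two;"

# Omitting the file argument also starts an in-memory database
duckdb -c "SELECT 1 + 1 AS two;"
```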


Demo

To showcase the "Zero-Persistence" and "Transformation" concepts, let's build a pipeline that takes a nested JSON file, "shreds" it into a flat structure, and saves it as an optimized Parquet file—all without ever creating a permanent table in a database.

The "Shred & Convert" Pipeline

Imagine you have a JSON file (raw_data.json) where each row contains a nested object:
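
The exact schema isn't given, so assume a newline-delimited file shaped roughly like this (field names are hypothetical and reused in the command below):

```json
{"id": 1, "user": {"name": "Ada", "country": "UK"}, "tags": ["admin", "beta"]}
{"id": 2, "user": {"name": "Grace", "country": "US"}, "tags": ["beta"]}
```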

You can run this single command to transform and export it:
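
A sketch matching the hypothetical JSON above; the output file name is illustrative, and "user" is quoted only to guard against a possible keyword clash:

```bash
duckdb <<'SQL'
COPY (
    SELECT
        id,
        "user".name    AS user_name,
        "user".country AS user_country,
        unnest(tags)   AS tag
    FROM read_json_auto('raw_data.json')
) TO 'clean_data.parquet' (FORMAT PARQUET);
SQL
```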

Concepts covered:

  • Zero Persistence: We used read_json_auto directly in the FROM clause. DuckDB didn't "import" the data; it streamed it from the file, processed it in memory, and wrote it back out.

  • Shredding (JSON Flattening):

    • The user.name syntax "shreds" the user struct into individual columns.

    • The unnest(tags) function takes a list and creates a new row for every item (turning one user row into multiple rows if they have multiple tags).

  • Translation (JSON to Parquet): In one step, you've converted a row-based, semi-structured format (JSON) into a high-performance, columnar format (Parquet).

Level Up: The Cloud Pipeline

If you're interested in S3, you can even do this across the internet. This command reads a remote JSON and uploads the shredded result to your S3 bucket:
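
A hedged sketch, assuming the httpfs extension is available and S3 credentials are already configured; the URL and bucket are placeholders:

```bash
duckdb <<'SQL'
INSTALL httpfs;
LOAD httpfs;
-- Assumes S3 auth is already set up (e.g., via CREATE SECRET or environment variables)
COPY (
    SELECT
        id,
        "user".name  AS user_name,
        unnest(tags) AS tag
    FROM read_json_auto('https://example.com/raw_data.json')
) TO 's3://my-bucket/silver/clean_data.parquet' (FORMAT PARQUET);
SQL
```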

Key Takeaway

This approach is often called "One-Shot ETL." It’s incredibly useful for Data Engineers because it’s stateless—you don't have to manage a database server or a cleanup process. You just run the command and get your "Silver" data.


