Managing extensions


DuckDB’s extension system is the "secret sauce" that allows it to remain a tiny, single-file binary while still being able to talk to S3, query Postgres, or perform geospatial analysis.


The Three Tiers of Extensions

Extensions are categorized based on who maintains them and how much you should trust them.

| Tier | Source | Security | Purpose |
| --- | --- | --- | --- |
| Built-in | Statically linked in the binary | ✅ Maximum | Critical features like parquet and json. |
| Core | Official DuckDB repository | ✅ Signed by DuckDB | High-performance connectors like httpfs (S3), postgres_scanner, and iceberg. |
| Community | community-extensions.duckdb.org | ⚠️ Use with care | Specialized tools like gsheets (Google Sheets) or prql. |


Autoloading: "Magic" for Engineers

One of the best features for your ETL scripts is Autoloading. You don't always have to run INSTALL or LOAD.

  • If you run SELECT * FROM 's3://my-bucket/data.parquet', DuckDB detects the s3:// prefix and automatically installs and loads the httpfs extension for you.

Note: Autoloading is enabled by default in the CLI, but it may need to be explicitly configured when embedding DuckDB in a Go or Python application for security reasons.

Install and load an extension manually:
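A minimal sketch of the manual path, using spatial as a stand-in for any extension name:

```sql
-- Download the extension into the local extension directory (needed once per machine/version)
INSTALL spatial;

-- Load it into the current session (needed every session unless it autoloads)
LOAD spatial;
```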


Essential Extensions for Your Kit

As you build your skills, these are the ones you will use 90% of the time:

  • httpfs (S3/HTTP): Allows you to query files directly from cloud storage.

  • postgres_scanner / mysql_scanner: Allows you to "mount" an external database and join its tables directly with local Parquet files.

  • spatial: Adds PostGIS-like capability for geographic data.

  • iceberg / delta: Crucial for "Lakehouse" architectures. They let DuckDB read the metadata of these modern table formats.


Managing Extensions via CLI

Since you’re mastering the CLI commands, add these to your reference list:
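A starter set to run inside the CLI; the UPDATE EXTENSIONS statement assumes a reasonably recent DuckDB release:

```sql
-- See which extensions are installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions();

-- Install and load on demand
INSTALL httpfs;
LOAD httpfs;

-- Re-download an extension (e.g., after a botched install)
FORCE INSTALL httpfs;

-- Upgrade installed extensions to the latest compatible versions (recent releases only)
UPDATE EXTENSIONS;
```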


httpfs extension

While S3 is the most common use case for the httpfs extension, its name actually gives away its broader purpose: HTTP File System. It essentially treats the entire internet (anything reachable via a URL) as a local drive for DuckDB.

Here is how httpfs works across different protocols and providers.

Direct HTTP(S) Querying

You don't need a cloud provider account to use httpfs. You can query any publicly hosted file (CSV, Parquet, or JSON) directly from a URL.

  • The "Range Request" Magic: This is the most impressive part. If you query a 10GB Parquet file but only need 1 column, httpfs uses HTTP Range Requests to download only the specific bytes for that column. It doesn't download the whole file.
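A quick sketch against a placeholder URL and columns (httpfs autoloads for http(s):// paths in a default CLI setup):

```sql
-- Only the byte ranges needed for these columns are fetched from the server
SELECT station, avg(temperature) AS avg_temp
FROM 'https://example.com/data/weather.parquet'
GROUP BY station;
```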

Advanced httpfs Configuration

To make these remote queries work reliably, you often need to tune the "connection" settings. These are great to include in your "Best Practices":

  • s3_uploader_max_filesize: Useful if you are using COPY to write to S3. It caps the size of a single uploaded file and, together with the maximum number of parts, determines the multipart upload part size.

  • http_timeout: If you're on a slow connection (like at a cafe) and querying a large remote file, increase this to prevent the query from timing out.

  • http_retries: Crucial for production ETL. If a network hiccup occurs, DuckDB will retry the range request instead of failing the whole pipeline.
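A sketch of tuning these in a session; the values are illustrative, and the http_timeout value is assumed to be in milliseconds here (verify the unit for your version):

```sql
-- Give slow connections more headroom (assumed to be milliseconds; verify for your version)
SET http_timeout = 120000;

-- Retry transient network hiccups instead of failing the whole pipeline
SET http_retries = 5;

-- Cap the size of objects written to S3 via COPY
SET s3_uploader_max_filesize = '100GB';
```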


Global Cloud Support (GCS, AWS, and Azure)

DuckDB covers Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage with the same secrets-based logic but different prefixes and extensions: httpfs handles S3 and GCS (via its S3-compatible API), while Azure Blob Storage uses the dedicated azure extension.

Amazon S3

S3 is the gold standard for httpfs. DuckDB doesn't just read from S3; it can also write to it, making it a powerful tool for uploading your "Silver" or "Gold" Parquet files directly to the cloud.

1. Connection Methods

There are two ways to connect DuckDB to your S3 bucket:

A. The "Legacy" Way (Environment Variables / Settings). This is what you'll see in most older tutorials.
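A sketch of the settings-based approach; every value below is a placeholder:

```sql
-- Session-level settings: simple, but they can leak into logs and only hold one credential set
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'AKIA...';
SET s3_secret_access_key = 'your-secret-key';
```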

B. The "Modern" Way (Secrets Manager). As of 2024/2025, this is the best practice. It's cleaner and allows you to manage multiple sets of credentials (e.g., one for prod, one for dev).
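A sketch with two scoped secrets (names, keys, and buckets are placeholders); the SCOPE clause tells DuckDB which credentials apply to which path prefix:

```sql
-- DuckDB picks the secret whose SCOPE matches the s3:// path being queried
CREATE SECRET dev_s3 (
    TYPE S3,
    KEY_ID 'dev-key-id',
    SECRET 'dev-secret-key',
    REGION 'us-east-1',
    SCOPE 's3://my-dev-bucket'
);

CREATE SECRET prod_s3 (
    TYPE S3,
    KEY_ID 'prod-key-id',
    SECRET 'prod-secret-key',
    REGION 'us-east-1',
    SCOPE 's3://my-prod-bucket'
);
```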

2. Writing to S3 (The "Uploader")

As a Data Engineer, you won't just be querying; you'll be publishing. httpfs allows you to stream a query result directly into an S3 object:
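A sketch that publishes a query result straight to an object; the table and bucket names are placeholders:

```sql
-- Stream a cleaned "Silver" dataset to S3 as Parquet
COPY (SELECT * FROM silver_orders)
TO 's3://my-bucket/silver/orders.parquet' (FORMAT PARQUET);
```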

3. S3-Compatible Storage (MinIO / Cloudflare R2)

One of the best "hacks" for your portfolio is using httpfs with MinIO (for local S3 testing) or Cloudflare R2 (for zero-egress-fee storage). You simply change the endpoint:
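A sketch for a local MinIO instance; the endpoint, keys, and bucket are placeholders, and Cloudflare R2 follows the same shape with your account's endpoint:

```sql
CREATE SECRET minio_local (
    TYPE S3,
    KEY_ID 'minioadmin',
    SECRET 'minioadmin',
    ENDPOINT 'localhost:9000',
    URL_STYLE 'path',   -- MinIO serves path-style URLs rather than virtual-hosted ones
    USE_SSL false
);

SELECT count(*) FROM 's3://my-local-bucket/data.parquet';
```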

Google Cloud Storage (GCS)

DuckDB uses the s3:// syntax for GCS as well, but you have to point it to the Google endpoint:
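A sketch using GCS HMAC (interoperability) keys over the S3-compatible endpoint; the keys and bucket are placeholders. Recent releases also offer a dedicated GCS secret type with gs:// paths:

```sql
-- Point the S3 machinery at Google's interoperability endpoint
SET s3_endpoint = 'storage.googleapis.com';
SET s3_access_key_id = 'gcs-hmac-access-key';
SET s3_secret_access_key = 'gcs-hmac-secret';

SELECT * FROM 's3://my-gcs-bucket/events.parquet' LIMIT 10;
```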

Azure Blob Storage

Azure uses its own dedicated prefix (az://) and credential settings:
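A sketch with a placeholder connection string and container name:

```sql
INSTALL azure;
LOAD azure;

-- Connection string values are placeholders
SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...';

SELECT * FROM 'az://my-container/landing/data.parquet';
```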

Cloud Protocol Summary Table

This is how you should categorize the protocols:

| Protocol | Service | Extension Required | Key Settings |
| --- | --- | --- | --- |
| s3:// | AWS S3 | httpfs | s3_region, s3_access_key_id |
| s3:// | Google GCS | httpfs | s3_endpoint = 'storage.googleapis.com' |
| s3:// | MinIO / R2 | httpfs | s3_endpoint, s3_url_style |
| az:// | Azure Blob | azure | azure_storage_connection_string |
| https:// | Public Web | httpfs | None (Zero-Config) |


Security: Secrets Manager (DuckDB 0.10+)

https://duckdb.org/docs/stable/configuration/secrets_manager, https://duckdb.org/docs/stable/sql/statements/create_secret

Instead of manually SET-ing keys (which can leak into your logs), DuckDB now has a dedicated Secrets Manager within the extension system:
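A minimal sketch: a session-scoped secret plus a persistent variant that survives restarts (all key values are placeholders):

```sql
-- Temporary secret: disappears when the process exits
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-key',
    REGION 'eu-west-1'
);

-- Persistent secret: written (unencrypted) under ~/.duckdb/ so it survives restarts
CREATE PERSISTENT SECRET my_s3_persistent (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-key',
    REGION 'eu-west-1'
);

-- Inspect what is registered; secret values are redacted
SELECT name, type, scope FROM duckdb_secrets();
```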

Once created, DuckDB automatically applies these credentials whenever you query an s3:// path.


The "Scanner" Pattern (External Databases)

While httpfs allows you to query files, the Scanner extensions (suffix _scanner) allow you to treat entire external databases as if they were local DuckDB schemas. This is a level above just reading a file.

Example: The SQLite Scanner. If you have an old SQLite database from a previous project, you don't need to export it to CSV first.
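A sketch with a hypothetical old_project.db file and its users/orders tables:

```sql
INSTALL sqlite;
LOAD sqlite;

-- Mount the SQLite file as a schema and query it like any other DuckDB catalog
ATTACH 'old_project.db' AS legacy (TYPE sqlite);

SELECT u.name, count(*) AS order_count
FROM legacy.users AS u
JOIN legacy.orders AS o ON o.user_id = u.id
GROUP BY u.name;
```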


The Community Repository

With version 0.10.0, DuckDB introduced a formal Community Extension Repository. These extensions are not hosted by DuckDB officially, but they are accessible directly through the CLI. To use them, you must specify the source.

Example: The Excel Extension. While your list showed excel as true, there is a community-enhanced version for advanced formatting.
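A sketch of the general pattern, with the prql community extension mentioned earlier standing in for whichever community extension you need:

```sql
-- FROM community pulls from community-extensions.duckdb.org instead of the core repository
INSTALL prql FROM community;
LOAD prql;
```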

  • Safety Note: When using community extensions, DuckDB requires you to be explicit. You might need to run SET custom_extension_repository = 'community-extensions.duckdb.org'; depending on your environment settings.


Persistent vs. In-Memory Installations

A common point of confusion for new engineers is where these extensions actually "live" once you run INSTALL.

  • Location: Extensions are usually installed in a hidden folder in your home directory (e.g., ~/.duckdb/extensions/...).

  • Persistence: You only need to INSTALL an extension once per machine/version. However, you (or the autoloader) must LOAD it every time you start a new session.

You can verify the version of the extension to ensure it matches your DuckDB binary:
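A sketch using the duckdb_extensions() table function; the extension_version column assumes a reasonably recent release:

```sql
SELECT extension_name, extension_version, installed, install_path
FROM duckdb_extensions()
WHERE installed;
```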


Full-Text Search (FTS)

Your list showed fts as installed. This is a "Built-in" extension that turns DuckDB into a mini-search engine (similar to Elasticsearch but much simpler).

Example: Indexing a column for search
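A sketch against a hypothetical documents table with doc_id and body columns:

```sql
INSTALL fts;
LOAD fts;

-- Build the index over body, keyed by doc_id; this creates the fts_main_documents schema
PRAGMA create_fts_index('documents', 'doc_id', 'body');

-- Rank matches with BM25; non-matching rows get a NULL score
SELECT doc_id, score
FROM (
    SELECT doc_id,
           fts_main_documents.match_bm25(doc_id, 'duckdb extensions') AS score
    FROM documents
)
WHERE score IS NOT NULL
ORDER BY score DESC;
```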

Pro-Tip: The "Force" Load. If you are working in a restricted environment (like a CI/CD runner or a locked-down server) where DuckDB cannot reach the internet to download extensions, you must pre-install them elsewhere, bundle the .duckdb_extension files found in the install_path shown in the query above, and LOAD them by file path.
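A sketch, assuming the bundled files were copied to a hypothetical /opt/duckdb_extensions directory on the runner:

```sql
-- Load a bundled extension directly from disk, skipping the download step
LOAD '/opt/duckdb_extensions/httpfs.duckdb_extension';
```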

