Managing extensions


DuckDB’s extension system is the "secret sauce" that allows it to remain a tiny, single-file binary while still being able to talk to S3, query Postgres, or perform geospatial analysis.


The Three Tiers of Extensions

Extensions are categorized based on who maintains them and how much you should trust them.

| Tier | Source | Security | Purpose |
| --- | --- | --- | --- |
| Built-in | Statically linked in the binary | ✅ Maximum | Critical features like parquet and json. |
| Core | Official DuckDB repository | ✅ Signed by DuckDB | High-performance connectors like httpfs (S3), postgres_scanner, and iceberg. |
| Community | community-extensions.duckdb.org | ⚠️ Use with care | Specialized tools like gsheets (Google Sheets) or prql. |


Autoloading: "Magic" for Engineers

One of the best features for your ETL scripts is Autoloading. You don't always have to run INSTALL or LOAD.

  • If you run SELECT * FROM 's3://my-bucket/data.parquet', DuckDB detects the s3:// prefix and automatically installs and loads the httpfs extension for you.

Note: Autoloading is enabled by default in the CLI, but it may need to be explicitly configured when embedding DuckDB in a Go or Python application for security reasons.

Install and load an extension manually:
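A minimal sketch of the manual path, using spatial as a stand-in for any extension name:

```sql
-- Download the extension into the local extension directory (needed once per machine/version)
INSTALL spatial;

-- Load it into the current session (needed every session unless it autoloads)
LOAD spatial;
```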


Essential Extensions for Your Kit

As you build your skills, these are the ones you will use 90% of the time:

  • httpfs (S3/HTTP): Allows you to query files directly from cloud storage.

  • postgres_scanner / mysql_scanner: Allows you to "mount" an external database and join its tables directly with local Parquet files.

  • spatial: Adds PostGIS-like capability for geographic data.

  • iceberg / delta: Crucial for "Lakehouse" architectures. They let DuckDB read the metadata of these modern table formats.


Managing Extensions via CLI

Since you’re mastering the CLI commands, add these to your reference list:
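A starter set to run inside the CLI; the UPDATE EXTENSIONS statement assumes a reasonably recent DuckDB release:

```sql
-- See which extensions are installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions();

-- Install and load on demand
INSTALL httpfs;
LOAD httpfs;

-- Re-download an extension (e.g., after a botched install)
FORCE INSTALL httpfs;

-- Upgrade installed extensions to the latest compatible versions (recent releases only)
UPDATE EXTENSIONS;
```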


httpfs extension

While S3 is the most common use case for the httpfs extension, its name actually gives away its broader purpose: HTTP File System. It essentially treats the entire internet (anything reachable via a URL) as a local drive for DuckDB.

Here is how httpfs works across different protocols and providers.

Direct HTTP(S) Querying

You don't need a cloud provider account to use httpfs. You can query any publicly hosted file (CSV, Parquet, or JSON) directly from a URL.

  • The "Range Request" Magic: This is the most impressive part. If you query a 10GB Parquet file but only need 1 column, httpfs uses HTTP Range Requests to download only the specific bytes for that column. It doesn't download the whole file.
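A quick sketch against a placeholder URL and columns (httpfs autoloads for http(s):// paths in a default CLI setup):

```sql
-- Only the byte ranges needed for these columns are fetched from the server
SELECT station, avg(temperature) AS avg_temp
FROM 'https://example.com/data/weather.parquet'
GROUP BY station;
```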

Advanced httpfs Configuration

To make these remote queries work reliably, you often need to tune the "connection" settings. These are great to include in your "Best Practices":

  • s3_uploader_max_filesize: Useful if you are using COPY to write to S3. It caps the size of a single uploaded file and, together with the maximum number of parts, determines the multipart upload part size.

  • http_timeout: If you're on a slow connection (like at a cafe) and querying a large remote file, increase this to prevent the query from timing out.

  • http_retries: Crucial for production ETL. If a network hiccup occurs, DuckDB will retry the range request instead of failing the whole pipeline.
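A sketch of tuning these in a session; the values are illustrative, and the http_timeout value is assumed to be in milliseconds here (verify the unit for your version):

```sql
-- Give slow connections more headroom (assumed to be milliseconds; verify for your version)
SET http_timeout = 120000;

-- Retry transient network hiccups instead of failing the whole pipeline
SET http_retries = 5;

-- Cap the size of objects written to S3 via COPY
SET s3_uploader_max_filesize = '100GB';
```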


Global Cloud Support (GCS, AWS, and Azure)

DuckDB covers Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage with the same secrets-based logic but different prefixes and extensions: httpfs handles S3 and GCS (via its S3-compatible API), while Azure Blob Storage uses the dedicated azure extension.

Amazon S3

S3 is the gold standard for httpfs. DuckDB doesn't just read from S3; it can also write to it, making it a powerful tool for uploading your "Silver" or "Gold" Parquet files directly to the cloud.

1. Connection Methods

There are two ways to connect DuckDB to your S3 bucket:

A. The "Legacy" Way (Environment Variables / Settings). This is what you'll see in most older tutorials.
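A sketch of the settings-based approach; every value below is a placeholder:

```sql
-- Session-level settings: simple, but they can leak into logs and only hold one credential set
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'AKIA...';
SET s3_secret_access_key = 'your-secret-key';
```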

B. The "Modern" Way (Secrets Manager). As of 2024/2025, this is the best practice. It's cleaner and allows you to manage multiple sets of credentials (e.g., one for prod, one for dev).
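A sketch with two scoped secrets (names, keys, and buckets are placeholders); the SCOPE clause tells DuckDB which credentials apply to which path prefix:

```sql
-- DuckDB picks the secret whose SCOPE matches the s3:// path being queried
CREATE SECRET dev_s3 (
    TYPE S3,
    KEY_ID 'dev-key-id',
    SECRET 'dev-secret-key',
    REGION 'us-east-1',
    SCOPE 's3://my-dev-bucket'
);

CREATE SECRET prod_s3 (
    TYPE S3,
    KEY_ID 'prod-key-id',
    SECRET 'prod-secret-key',
    REGION 'us-east-1',
    SCOPE 's3://my-prod-bucket'
);
```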

2. Writing to S3 (The "Uploader")

As a Data Engineer, you won't just be querying; you'll be publishing. httpfs allows you to stream a query result directly into an S3 object:
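A sketch that publishes a query result straight to an object; the table and bucket names are placeholders:

```sql
-- Stream a cleaned "Silver" dataset to S3 as Parquet
COPY (SELECT * FROM silver_orders)
TO 's3://my-bucket/silver/orders.parquet' (FORMAT PARQUET);
```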

3. S3-Compatible Storage (MinIO / Cloudflare R2)

One of the best "hacks" for your portfolio is using httpfs with MinIO (for local S3 testing) or Cloudflare R2 (for zero-egress-fee storage). You simply change the endpoint:
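A sketch for a local MinIO instance; the endpoint, keys, and bucket are placeholders, and Cloudflare R2 follows the same shape with your account's endpoint:

```sql
CREATE SECRET minio_local (
    TYPE S3,
    KEY_ID 'minioadmin',
    SECRET 'minioadmin',
    ENDPOINT 'localhost:9000',
    URL_STYLE 'path',   -- MinIO serves path-style URLs rather than virtual-hosted ones
    USE_SSL false
);

SELECT count(*) FROM 's3://my-local-bucket/data.parquet';
```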

Google Cloud Storage (GCS)

DuckDB uses the s3:// syntax for GCS as well, but you have to point it to the Google endpoint:
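A sketch using GCS HMAC (interoperability) keys over the S3-compatible endpoint; the keys and bucket are placeholders. Recent releases also offer a dedicated GCS secret type with gs:// paths:

```sql
-- Point the S3 machinery at Google's interoperability endpoint
SET s3_endpoint = 'storage.googleapis.com';
SET s3_access_key_id = 'gcs-hmac-access-key';
SET s3_secret_access_key = 'gcs-hmac-secret';

SELECT * FROM 's3://my-gcs-bucket/events.parquet' LIMIT 10;
```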

Azure Blob Storage

Azure uses its own dedicated prefix (az://) and credential settings:
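A sketch with a placeholder connection string and container name:

```sql
INSTALL azure;
LOAD azure;

-- Connection string values are placeholders
SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...';

SELECT * FROM 'az://my-container/landing/data.parquet';
```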

Cloud Protocol Summary Table

This is how you should categorize the protocols:

| Protocol | Service | Extension Required | Key Settings |
| --- | --- | --- | --- |
| s3:// | AWS S3 | httpfs | s3_region, s3_access_key_id |
| s3:// | Google GCS | httpfs | s3_endpoint = 'storage.googleapis.com' |
| s3:// | MinIO / R2 | httpfs | s3_endpoint, s3_url_style |
| az:// | Azure Blob | azure | azure_storage_connection_string |
| https:// | Public Web | httpfs | None (Zero-Config) |


Security: Secrets Manager (DuckDB 0.10+)

https://duckdb.org/docs/stable/configuration/secrets_manager, https://duckdb.org/docs/stable/sql/statements/create_secret

Instead of manually SET-ing keys (which can leak into your logs), DuckDB now has a dedicated Secrets Manager within the extension system:
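A minimal sketch: a session-scoped secret plus a persistent variant that survives restarts (all key values are placeholders):

```sql
-- Temporary secret: disappears when the process exits
CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-key',
    REGION 'eu-west-1'
);

-- Persistent secret: written (unencrypted) under ~/.duckdb/ so it survives restarts
CREATE PERSISTENT SECRET my_s3_persistent (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'your-secret-key',
    REGION 'eu-west-1'
);

-- Inspect what is registered; secret values are redacted
SELECT name, type, scope FROM duckdb_secrets();
```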

Once created, DuckDB automatically applies these credentials whenever you query an s3:// path.


The "Scanner" Pattern (External Databases)

While httpfs allows you to query files, the Scanner extensions (suffix _scanner) allow you to treat entire external databases as if they were local DuckDB schemas. This is a level above just reading a file.

Example: The SQLite Scanner. If you have an old SQLite database from a previous project, you don't need to export it to CSV first.
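A sketch with a hypothetical old_project.db file and its users/orders tables:

```sql
INSTALL sqlite;
LOAD sqlite;

-- Mount the SQLite file as a schema and query it like any other DuckDB catalog
ATTACH 'old_project.db' AS legacy (TYPE sqlite);

SELECT u.name, count(*) AS order_count
FROM legacy.users AS u
JOIN legacy.orders AS o ON o.user_id = u.id
GROUP BY u.name;
```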


The Community Repository

With version 0.10.0, DuckDB introduced a formal Community Extension Repository. These extensions are not hosted by DuckDB officially, but they are accessible directly through the CLI. To use them, you must specify the source.

Example: The Excel Extension. While your list showed excel as true, there is a community-enhanced version for advanced formatting.
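A sketch of the general pattern, with the prql community extension mentioned earlier standing in for whichever community extension you need:

```sql
-- FROM community pulls from community-extensions.duckdb.org instead of the core repository
INSTALL prql FROM community;
LOAD prql;
```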

  • Safety Note: When using community extensions, DuckDB requires you to be explicit. You might need to run SET custom_extension_repository = 'community-extensions.duckdb.org'; depending on your environment settings.


Persistent vs. In-Memory Installations

A common point of confusion for new engineers is where these extensions actually "live" once you run INSTALL.

  • Location: Extensions are usually installed in a hidden folder in your home directory (e.g., ~/.duckdb/extensions/...).

  • Persistence: You only need to INSTALL an extension once per machine/version. However, you (or the autoloader) must LOAD it every time you start a new session.

You can verify the version of the extension to ensure it matches your DuckDB binary:
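A sketch using the duckdb_extensions() table function; the extension_version column assumes a reasonably recent release:

```sql
SELECT extension_name, extension_version, installed, install_path
FROM duckdb_extensions()
WHERE installed;
```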


Full-Text Search (FTS)

Your list showed fts as installed. This is a "Built-in" extension that turns DuckDB into a mini-search engine (similar to Elasticsearch but much simpler).

Example: Indexing a column for search
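A sketch against a hypothetical documents table with doc_id and body columns:

```sql
INSTALL fts;
LOAD fts;

-- Build the index over body, keyed by doc_id; this creates the fts_main_documents schema
PRAGMA create_fts_index('documents', 'doc_id', 'body');

-- Rank matches with BM25; non-matching rows get a NULL score
SELECT doc_id, score
FROM (
    SELECT doc_id,
           fts_main_documents.match_bm25(doc_id, 'duckdb extensions') AS score
    FROM documents
)
WHERE score IS NOT NULL
ORDER BY score DESC;
```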

Pro-Tip: The "Force" Load. If you are working in a restricted environment (like a CI/CD runner or a locked-down server) where DuckDB cannot reach the internet to download extensions, you must pre-install them elsewhere, bundle the .duckdb_extension files found in the install_path shown in the query above, and LOAD them by file path.
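A sketch, assuming the bundled files were copied to a hypothetical /opt/duckdb_extensions directory on the runner:

```sql
-- Load a bundled extension directly from disk, skipping the download step
LOAD '/opt/duckdb_extensions/httpfs.duckdb_extension';
```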

