FDW (read replicas)


Data Pipeline Pattern: The Federated Read Replica

This pattern solves a specific, common problem in the "Stage 1" data stack: How do we run dbt and analytics on production data without taking down the application?

It combines two distinct concepts:

  1. Infrastructure: The Read Replica (The safe place where data lives).

  2. Interface: The Foreign Data Wrapper (The bridge: how you access it).


The Core Problem: The "Read-Only Trap"

When setting up analytics on Postgres, you face a dilemma.

  • Option A: Connect to Primary.

    • Risk: Your heavy analytical queries (e.g., SELECT * FROM orders) compete for CPU with your live application. If you spike the CPU, the app crashes.

  • Option B: Connect to a Read Replica.

    • Risk: Standard Physical Replicas are strictly Read-Only. Tools like dbt require a writable environment to create tables, views, and schemas, so dbt run fails as soon as it tries to CREATE anything (see the sketch below).
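
To make that failure concrete, here is a minimal sketch of what happens when a build tool issues DDL directly against a physical replica (the schema and table names are hypothetical):

```sql
-- Attempted directly on a physical Read Replica (hot standby):
CREATE SCHEMA IF NOT EXISTS analytics;
-- ERROR:  cannot execute CREATE SCHEMA in a read-only transaction

CREATE TABLE analytics.orders_daily AS
SELECT order_date, count(*) AS order_count
FROM public.orders
GROUP BY order_date;
-- Fails for the same reason: the standby rejects all writes and DDL.
```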

The Solution: The Federated Architecture

The solution is to decouple the Compute (where dbt runs) from the Storage (where the data lives).

  • The "Analytics DB" (Host): A small, separate, writable Postgres instance. This is where dbt connects. It stores your models, views, and marts.

  • The "Read Replica" (Remote): The read-only copy of production.

  • The Bridge (FDW): You use a Foreign Data Wrapper to mount the Replica's tables inside the Analytics DB.

Why this works

You get the best of both worlds:

  1. Safety: The heavy data lifting happens on the Replica, protecting the Primary.

  2. Flexibility: The Analytics DB is writable, so dbt can build whatever it needs.
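
As a rough sketch of what this looks like in practice (the replica schema and column names here are hypothetical; the commands that mount the foreign tables are in the implementation checklist below), a dbt-built model in the Analytics DB is just a local table built on top of foreign tables:

```sql
-- Runs inside the writable Analytics DB (Host).
-- replica.orders is a foreign table: reading it hits the Read Replica,
-- but the materialized result is stored locally, so the write succeeds.
CREATE SCHEMA IF NOT EXISTS analytics;

CREATE TABLE analytics.orders_daily AS
SELECT
    created_at::date  AS order_date,
    count(*)          AS order_count,
    sum(total_cents)  AS revenue_cents
FROM replica.orders
GROUP BY 1;
```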


Component Deep Dive: The Read Replica

Role: The Infrastructure (Source)

A Read Replica is a physical copy of your primary database. It stays synchronized by continuously replaying the stream of changes (WAL, the write-ahead log) shipped from the primary.

  • Key Characteristic: It is Read-Only. You cannot execute INSERT, UPDATE, or DDL (Create/Alter) commands.

  • The "Lag": Replicas are "Eventually Consistent." Data might be a few milliseconds or seconds behind the primary.

  • Purpose in this Stack: It acts as the "Heavy Lifter." When you query data, the Replica's CPU and Memory are used to scan the tables, keeping your Primary safe.
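
Both characteristics are easy to observe with built-in Postgres functions; the checks below are a minimal sketch you would run on the replica itself:

```sql
-- Run on the Read Replica.

-- Is this instance a standby? Returns true on a replica.
SELECT pg_is_in_recovery() AS is_replica;

-- Approximate replication lag: time since the last replayed commit.
-- (On an idle primary this can look large even though nothing is missing.)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```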


Component Deep Dive: Foreign Data Wrappers (Data Federation)

Role: The Interface (Bridge)

The Foreign Data Wrapper (FDW) is Postgres's implementation of the ISO SQL/MED standard ("Management of External Data"). It allows one database to read and write data in another system as if it were local tables.

While Postgres is the most famous modern user of this name, this is a universal architectural pattern known in the industry as Data Federation, Data Virtualization, or Database Linking.

The "Puppet Master" Pattern

The core idea is Abstraction. You have a "Host" database (The Analytics DB) and a "Remote" source (The Read Replica).

  1. The Illusion: The Host creates a "Virtual Table" (Foreign Table). It looks like a table, has columns like a table, and accepts SQL.

  2. The Translation: When you query it, the Host doesn't scan its own disk. It translates your SQL into the Remote system's language (API call, proprietary protocol, or SQL dialect).

  3. The Network Hop: The Remote system executes the query and sends the raw results back to the Host.
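
As a sketch of step 1, a Foreign Table on the Host is nothing but metadata plus a pointer to the remote (server, schema, and column names here are placeholders; the full setup is in the implementation checklist below):

```sql
-- Defined on the Host (Analytics DB). No rows are copied; this is metadata only.
CREATE FOREIGN TABLE replica.orders (
    id          bigint,
    customer_id bigint,
    total_cents bigint,
    created_at  timestamptz
)
SERVER replica_server
OPTIONS (schema_name 'public', table_name 'orders');

-- Querying it triggers the translation and the network hop described above.
SELECT count(*) FROM replica.orders;
```
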

Universal Landscape

Every major database implements this pattern under different names:

  • Oracle: Database Link. The grandfather of this pattern (SELECT * FROM table@remote). Powerful but tightly coupled.

  • SQL Server: Linked Server. Connects MSSQL to Oracle, Excel, etc. via OLEDB.

  • Snowflake: External Tables. Allows querying files in S3/Azure Blob without loading them ("FDW for Files").

  • BigQuery: Omni. Executes SQL against data in AWS S3 or Azure.

  • Trino / Presto: Query Engine. The "pure" version: a database with no storage of its own that connects to everything (Kafka, S3, Postgres) simultaneously.

Why Postgres FDW is Unique

Postgres made the interface Modular. Unlike Oracle or SQL Server, whose links mostly target other databases, the Postgres community has written FDWs for almost everything:

  • Twitter FDW: Search tweets with SQL.

  • Stripe FDW: Query your payments API as a table.

  • Salesforce FDW: Join CRM data directly with your app data.


Performance Critical: Predicate Pushdown

The success of this pattern relies entirely on Predicate Pushdown. This determines whether your query is fast or a disaster.

  • Bad FDW (Naïve):

    • Query: SELECT * FROM remote_table WHERE id = 5

    • Action: The Host pulls ALL 1 million rows over the network, then filters for id=5 locally.

    • Result: Network Bottleneck & Timeout.

  • Good FDW (Smart):

    • Query: SELECT * FROM remote_table WHERE id = 5

    • Action: The Host sends the specific SQL WHERE id = 5 to the Remote system. The Remote system filters it and sends back 1 row.

    • Result: Fast.

Note: Postgres's postgres_fdw is quite smart about pushdown. It will push WHERE clauses, JOINs (if both tables live on the same remote server), and even some aggregates down to the remote replica.
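
You can verify pushdown yourself with EXPLAIN (VERBOSE): postgres_fdw prints the exact Remote SQL it plans to send. A sketch using the hypothetical foreign table from earlier (plan output abbreviated; exact formatting varies by version):

```sql
EXPLAIN (VERBOSE, COSTS OFF)
SELECT id, total_cents
FROM replica.orders
WHERE id = 5;

-- Foreign Scan on replica.orders
--   Output: id, total_cents
--   Remote SQL: SELECT id, total_cents FROM public.orders WHERE ((id = 5))
```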


Implementation Checklist

To set this up in Postgres, you need 4 specific SQL commands on your Analytics DB (Host):
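
A minimal sketch of those four commands, plus the local schema they import into (host, database, role, and schema names are placeholders for your own environment):

```sql
-- 1. Enable the wrapper (postgres_fdw ships with core Postgres).
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- 2. Register the Read Replica as a remote server.
CREATE SERVER replica_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'replica.internal', port '5432', dbname 'app_production');

-- 3. Map your local analytics role to a read-only role on the replica.
CREATE USER MAPPING FOR analytics_user
  SERVER replica_server
  OPTIONS (user 'replica_readonly', password 'secret');

-- 4. Mount the replica's tables as foreign tables in a local schema.
CREATE SCHEMA IF NOT EXISTS replica;
IMPORT FOREIGN SCHEMA public
  FROM SERVER replica_server
  INTO replica;
```

After the import, the replica's tables appear under the local replica schema and can be queried like any other table.
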

Summary Comparison

  • Category: the Read Replica is Infrastructure (a server); the FDW is an Interface (software).

  • Primary Goal: the Read Replica provides Reliability (handling load); the FDW provides Connectivity (accessing data).

  • Storage: the Read Replica is heavy (a full physical copy); the FDW stores nothing (metadata only).

  • Best Use: the Read Replica for offloading traffic from the Primary; the FDW for bridging a writable DB to a read-only source.
When to use this pattern?

  • Use it when: You need real-time access to fresh production data for dbt models without building a complex ETL pipeline (Airbyte/Fivetran).

  • Avoid it when: You have massive datasets (TB scale). The network hop between the Replica and the FDW Host will eventually become too slow for complex joins.

