Data Virtualization


Data virtualization takes the concept of federated queries further by creating a dedicated abstraction layer that provides unified access to multiple data sources without storing the data itself. Think of it as a "virtual database" that acts as an intermediary between users and their distributed data sources.

How It Differs from Traditional Federated Queries

While federated queries are a feature within databases that also store data, data virtualization systems are purpose-built engines that:

  • Don't maintain their own persistent data storage

  • Specialize in connecting to and querying across heterogeneous sources

  • Act as a unified query interface across your entire data ecosystem

Leading examples include Trino (commercially supported by Starburst), Presto, Dremio, and Denodo.

Query Pushdown: The Performance Secret

One of the most critical techniques in data virtualization is query pushdown. Here's how it works:

Imagine you want to query Snowflake data, join it with a MySQL table, and filter results where region = 'US' and date > '2024-01-01'. Instead of pulling all data into the virtualization layer and then filtering, query pushdown:

  1. Pushes filtering predicates down to the source systems (Snowflake and MySQL execute the filters locally)

  2. Reduces network transfer - only relevant rows cross the network

  3. Leverages source optimization - each database uses its own query optimizer and indexes

  4. Offloads computation - the virtualization layer does less work

This is crucial because network bandwidth is often the primary bottleneck in virtualized architectures.
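The effect of pushdown on network transfer can be sketched in plain Python, with two lists standing in for the Snowflake and MySQL sources from the example above (all table contents are illustrative). The point is the difference in rows shipped, not the query engine itself:

```python
def scan(source, predicate=None):
    """Simulate a source-side scan: if a predicate is pushed down,
    the source filters before anything 'crosses the network'."""
    rows = source if predicate is None else [r for r in source if predicate(r)]
    return rows, len(rows)  # rows plus the transfer cost, in row count

# Hypothetical sample data standing in for the two systems.
snowflake_sales = [
    {"order_id": 1, "region": "US", "date": "2024-03-01", "amount": 120},
    {"order_id": 2, "region": "EU", "date": "2024-02-10", "amount": 80},
    {"order_id": 3, "region": "US", "date": "2023-12-30", "amount": 50},
]
mysql_customers = [
    {"order_id": 1, "name": "Ada"},
    {"order_id": 2, "name": "Grace"},
    {"order_id": 3, "name": "Edsger"},
]

pred = lambda r: r["region"] == "US" and r["date"] > "2024-01-01"

# Without pushdown: every row is shipped, then filtered centrally.
all_rows, naive_cost = scan(snowflake_sales)
filtered_centrally = [r for r in all_rows if pred(r)]

# With pushdown: the predicate runs at the source; only matches are shipped.
pushed_rows, pushdown_cost = scan(snowflake_sales, pred)

# The join still happens in the virtualization layer either way.
by_id = {c["order_id"]: c["name"] for c in mysql_customers}
joined = [{**r, "name": by_id[r["order_id"]]} for r in pushed_rows]

print(naive_cost, pushdown_cost)  # 3 rows shipped vs. 1
```

With only three rows the difference is trivial, but the same ratio applied to a billion-row fact table is exactly why pushdown is the performance secret.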

Key Considerations

Supported Data Sources: The value of a virtualization platform depends heavily on its connectors—can it access your specific databases, cloud storage, APIs, and SaaS applications?

Performance Trade-offs: Because data virtualization queries sources in real-time, performance depends on:

  • Source system speed and load

  • Network latency and bandwidth

  • Effectiveness of query pushdown

  • Join strategies across disparate systems

Common Use Cases (and Anti-patterns)

✅ Good Uses:

Ad-hoc cross-system analytics: When analysts occasionally need to join data across systems without building ETL pipelines

Data ingestion component: Use Trino to query a production MySQL database once daily during low-traffic hours, writing results to S3 for downstream use—protecting the production system from continuous analytics load

Data mesh access layer: Enable teams to query data owned by other teams without data duplication, supporting decentralized data architectures

Expanding the data lake: Virtualize access to data that remains in operational systems while frequently-used transformed data lives in S3
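The daily ingestion pattern above can be sketched as a small helper that builds the Trino SQL for one run: read a single day of orders from a MySQL catalog and insert the results into an S3-backed Hive table. The catalog, schema, and table names here (`mysql.shop.orders`, `hive.analytics.orders_daily`) are illustrative, not a prescribed layout:

```python
from datetime import date

def daily_ingest_sql(run_date: date) -> str:
    """Return a Trino INSERT that copies one day of orders from a
    production MySQL catalog into an S3-backed Hive table.
    Note: the date is interpolated directly for readability; a real
    job would parameterize or validate it."""
    d = run_date.isoformat()
    return (
        "INSERT INTO hive.analytics.orders_daily "
        "SELECT order_id, region, amount, order_date "
        "FROM mysql.shop.orders "
        f"WHERE order_date = DATE '{d}'"
    )

sql = daily_ingest_sql(date(2024, 1, 15))
print(sql)
```

A scheduler (cron, Airflow, etc.) would run this once daily during low-traffic hours, so the production MySQL instance sees exactly one scan per day rather than continuous analytics load.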

❌ Anti-patterns:

Virtualizing production databases for continuous analytics: If you virtualize a production MySQL database and run heavy analytics queries all day, you're still hammering that production system. Virtualization doesn't magically solve the problem—it just moves where the query originates. The load still hits MySQL.

Performance-critical dashboards: Real-time queries across multiple slow sources will frustrate users

Complex joins across incompatible systems: Joining large datasets from systems with very different performance characteristics can be painfully slow

Data Virtualization in Modern Architectures

Data virtualization aligns particularly well with data mesh principles, where:

  • Domain teams own and maintain their data products

  • Data remains distributed across organizational boundaries

  • A virtualization layer provides discovery and access

  • Teams can query other teams' data without copying it into their own systems

This breaks down data silos by creating a logical data fabric over physical data stores, enabling self-service analytics while respecting organizational boundaries and data ownership.

The Bottom Line

Data virtualization is powerful for organizations with data scattered across many systems, but it's not a silver bullet. It works best when:

  • Query frequency is moderate (not continuous high-volume)

  • Source systems can handle the query load

  • Network latency is acceptable

  • You need flexibility more than raw performance

For frequently-accessed data requiring fast queries, traditional data warehousing with materialized data often remains the better choice. The sweet spot is using virtualization strategically—as part of a hybrid architecture that combines materialized storage for hot data with virtualized access to cold or infrequently-queried sources.
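The hybrid approach can be reduced to a routing decision: queries against hot, materialized tables go to the warehouse, and everything else falls through to the virtualization layer. A minimal sketch, with made-up table and backend names:

```python
# Hot data that has been materialized in the warehouse (illustrative names).
MATERIALIZED = {"orders_daily", "revenue_summary"}

def route(table: str) -> str:
    """Pick a backend per table: the warehouse for materialized hot data,
    virtualized access (e.g. Trino) for cold or rarely-queried sources."""
    return "warehouse" if table in MATERIALIZED else "virtualized"

print(route("revenue_summary"))      # warehouse
print(route("legacy_crm_contacts"))  # virtualized
```

In practice this decision usually lives in a semantic layer or a catalog rather than application code, but the trade-off it encodes is the same: raw performance for hot data, flexibility for everything else.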
