Data Vault 2.0


Books:

  • Building a Scalable Data Warehouse with Data Vault 2.0

  • The Elephant in the Fridge: Guided Steps to Data Vault Success through Building Business-Centered Models, 2019

Short video about Data Vaultarrow-up-right

Data Vault with dbt Cloudarrow-up-right


Data Vault 2.0 — What It Really Is

Data Vault 2.0 is a methodology and data modeling pattern designed to handle large, rapidly changing, and unpredictable data ecosystems.

It is built for:

  • frequent schema changes

  • messy or inconsistent source systems

  • incremental data arrival (batch or streaming)

  • multi-source integration

  • auditability

  • parallel processing

The guiding philosophy:

Capture everything, change nothing, load fast. Business logic belongs at query time, not ingestion time.

This is the opposite of Kimball (logic in marts) and different from Inmon (logic in normalized EDW).


Why Data Vault?

Traditional warehouse designs break when:

  • sources change keys

  • attributes shift between tables

  • new fields appear

  • relationships change

  • entities merge or split

  • sources contradict each other

  • loads need to be parallelized

Data Vault is designed to survive all of this by splitting structure from attributes, which is the heart of the model.

Architecture of Data Vault 2.0

spinner

The Three Core Table Types

Hubs — the business keys

A hub anchors the model around a meaningful business identifier:

  • CustomerID

  • ProductID

  • OrderID

  • EmployeeID

Hubs contain:

  • Hash key (surrogate PK, calculated via MD5/SHA)

  • Business key (natural key)

  • Load date

  • Record source

Insert-only. Never updated.

They answer:

“What are the core business entities we track over time?”


Links capture associations between hubs:

  • Order ↔ Product

  • Customer ↔ Address

  • Employee ↔ Department

They also use hash keys and load metadata.

Insert-only. Never updated.

They answer:

“How do business keys relate, without embedding logic?”

This makes relationships extremely flexible when the business changes.

spinner

Satellites — the context / attributes

Satellites store the descriptive data about a hub or link:

  • Product name, category, price

  • Order status, channel

  • Customer demographic attributes

Contain:

  • Parent hash key

  • Load date

  • Record source

  • Attribute columns

Insert-only, historized. Old records remain. New attributes appear → simply create a new satellite.

They answer:

“What attributes did this business key have at this time, from this source?”


How You Query a Data Vault

Raw Data Vault is not convenient for analytics.

To get a usable dataset, you build a Business Vault or information mart on top:

  1. Join hub

  2. Join links

  3. Join satellites

  4. Apply business rules

  5. Produce a dimensional model or view

Data Vault stores the raw truth; marts represent business truth.


Comparing Data Vault, Kimball, and Inmon

Inmon

  • Enterprise-wide integrated 3NF warehouse

  • Strong integration rules

  • Metadata-heavy

  • High initial cost

  • Hard to change

Kimball

  • Dimensional model (facts & dimensions)

  • Optimized for BI/analytics

  • Business logic defined early

  • ETL must conform data on load

Data Vault

  • Flexible

  • Parallel-load friendly

  • Schema-change friendly

  • Minimal business logic up front

  • Requires post-processing for usable analytics

Ideal modern pattern: Raw ingestion → Data Vault → Star schemas (marts)


Modern Architecture Fit (Warehouse / Lakehouse)

Data Vault 2.0 works extremely well with:

  • BigQuery

  • Snowflake

  • Databricks

  • Redshift

  • Synapse

DNS (“Do Nothing Stupid”) rule of DV2.0 = Use the platform strengths: micro-batches, streams, ELT.


Worked Example — Orders & Products

We’ll take your scenario:

  • Products

  • Orders

  • Order ↔ Product relationship

Step 1 — Build Hubs

HubProduct HubOrder

Tables contain hash keys, business keys, load metadata.

Example:

HubProduct:

HubOrder:


LinkOrderProduct connects orders to products:


Step 3 — Add Satellites

Attributes about products go into SatelliteProduct:

New attributes (e.g., category, weight) would simply go into another satellite.


How You Turn This Into a Star Schema

To build analytics tables:

Fact table

join LinkOrderProduct → join HubOrder → join HubProduct

Dimensions

ProductDim: join HubProduct with SatelliteProduct OrderDim: join HubOrder with its satellites (if exist)

Final BI schema looks like Kimball, but sourced from Data Vault.


When to Use Data Vault 2.0

Ideal when:

  • Many source systems

  • Constant, unpredictable change

  • Need traceability / auditability

  • Want ingestion speed above all

  • ELT tools and cloud warehouse available

  • Governance and lineage matter

Not ideal when:

  • You need a simple, fast BI warehouse

  • The business is small and stable

  • Modeling overhead needs to be minimal


Last updated