Core concepts from the book "Building the Data Warehouse"
Chapter 1: Evolution of Decision Support Systems
Chapter 1 lays the conceptual foundation for why a data warehouse is needed and what makes it fundamentally different from operational systems. The chapter moves from motivation → architectural principles → data characteristics → system components in a very structured way.
1. The Motivation: Operational Systems vs. Decision Support Systems (DSS)
Inmon starts by contrasting two worlds:
Operational Systems
Designed for fast transactions
Highly optimized for day-to-day operations
Data is application-oriented, fragmented across many systems
Not good for strategic analysis
Decision Support Systems (DSS)
Designed for analysis, not transactions
Require integrated, historical, stable data
Users need the whole picture, not just one department’s view
This mismatch creates the fundamental need for a separate environment built specifically for DSS—the data warehouse.
2. The Spider Web Problem
Before introducing architecture, Inmon describes the “spider web”:
Spider Web
When organizations try to build analytical systems directly from many operational sources, they create:
Multiple point-to-point interfaces
Inconsistent logic
Huge maintenance burden
Fragile extraction pipelines
Each new analytical request = new connection → system becomes a tangled web.
This sets the stage for the concept of a central, architected environment.
3. Architected Environment for the Data Warehouse
This is the centerpiece of Chapter 1.
Inmon says: An analytical environment must be architected — intentionally designed — rather than grown organically.
The architected environment has levels (layers), each with a defined role:
Level 1 — Operational / Source Systems
OLTP systems capturing data in real time
Data is raw, application-specific, inconsistent
Level 2 — Integration Layer (The Data Warehouse Itself)
This is where integration happens:
Standardized keys
Uniform naming conventions
Consistent encoding
Data cleaned and validated
Data becomes subject-oriented, integrated, time-variant, non-volatile
This is the foundation of the Corporate Information Factory (CIF)
Level 3 — Delivery Layer (Data Marts / EIS / Analytical Structures)
Specialized subsets of the warehouse optimized for:
Specific departments
Specific analytical use cases
Executive dashboards
OLAP operations
The flow is always one-way: Operational → Warehouse → Marts/Analytical Systems, never backwards.
4. Primitive Data vs. Derived Data
Inside the architected environment, Inmon distinguishes two types of data:
Primitive Data
The original, atomic, unprocessed facts
Comes directly from operational sources
Stored in the data warehouse as the “single source of truth”
Must be preserved for:
Re-auditing
Re-computation
Reconciliation
Derived Data
Computed from primitive data
Used in:
Data marts
Summary tables
EIS dashboards
Because requirements change, derived data can be recalculated as long as primitive data is preserved
Primitive → Derived, never the other direction.
5. Data Integration Rule: Data Cannot Enter the Data Warehouse Unintegrated
Inmon emphasizes this as a law, not a recommendation.
Why?
The warehouse’s purpose is to serve as the enterprise-wide truth. If unintegrated (non-standardized) data is loaded:
Inconsistent keys
Incompatible codes
Conflicting definitions
Analytical chaos
Department disagreements
Slowly creeping loss of trust in warehouse data
Thus:
The warehouse is the place where data becomes integrated — not a place to store raw, incompatible copies.
6. Types of Data Marts
The data marts sit downstream from the warehouse.
Inmon defines two main types:
Dependent Data Marts
Created from the data warehouse
Fully integrated
Preferred in the architected approach
Independent Data Marts
Created directly from operational systems
Not integrated
Lead to:
Inconsistent data
Departmental silos
The very spider-web problem that the warehouse is meant to solve
He strongly favors dependent marts.
7. Extract (E) in ETL
Inmon calls out Extract as the essential first step of the warehouse pipeline.
Extract’s Purpose
Pull data from many operational systems
Handle different structures, formats, and technologies
Prepare the data for cleansing and integration in the warehouse
Extract is where the spider web begins to get untangled:
Instead of many point-to-point connections
A single organized pipeline flows into the warehouse
He distinguishes “extract” from “transform + load,” which happen in the integration step.
8. Corporate Information Factory (CIF)
Inmon introduces the CIF conceptually in Chapter 1.
The CIF is essentially the macro-architecture of an enterprise analytical environment:
Operational Systems
Data Warehouse
Data Marts
Exploration/Mining Tools
Operational Data Stores (ODS)
EIS systems
Metadata
Governance processes
The warehouse is the hub; all other analytical structures are spokes.
This ties earlier concepts (integration, layers, data types, DSS needs) into a complete framework.
9. EIS — Executive Information Systems
These are part of the delivery layer.
Purpose of EIS:
Give executives:
High-level summaries
Dashboards
KPIs
Trend lines
Often fed by:
Summaries derived from warehouse data
Curated dependent data marts
EIS needs stable, integrated, historical data — which reinforces why the warehouse exists.
10. Inmon-Style Monitoring
This is not modern observability; monitoring in this context means that the warehouse tracks:
When data was extracted
From where
How often updates occur
Is data flowing correctly?
What changed between loads?
Success/failure of batch jobs
Amount of data loaded
Monitoring is fundamentally about data lineage, reproducibility, and operational correctness, not about cloud metrics.
This reinforces the warehouse as an architected, controlled, auditable environment.
Chapter 2: The Data Warehouse Environment
Chapter 2 is the technical backbone of Inmon's philosophy. He defines a Data Warehouse as a collection of data with four specific characteristics.
The four characteristics are discussed throughout the chapter, so here is a summary of the core ideas:
Core ideas
1. Subject-Oriented
Inmon’s Definition: Data should be organized by high-level business entities (Customer, Product, Shipment), not by the application that created it (e.g., "The Billing Database" or "The App Logs").
The Old Way: You had a table called BILLING_SYSTEM_USERS and another called WEBSITE_REGISTRATIONS.
The Modern Data Engineering Way:
Domain-Driven Design (Data Mesh): Today, we build "Data Products." You create a Customer 360 table that combines the billing user and the website user into a single logical entity.
The "Conformed Dimension": In tools like dbt, you create a model called dim_customers. It doesn't matter if the data came from Salesforce, Stripe, or a CSV; in the warehouse, it is just "Customer."
Modern Rule: If you are building tables named after source systems (e.g., salesforce_contacts), you are building a staging layer, not a warehouse. The warehouse layer should be named dim_customers.
2. Integrated
Inmon’s Definition: Data must be cleaned, standardized, and reformatted before it enters the warehouse. There must be one "version of the truth."
The Problem: Source A lists gender as "m/f"; Source B uses "1/0"; Source C uses "Male/Female."
The Modern Data Engineering Way:
The Silver Layer (Lakehouse Architecture): In a Databricks/Delta Lake architecture, you have a "Bronze" layer (raw data). The "Integrated" step happens when you move data to "Silver." You force all gender columns to map to a standard ISO code (e.g., ISO 5218).
Schema Enforcement: Modern file formats like Parquet or Avro enforce data types (e.g., ensuring a date is actually a date) so downstream dashboards don't break.
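As a minimal sketch of that integration step, here is what the gender mapping might look like in SQL. The staging table stg_customers and its gender_raw column are hypothetical, and the source codes are illustrative, not from the book:

```sql
-- Hypothetical Silver-layer cleanup: map every source's gender encoding
-- to a single standard (ISO 5218: 0 = not known, 1 = male, 2 = female).
SELECT
    customer_id,
    CASE
        WHEN LOWER(gender_raw) IN ('m', '1', 'male')   THEN 1
        WHEN LOWER(gender_raw) IN ('f', '0', 'female') THEN 2
        ELSE 0
    END AS gender_iso_5218
FROM stg_customers;
```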
3. Time-Variant
Inmon’s Definition: Every record must track time. You are not just storing the current state; you are storing a film strip of history.
The Old Way: Overwriting a record. If a user changes their email, the old email is gone forever.
The Modern Data Engineering Way:
SCD Type 2 (Slowly Changing Dimensions): This is the standard industry term. If a user moves from "New York" to "London," you do not delete "New York." You add a new row:
Row 1: User: John, City: NY, Valid_From: 2021, Valid_To: 2023
Row 2: User: John, City: London, Valid_From: 2023, Valid_To: NULL (Current)
Partitioning: In Amazon S3 or HDFS, you physically organize files by date: s3://bucket/data/year=2023/month=12/day=01/. This is the physical implementation of "Time-Variant."
Table Formats (Iceberg/Delta): These technologies have "Time Travel" features that allow you to query the data exactly as it looked "last Tuesday at 4 PM."
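A minimal SCD Type 2 sketch in SQL, assuming a hypothetical dim_customer table; the column and key names are illustrative only:

```sql
-- Hypothetical SCD Type 2 dimension: history is preserved as rows, never overwritten.
CREATE TABLE dim_customer (
    customer_key INT,           -- surrogate key for this version of the row
    customer_id  INT,           -- natural/business key
    city         VARCHAR(100),
    valid_from   DATE,
    valid_to     DATE           -- NULL means "current version"
);

-- John moves from New York to London: close the old row, insert a new one.
UPDATE dim_customer
SET valid_to = DATE '2023-01-01'
WHERE customer_id = 101 AND valid_to IS NULL;

INSERT INTO dim_customer (customer_key, customer_id, city, valid_from, valid_to)
VALUES (2, 101, 'London', DATE '2023-01-01', NULL);

-- "Where did John live in 2022?" stays answerable because NY was never deleted.
SELECT city
FROM dim_customer
WHERE customer_id = 101
  AND DATE '2022-06-15' >= valid_from
  AND (valid_to IS NULL OR DATE '2022-06-15' < valid_to);
```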
4. Non-Volatile
Inmon’s Definition: Once data enters the warehouse, it should never change (except for correcting errors). You only add new data; you rarely delete or update old data.
The Why: If a data scientist runs a model today, they should get the same result if they run it again next month on the same historical data.
The Modern Data Engineering Way:
Immutable Logs: Modern architecture treats data as an "Append-Only" log.
Write-Once-Read-Many (WORM): In cloud storage (S3/Azure Blob), you often set policies so that once a parquet file is written, it cannot be modified, only read. If data changes, you write a new file that supersedes the old one (this is how Snowflake micro-partitions work under the hood).
To a modern Data Engineer, these aren't just academic terms—they are the blueprint for building reliable Data Lakes and Lakehouses. Here is how they translate to today's stack (Snowflake, Databricks, BigQuery, dbt).
Subject Orientation
We touched on this briefly, but here is the deeper engineering context. Inmon argues that operational systems (Legacy/OLTP) are designed around Processes (shipping a box, issuing an invoice), whereas Data Warehouses must be designed around Nouns (Customer, Product, Shipment).
The Engineering Challenge: You will find that "Customer" data is scattered across 5 different systems (Sales, Support, Marketing, Billing, Website).
The Inmon Rule: You must perform integration logic to map all 5 of these sources into a single Customer_Subject_Area in the warehouse.
Modern Context: This is exactly what Master Data Management (MDM) and modern transformation tools like dbt do. You take raw data (raw_salesforce, raw_stripe) and model them into a "conformed" table (dim_customers).
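A hedged sketch of that conforming step in SQL. The table names riff on the raw_salesforce / raw_stripe examples above, and every column is made up for illustration:

```sql
-- Hypothetical conformed dimension: the sources disappear behind one "Customer" subject.
CREATE VIEW dim_customers AS
SELECT
    LOWER(email)                  AS customer_email,   -- shared business key
    MAX(full_name)                AS customer_name,
    MAX(billing_country)          AS country,
    COUNT(DISTINCT source_system) AS source_count
FROM (
    SELECT email, full_name,
           CAST(NULL AS VARCHAR(100)) AS billing_country,
           'salesforce' AS source_system
    FROM raw_salesforce_contacts
    UNION ALL
    SELECT email, name AS full_name,
           country AS billing_country,
           'stripe' AS source_system
    FROM raw_stripe_customers
) unioned
GROUP BY LOWER(email);
```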
The "Day 1 to Day n" Phenomenon
This is Inmon’s warning about Scalability and Performance degradation.
The Concept:
Day 1: The data warehouse is fresh. There is very little data. Queries run instantly. Everyone is happy.
Day n: Two years later, the warehouse has huge amounts of historical data. The same queries that took seconds now take hours. Users stop using the system because it is too slow.
The Cause: Inmon argues this happens because engineers design for the functionality of the query, but fail to design for the volume of the data.
The Solution: You cannot just "index your way out" of this. You must design the architecture to handle massive scans from the start. This leads directly to the concepts of Granularity and Partitioning.
Granularity
Inmon states that Granularity is "The single most important design issue in the data warehouse environment."
Definition: Granularity refers to the level of detail or summary in the units of data.
High Granularity (Atomic): Every single transaction (e.g., "John bought a coffee at 9:02 AM").
Low Granularity (Summarized): Aggregated data (e.g., "John spent $50 in total today").
The Trade-off:
High Granularity: immense storage cost, slower queries, but maximum flexibility (you can answer any question).
Low Granularity: fast queries, low storage, but low flexibility (you can't answer "what time did John buy coffee?").
Inmon’s Rule (Dual Granularity): Inmon proposes a dual approach. You store atomic (high granularity) data in the deep storage of the warehouse for auditability/flexibility, but you create lightly summarized tables for frequent user queries to solve the "Day 1 to Day n" performance issue.
Modern Context: This is the standard "Bronze/Silver/Gold" pattern.
Silver Layer: Atomic data (High Granularity).
Gold Layer: Aggregated data (Low Granularity) for dashboards.
The "Living Sample" Database
Inmon recognized that scanning the entire warehouse just to test a hypothesis is expensive and slow (especially on Day n).
The Concept: Create a smaller, representative subset of the data (a "Living Sample") for Data Scientists and Analysts to explore.
The Workflow:
The Analyst runs queries on the Sample Database (fast, cheap).
They find a pattern or build a model.
Only then do they run the query against the full Enterprise Warehouse to verify it at scale.
Modern Context:
Cost Control: In BigQuery or Snowflake, scanning 10TB of data costs real money.
Implementation: We use features like TABLESAMPLE SYSTEM (10 PERCENT) or create "Dev/Sandbox" environments with only 1 month of data for developers to work in.
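For example, a hedged sketch of both approaches in SQL (TABLESAMPLE syntax varies slightly by engine, and the fact table name is hypothetical):

```sql
-- Explore a rough 10% sample of the big fact table instead of scanning all of it.
SELECT customer_id, SUM(amount) AS sampled_spend
FROM fact_transactions TABLESAMPLE SYSTEM (10)
GROUP BY customer_id;

-- Alternative "sandbox" approach: materialize one recent month for development.
CREATE TABLE dev_fact_transactions AS
SELECT *
FROM fact_transactions
WHERE transaction_date >= DATE '2023-11-01'
  AND transaction_date <  DATE '2023-12-01';
```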
Partitioning of Data
Partitioning is how you manage the physical storage of data to handle the "Day n" volume. You break one massive table into smaller, manageable physical units.
How to Partition: Inmon suggests partitioning by:
Date: (e.g., 2022 Data vs 2023 Data). This is the most common.
Geography: (e.g., North America data vs. Europe data).
Business Unit: (e.g., Consumer vs. Enterprise).
The Benefits:
Performance: If a user asks for "Sales in May," the database skips January-April (Partition Pruning).
Management: You can archive old partitions (move 2010 data to "Cold Storage") without taking the database offline.
Modern Context:
Hive Partitioning: In Data Lakes (S3/Parquet), you physically see folders: /sales/year=2023/month=05/.
Micro-partitioning: Snowflake does this automatically, but the concept is pure Inmon: break the data into small chunks so you only read what you need.
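A minimal sketch of date partitioning, using BigQuery-style DDL as one concrete example (the dataset, table, and column names are hypothetical; other engines declare partitions with their own syntax):

```sql
-- Declare the partitioning scheme up front so "Day n" scans stay cheap.
CREATE TABLE sales.fact_sales (
    sale_id   INT64,
    sale_date DATE,
    amount    NUMERIC
)
PARTITION BY sale_date;

-- "Sales in May": the engine prunes every partition outside the date range.
SELECT SUM(amount) AS may_sales
FROM sales.fact_sales
WHERE sale_date BETWEEN DATE '2023-05-01' AND DATE '2023-05-31';
```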
Structuring Data in the Data Warehouse
Inmon argues that you cannot just dump data into a table; you must structure it based on how it will be accessed. He defines distinct structural types:
Simple Cumulative: Storing data by day (e.g., Daily Sales). This is the most common atomic level.
Rolling Summary: Storing data by week/month. As new data comes in, you aggregate it up.
Simple Direct: Static data that doesn't change often (e.g., a list of Store locations).
Continuous: Data that represents a span of time (e.g., Employee Contract: Jan 1 to Dec 31).
The Modern Engineering Take:
Schema Design: This maps directly to Fact Tables (Simple Cumulative) vs. Dimension Tables (Simple Direct) vs. Aggregated Materialized Views (Rolling Summary).
Optimization: You structure data differently for "Write" performance (Log files) vs. "Read" performance (Columnar formats like Parquet/ORC).
Auditing and the Data Warehouse
Inmon warns against the impulse to "audit everything."
The Trap: Engineers often try to log every single user query (SELECT *) to see who is doing what.
Inmon’s Rule: Do not audit reads. Auditing every access slows the database down significantly and generates massive logs that nobody looks at.
What to Audit: Only audit activity that changes data (ETL jobs, schema changes, updates).
Modern Context:
In Snowflake/BigQuery, the platform automatically logs query history (metadata), so you don't have to build it yourself.
Data Lineage: We now focus on auditing the pipeline—if the numbers are wrong, we need to know which script ran at 2:00 AM, not which user queried the table at 4:00 PM.
Data Homogeneity and Heterogeneity
This describes the core function of ETL (Extract, Transform, Load).
Heterogeneity (The Source): Source systems are messy. One app uses "ID"; another uses "Social Security Number." One uses "inches"; another uses "centimeters."
Homogeneity (The Warehouse): The Warehouse must be homogeneous.
The Engineering Mandate: You are not a "Data Mover"; you are a "Data Unifier." If you load data from two sources into the warehouse and they still have different formatting, you have failed. You have simply moved the mess from one place to another.
Purging Warehouse Data
Inmon points out that "Non-Volatile" (never deleting) does not mean "Keep Forever."
The Problem: Infinite data growth leads to infinite cost and performance degradation.
The Solution: You need a Purging Policy (or Archiving Policy) defined before you build the table.
Modern Context (GDPR/CCPA):
Today, purging is often a legal requirement, not just a storage issue. The "Right to be Forgotten" means you must have a mechanism to purge specific user data.
Lifecycle Rules: In AWS S3 or Azure Blob, we set "Lifecycle Policies" to automatically move data >5 years old to "Cold Storage" (Glacier) to save money.
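As one hedged sketch of a purge/archive step in SQL (Hive-style partition DDL on the left, a plain retention delete for engines without explicit partitions; the table, partition column, and retention window are hypothetical):

```sql
-- Drop a whole partition of expired data instead of deleting row by row.
ALTER TABLE sales_history DROP IF EXISTS PARTITION (year = 2010);

-- Or, on engines without explicit partitions, delete by the retention cutoff.
DELETE FROM sales_history
WHERE sale_date < DATE '2015-01-01';
```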
Reporting and the Architected Environment
Inmon draws a hard line between the Data Warehouse and Reporting.
The Concept: The Data Warehouse is for storing data. It is NOT for formatting reports.
The Mistake: Writing SQL queries in the warehouse that format dates as strings ('Dec-01-2023') or add visual formatting.
The Rule: The Warehouse should provide raw numbers. The Reporting Tool (PowerBI, Tableau) handles the colors, fonts, and formatting.
Why? If you format data in the warehouse, you lock it into that format. If a user wants to change the date format, they have to ask the Data Engineer to change the code.
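A small hedged illustration of the difference (TO_CHAR and GROUP BY 1 syntax vary by engine; fact_orders is a made-up table):

```sql
-- Anti-pattern: the warehouse locks the date into one display string.
SELECT TO_CHAR(order_date, 'Mon-DD-YYYY') AS order_date, SUM(amount) AS revenue
FROM fact_orders
GROUP BY 1;

-- Preferred: expose the raw DATE and numeric value; let Tableau/Power BI format them.
SELECT order_date, SUM(amount) AS revenue
FROM fact_orders
GROUP BY order_date;
```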
The Operational Window of Opportunity
This concept drives the need for Streaming Data.
The Definition: The "Window" is the time slot available to run your batch ETL jobs (usually at night) when the business is sleeping.
The Problem:
Data volumes grow (jobs take longer).
Businesses go global (no more "night time").
The Window closes.
Modern Context:
When the "Window" hits zero, you must move from Batch Processing (Airflow jobs running once a day) to Micro-Batching (Spark Streaming) or Real-Time Streaming (Kafka/Flink). This is a major inflection point in a Data Engineer's career.
Incorrect Data in the Data Warehouse
What happens when you find a mistake in the Warehouse 6 months later?
The Constraint: You cannot just run an UPDATE statement to fix the number, because that violates the "Non-Volatile" principle and destroys the audit trail.
Inmon’s Solution: You must issue an Adjustment Transaction.
Bad Way: Change $100 to $120.
Inmon Way: Keep the original $100 record. Insert a new record of +$20 with a reason code "Correction."
Summary for your mental model from Chapter 2

| Concept | The "Day 1" Engineering Mistake | The "Inmon" Professional Approach |
| --- | --- | --- |
| Structure | Dumping raw JSON logs into a table. | Structuring data into Fact/Dimension or Cumulative/Rolling tables. |
| Auditing | "Log every query just in case." | "Log only changes and schema updates." |
| Purging | "We'll figure out deletion later." | "Define the lifecycle policy (S3/Glacier) on Day 1." |
| Reporting | Formatting strings/dates in SQL. | Sending raw data; let Tableau handle the formatting. |
| Window | Running 10-hour jobs that crash the server. | Moving to Streaming or Micro-batching when the window closes. |
| Errors | UPDATE sales SET amount = 120 | INSERT INTO sales (amount, type) VALUES (20, 'adjustment') |
Chapter 3: The Data Warehouse and Design
This is the chapter where Inmon stops talking about "Philosophy" and starts teaching you "How to Build It."
Chapter 3 is the battleground where the "Inmon vs. Kimball" debate is most visible. Inmon argues for a Corporate Data Model (Normalized/3NF) before you ever build a dashboard.
Beginning with Operational Data
The Concept: You start with the operational systems (Legacy/OLTP). However, Inmon warns that operational data is application-centric, not data-centric. It is optimized to make a specific transaction run fast (e.g., "Checkout Cart"), not to make the data understandable.
The Engineering Task: You cannot simply "lift and shift" operational tables into the warehouse. You must strip away the application-specific logic.
Modern Context:
Raw vs. Trusted: This is the move from the "Bronze" layer (Raw Copy) to the "Silver" layer (Cleaned).
The Trap: If you copy a table called tbl_users_v2_final_backup into your warehouse, you have failed. You must rename and remodel it to dim_customers.
Process vs. Data Models
The Concept:
Operational Systems are defined by Process Models (Functional requirements: "We need a function to calculate tax").
Data Warehouses are defined by Data Models (Enterprise definitions: "What is tax? Is it a separate entity or an attribute of a sale?").
The Inmon Rule: The Data Warehouse does not care about how data is processed; it cares about how data is related.
Modern Context: As a Data Engineer, stop thinking like a backend developer (Loops/Functions) and start thinking like an Architect (Entities/Relationships).
The Data Warehouse and Data Models (The Hierarchy)
Inmon breaks modeling into three distinct phases. In an interview, knowing this hierarchy shows you understand the lifecycle of design.
A. The High-Level Data Model (ERD)
What it is: A pencil-and-paper drawing. No keys, no types, just boxes and lines.
The Goal: Identify the major "Subjects" (Customer, Product, Order) and how they relate.
Modern Context: This is the "Whiteboarding" session you have with stakeholders before writing any SQL.
B. The Midlevel Data Model (Logical)
What it is: You add details.
Keys: Primary Keys (PK) and Foreign Keys (FK).
Attributes: (e.g., Customer Name, DOB).
Normalization: Inmon insists the Warehouse layer must be in 3rd Normal Form (3NF) to remove redundancy.
Why 3NF? If a customer address is stored in 50 different order records (denormalized), and they move, you have to update 50 rows. If it's normalized (linked by ID), you update 1 row.
C. The Physical Data Model
What it is: The actual DDL (Create Table statements).
The Engineering Decisions: This is where you decide on data types (VARCHAR vs TEXT), partitioning strategies, and indexing.
Modern Context: This is your dbt schema.yml file or your Terraform configuration.
Normalization and Denormalization
This is the most controversial part of the book for modern engineers.
Inmon’s Stance:
The Warehouse (EDW): MUST be Normalized (3NF). This ensures data integrity and consistency.
The Data Marts: Can be Denormalized (Star Schema) for performance.
Why this is hard: 3NF requires complex joins. Queries are harder to write.
The "Snapshot" in the Data Warehouse:
Since you are normalizing, how do you handle history?
The Snapshot: You create a picture of the data at a specific moment in time.
Engineering Example: Instead of just a current_balance column, you create a snapshot table daily_account_balances where every day you insert a new row for every account (a minimal sketch follows below).
Key Takeaway: Inmon uses Normalization to preserve truth, and Denormalization to optimize delivery.
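A hedged sketch of the snapshot table mentioned above; the table and column names are illustrative and the nightly load is assumed, not prescribed by the book:

```sql
-- Periodic snapshot: one row per account per day, appended by the nightly job.
CREATE TABLE daily_account_balances (
    snapshot_date DATE,
    account_id    INT,
    balance       DECIMAL(18, 2)
);

-- Nightly load: capture today's state of every account, even if nothing changed.
INSERT INTO daily_account_balances (snapshot_date, account_id, balance)
SELECT CURRENT_DATE, account_id, current_balance
FROM accounts;
```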
Metadata (The "GPS")
Inmon correctly predicted that without metadata, a warehouse becomes a "Data Swamp."
Two Types of Metadata:
Technical Metadata: Database structure, data types, transformation logic (ETL mappings).
Business Metadata: Definitions (e.g., "What is the definition of 'Gross Profit'?").
Managing Reference Tables: Small tables (Country Codes, Currencies, Status IDs) are metadata too. They must be managed centrally, or your joins will break.
Modern Context:
Data Catalogs: Tools like Alation, Atlan, or DataHub.
dbt Docs: The modern standard for documenting what your SQL models actually do.
The Data Model and Iterative Development
Inmon attacks the "Waterfall" method (trying to build the whole warehouse at once).
The Concept: You cannot model the entire enterprise in one go. You will fail.
The Spiral Approach:
Build the data model for ONE subject (e.g., Sales).
Implement it.
Get feedback.
Move to the next subject (e.g., HR) and integrate it with Sales.
Modern Context: This is effectively Agile Data Engineering. We build "Vertical Slices" or "Data Products" rather than a monolithic warehouse all at once.
Summary Checklist so far
If you are asked about Inmon's Design principles, here is the script:
Don't Clone: Operational data is for apps; Warehouse data is for business entities.
Normalize the Core: The central warehouse should be 3NF (Normalized) to ensure a single source of truth.
Denormalize the Edge: Data Marts can be denormalized for speed.
Snapshots: Use snapshots to capture historical states of changing data.
Iterate: Build subject by subject, not all at once.
You are now looking at the latter half of Chapter 3, and this is where Inmon transitions from "How to store data" to "How to use data."
This section is surprisingly forward-looking. He discusses concepts that (decades later) became hot trends like Reverse ETL and Feature Stores.
Cyclicity of Data — The Wrinkle of Time
The Concept: Inmon warns that data doesn't just grow linearly; it has cycles. Business data is often driven by the Fiscal Year, not just the Calendar Year.
The Engineering Challenge: If you only store 2023-12-09, you might fail to report Q4 earnings correctly if the company’s fiscal year ends in January.
Modern Context:
The Date Dimension: This is why every Data Warehouse has a massive dim_date table. It maps a standard date (2023-12-09) to business attributes (Fiscal_Quarter_4, Holiday_Season_Flag, Week_49).
Partitioning: You often have to decide whether to partition storage by "Event Time" (when it happened) or "Processing Time" (when you received it).
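A minimal dim_date sketch following the attributes above; the column names and the fiscal-year logic are hypothetical examples:

```sql
-- The date dimension maps calendar dates to business/fiscal attributes.
CREATE TABLE dim_date (
    date_key            DATE,
    calendar_year       INT,
    week_of_year        INT,
    fiscal_year         INT,     -- e.g., for a fiscal year ending in January
    fiscal_quarter      INT,
    holiday_season_flag BOOLEAN
);

-- Facts join to dim_date so "Q4 earnings" means fiscal Q4, not calendar Q4.
SELECT d.fiscal_year, d.fiscal_quarter, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.sale_date = d.date_key
GROUP BY d.fiscal_year, d.fiscal_quarter;
```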
Complexity of Transformation and Integration
The Concept: Inmon emphasizes that Integration is the most expensive part of the warehouse. It is not just copying files.
The Complexity: It involves:
Key Translation: Mapping user_id_123 (App A) to cust_id_abc (App B).
Encoding Conversion: ASCII to Unicode (or UTF-8).
Unit Conversion: Metric vs. Imperial.
Modern Context:
ELT vs. ETL: Inmon lived in an ETL world (Transform before Load). Today, we use ELT (Load raw data first, then Transform in the warehouse using dbt/SQL).
The Warning: Just because tools like Fivetran make "Loading" easy, doesn't mean the "Integration" logic disappears. It just moves to the SQL layer.
Triggering the Data Warehouse Record
Inmon distinguishes between two ways data enters the warehouse. This determines your table structure.
A. Events (Discrete)
What it is: A specific action happens at a specific moment.
Example: "Customer clicked 'Buy'."
Modern Term: Event Stream / Fact Table.
Engineering Rule: These tables grow indefinitely. They are immutable (you can't "un-click" a button).
B. Components of the Snapshot
What it is: The state of an object at a specific point in time.
Example: "The Account Balance at 5:00 PM."
Modern Term: Periodic Snapshot Fact Table.
Engineering Rule: Even if no transactions happened today, you still might record a snapshot to show the balance remained $100.
Profile Records
This is a very specific Inmon concept that predicts modern Machine Learning Data Engineering.
The Concept: Scanning 10 million transactions to see if a customer is "high value" is too slow. Instead, you create a Profile Record.
How it works: You create a table that aggregates the customer's behavior.
Customer ID: 101
Total Spend: $5000
Favorite Category: Shoes
Last Visit: Yesterday
Modern Context:
Feature Stores: In modern ML Ops (using tools like Feast or Databricks Feature Store), this is exactly what we build. We pre-calculate features so the AI model doesn't have to compute them in real-time.
Customer 360: This is the "Gold Layer" table used by Marketing teams.
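A hedged sketch of building such a profile record as an aggregated table (the table and column names are hypothetical, and "favorite category" is left out to keep the example short):

```sql
-- Pre-computed profile: one row per customer, refreshed by a scheduled job.
CREATE TABLE customer_profile AS
SELECT
    customer_id,
    SUM(amount)           AS total_spend,
    MAX(transaction_date) AS last_visit,
    COUNT(*)              AS lifetime_transactions
FROM fact_transactions
GROUP BY customer_id;
```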
Managing Volume & Multiple Profile Records
The Problem: Profiles change. A customer who liked "Shoes" in 2020 might like "Electronics" in 2025.
The Solution: You don't just keep one profile; you keep a history of profiles (Rolling snapshots).
Modern Context: This leads to SCD Type 2 on aggregated tables.
Going from the Data Warehouse to the Operational Environment
STOP AND PAY ATTENTION HERE. Inmon is describing Reverse ETL—a concept that only became "mainstream" in the 2020s (with tools like Hightouch and Census), yet he wrote about it decades ago.
The Concept: Usually, data flows Ops -> Warehouse. But sometimes, the Warehouse has intelligence that the Operational system needs.
Direct vs. Indirect Access
Direct Operational Access (The "Don't Do It" Rule):
Scenario: The Checkout App queries the Data Warehouse directly to see if the user is eligible for a discount.
Why it fails: The Warehouse is designed for heavy analysis, not sub-second latency. If the Warehouse is slow, the Checkout App crashes.
Indirect Access (The "Reverse ETL" Pattern):
Scenario: The Warehouse calculates the discount eligibility overnight and pushes a flag back to the Checkout App's database.
Why it works: The Operational App is just reading its own local database (fast), but the intelligence came from the Warehouse.
The Examples (And why they matter)
Airline Commission Calculation:
Problem: Calculating travel agent commissions is insanely complex (based on miles, routes, special deals). The Transaction system is too busy selling tickets to do this math.
Inmon’s Solution: Move the raw data to the DW. Run the complex math there. Send the final check amount back to the Finance App.
Retail Personalization:
Problem: A store wants to print a coupon at the register based on your last 5 years of purchases.
Inmon’s Solution: The register (POS) cannot scan 5 years of history. The DW does the scanning, creates a "Coupon Code" for that customer, and sends just the code back to the POS system.
Credit Scoring:
Problem: Assessing risk involves checking 10 different systems.
Inmon’s Solution: The DW integrates the data, runs the risk model, and sends the "Credit Score" back to the Loan Application.
Based on the information we've learned so far, Data Activation and Reverse ETL can be referenced using Inmon:
"Inmon teaches that the Data Warehouse isn't just a grave for data. It's an engine for calculation. We use Profile Records (Feature Engineering) to summarize behavior, and we use Indirect Access (Reverse ETL) to push that intelligence back into operational systems like CRMs or ERPs without slowing them down."
The next section wraps up the design phase by addressing three specific architectural "edges": The Star Schema, the ODS, and the Zachman Framework.
1. Star Joins
Inmon’s Stance: This is the most famous point of conflict in data warehousing.
The Concept: A "Star Join" (or Star Schema) joins a central Fact Table to multiple Dimension Tables. It is optimized for reading data fast.
Inmon's Rule: Star Joins are great for Data Marts (the serving layer) but terrible for the Enterprise Warehouse (the storage layer).
Why? Star schemas are rigid. If you design a star schema around "Sales," it is hard to answer questions about "Inventory" later without redesigning the whole thing.
Modern Context:
The Hybrid Model: In modern Data Lakes (Databricks/Snowflake), we follow Inmon here. We keep the core data in normalized or wide tables ("Silver Layer") and only build Star Schemas ("Gold Layer") for PowerBI/Tableau to consume.
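A hedged sketch of a star join in the Gold/mart layer; the fact and dimension names are generic examples, not from the book:

```sql
-- Star join: one central fact table surrounded by descriptive dimensions.
SELECT
    d.calendar_year,
    p.product_category,
    c.country,
    SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date     d ON f.date_key     = d.date_key
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.calendar_year, p.product_category, c.country;
```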
2. Supporting the ODS (Operational Data Store)
The Concept: The ODS is a hybrid beast. It is Integrated (clean data) like a warehouse, but Volatile (current value only) like a database.
Use Case: A call center operator needs to see a customer’s current status across all accounts instantly. They don't need 5 years of history; they need right now.
The Flow: Sources -> ETL -> ODS -> Warehouse.
Modern Context:
Real-Time Analytics: Today, we often implement the ODS using fast NoSQL databases (Redis, DynamoDB) or high-speed SQL engines (Postgres) to serve apps, while the Warehouse (Snowflake) handles the history.
3. Requirements and the Zachman Framework
The Concept: This is "Enterprise Architecture" 101. John Zachman created a 6x6 grid to classify every artifact in IT.
The Grid: It asks: Who, What, Where, When, Why, How? across different levels (Planner, Owner, Designer, Builder).
Why it matters to Inmon: He argues that a Data Warehouse is the only way to satisfy the "Column 1 (Data)" requirements across the whole enterprise.
Modern Context: You likely won't use the full Zachman grid today, but the principle of "Separation of Concerns" remains vital. Don't mix business logic (Why) with database DDL (How).
Chapter 4: Granularity in the Data Warehouse
This is the most "Physics" based chapter. It deals with space, time, and volume. Inmon argues that Granularity (the level of detail) is the single most important design decision because it dictates storage costs and query speed.
1. Raw Estimates & Input to the Planning Process
The Problem: Before you build, you must estimate the size.
The Math: Rows per day × Row size (bytes) × History needed (days) = Total Storage.
Inmon's Warning: If the estimate is 100TB, you cannot query it all at once. You must summarize it.
Modern Context:
Cloud Costs: In the cloud, storage is cheap, but compute is expensive. If you store too much granular data, your SELECT * query might cost $50 instead of $0.50. Estimating volume helps you set budgets.
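As a hedged back-of-the-envelope example of that math (the volumes are made up; a FROM-less SELECT works in most cloud warehouses):

```sql
-- 1,000,000 rows/day x 200 bytes/row x 5 years of history ~= 0.33 TB of raw storage.
SELECT
    1000000                                       AS rows_per_day,
    200                                           AS bytes_per_row,
    365 * 5                                       AS days_of_history,
    1000000.0 * 200 * 365 * 5 / POWER(1024.0, 4)  AS approx_terabytes;
```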
2. Data in Overflow & Overflow Storage
The Concept: What do you do with data that is old and rarely accessed, but you legally can't delete it?
Inmon's Solution: Move it to "Overflow Storage"—cheaper, slower disk.
Modern Context (Tiered Storage): This is standard practice in AWS/Azure.
Hot: NVMe SSDs (Current year data).
Warm: Standard S3 (Last 3 years).
Cold/Overflow: S3 Glacier / Deep Archive (Older than 5 years).
Engineering Note: You must build "Lifecycle Policies" to move this data automatically.
3. What the Levels of Granularity Will Be
The "Dual Granularity" Pattern: Inmon suggests you need TWO levels of granularity in the warehouse to survive.
True Archival (Atomic): Every single transaction. Stored on cheap/overflow storage. Accessed rarely.
Lightly Summarized (Aggregated): Summarized by day/customer. Stored on fast disk. Accessed 95% of the time.
Why? This solves the "Day n" performance problem.
Modern Context: This is the Materialized View. You store the raw logs (Atomic) but serve a view that pre-calculates daily totals (Summarized) so dashboards load instantly.
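A hedged sketch of that lightly summarized layer (materialized-view syntax and restrictions differ between Snowflake, BigQuery, and Postgres; the table and column names are illustrative):

```sql
-- Atomic data stays in fact_transactions; dashboards hit the pre-aggregated view.
CREATE MATERIALIZED VIEW daily_customer_totals AS
SELECT
    customer_id,
    CAST(transaction_ts AS DATE) AS transaction_date,
    SUM(amount)                  AS daily_total
FROM fact_transactions
GROUP BY customer_id, CAST(transaction_ts AS DATE);
```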
4. Levels of Granularity — Banking Environment (Example)
The Example: In a bank, you have millions of ATM transactions (withdraw $20, withdraw $40).
Bad Design: Storing every ATM transaction in the primary query table.
Good Design (Inmon):
Table A (Summarized): "John's Daily Total Withdrawals: $60" (Fast to query).
Table B (Atomic): The individual logs of the $20 and $40 transactions (Kept in deep storage for fraud audit only).
5. Feeding the Data Marts
The Concept: The Data Marts (Marketing, Sales) should not calculate their own metrics from raw data. They should pull from the Summarized layer of the warehouse.
Why? Consistency. If Marketing calculates "Monthly Sales" from raw data, and Finance calculates it separately, they will get different numbers. If they both pull from the Warehouse's "Monthly Summary," the numbers match.
Modern Context: This is the concept of "Publishing Data Products." The Data Engineering team publishes a clean, aggregated table for the analysts to use.
When it comes to Scaling and Storage, use Inmon’s logic:
Estimate First: Do the math on row counts before creating the table.
Dual Granularity: Always keep the raw atomic data (for audit), but build a summarized layer for performance.
Overflow Strategy: Have a plan for moving old data to cold storage (Glacier) so it doesn't clog the active system.
Star Schemas: Use them at the end (Data Marts), not the beginning (EDW).
Chapter 6: The Distributed Data Warehouse
It is arguably the most visionary chapter in the book. In the 90s, most companies tried to put everything on one giant mainframe. Inmon saw that eventually, data would become too big and complex for one box, requiring it to be distributed across different locations and technologies.
To a modern Data Engineer, this is the blueprint for "Data Mesh" and Multi-Cloud Architectures.
Types of Distributed Data Warehouses
Inmon starts by categorizing why a warehouse would be distributed. It’s not always for the same reason.
Business Distribution: Different divisions (e.g., "Insurance Division" vs. "Banking Division") run independently.
Geographical Distribution: One office is in New York, another in London.
Technological Distribution: Different platforms (e.g., some data on a Mainframe, some on a fast Unix server).
Modern Context: Today, we call this Multi-Cloud (using AWS and Azure together) or Hybrid Cloud (On-premise + Cloud). The challenge is exactly what Inmon describes: how do you query across them?
Local vs. Global Data Warehouses (The "Data Mesh" Core)
This is the heart of the chapter. Inmon argues that you cannot force every piece of data into a single global schema. You need a two-tier approach.
A. The Global Data Warehouse
Scope: The entire enterprise.
Content: Common data used by everyone (e.g., Customer ID, Total Revenue, Product List).
Goal: Consistency. If the CEO runs a report, it comes from here.
B. The Local Data Warehouse
Scope: A specific site, region, or department.
Content: Data only relevant to that site.
Example: The "France" office needs to track specific French labor laws. The "USA" office doesn't care about that.
Goal: Autonomy and Speed. The French team shouldn't have to ask HQ permission to add a column for a local regulation.
C. The Intersection (Mapping)
The Concept: Some data exists in both places.
The Inmon Rule: The "Global" definition wins.
Local (France): Currency = Euros.
Global (HQ): Currency = USD.
Intersection: The Data Engineering pipeline must convert the Local Euro value to the Global USD value before it enters the Global Warehouse.
Redundancy
Inmon addresses the elephant in the room: "Wait, aren't we storing the same data twice?"
The Answer: Yes, and that is a good thing if controlled.
Controlled Redundancy: You store "Sales" in the Local DW for fast operational reporting in that region. You copy it to the Global DW for enterprise aggregation.
The Risk: If you update the Local version but the Global version doesn't get the update, you have Data Drift.
Modern Context: This is Data Replication or Caching. We frequently replicate data from a regional Snowflake account to a central "Headquarters" account. Managing the lag between them is a key engineering task.
The Technologically Distributed Data Warehouse
The Concept: Sometimes, distribution isn't a choice; it's forced by hardware limits. You physically cannot fit the data on one machine.
Inmon's Insight: You place the data where the processing power is needed.
Modern Context (Edge Computing): Today, this applies to IoT (Internet of Things). You might have a "Warehouse" on a factory floor collecting sensor data (Local), and only send the daily summaries to the Cloud (Global) because the bandwidth is too expensive to send every millisecond of data.
The Independently Evolving Distributed Data Warehouse
The Scenario: Two parts of the company build their own warehouses without talking to each other.
The Result: "stovepipe" or "siloed" systems. They might use the same terms ("Customer") to mean different things.
Inmon’s Warning: This is the most dangerous state. It usually happens after Mergers & Acquisitions.
The Fix: You don't delete them. You build a "Translation Layer" (middleware) to map them to a common standard over time.
Completely Unrelated Warehouses
The Concept: Sometimes, warehouses should be unrelated.
Example: If a conglomerate owns a "Cookie Company" and a "Steel Manufacturing Company," there is zero overlap in their data.
Engineering Decision: Don't try to integrate them just for the sake of it. Let them be separate.
Summary Checklist so far
If you are discussing System Topology or Data Mesh in an interview:
Don't Centralize Everything: Inmon teaches that a "Global" warehouse should only contain shared, enterprise-wide data.
Respect Locality: "Local" warehouses allow regions/departments to move fast without breaking the global model.
Manage Redundancy: Storing data in two places (Edge and Cloud) is acceptable if you have a synchronization process.
The Intersection: The hardest engineering work is at the "Intersection"—mapping the local definition to the global definition.
You have successfully reached the final stretch of the book's technical core. The next subchapters cover the complex human and technical coordination required when a Data Warehouse is split across different teams, locations, and platforms.
To a modern Data Engineer, these sections are the "Governance" and "Team Topology" chapters. Inmon explains how to prevent a distributed system from becoming a chaotic mess.
Distributed Data Warehouse Development
The Concept: When you have multiple teams building different parts of the warehouse (e.g., Team A in London, Team B in New York), you cannot just let them code in isolation. If you do, their data will never integrate.
Coordinating Development:
Inmon’s Rule: You need a central coordination body. Development must be synchronized so that when Team London changes the ProductID format, Team New York knows about it.
Modern Context: This is the role of the Data Center of Excellence (CoE) or the Platform Team. They set the standards (naming conventions, tools) that all distributed teams must follow.
The Corporate Data Model — Distributed
The Concept: Even if the data is distributed physically, the definition of the data must be centralized logically.
The Strategy: You build a "Corporate Data Model" that acts as the constitution. Every local team can add their own local columns, but they must use the Corporate definition for core entities like "Customer" or "Revenue".
Modern Context (Data Contracts): Today, we use Data Contracts. Before a team publishes a table, they sign a "contract" (often a JSON/YAML schema) guaranteeing the structure will not change unexpectedly.
Metadata in the Distributed Warehouse
The Concept: In a single database, you can just query information_schema to see what tables exist. In a distributed environment, you are blind. You don't know what tables exist on the server in the other country.
Inmon’s Solution: You must actively replicate metadata. A central repository must "know" where everything is.
Modern Context: This is the primary use case for Data Catalogs (e.g., Alation, Collibra, DataHub). They crawl all your distributed environments (Snowflake, AWS, On-Prem) and show you a single map of your data estate.
Building the Warehouse on Multiple Levels
This section addresses the complexity when different organizational levels (e.g., "The Holding Company" vs. "The Subsidiary") both need to build warehouses.
Multiple Groups Building the Current Level of Detail
The Problem: Sometimes, different groups need to capture the same detailed data but for different reasons.
Example: The "Shipping Dept" logs a shipment to track the truck. The "Finance Dept" logs the same shipment to recognize revenue.
The Risk: You end up with two "Shipment" tables that don't match.
Inmon’s Solution: You must identify the "System of Record." Ideally, one group captures it, and the other group consumes it. If both must capture it, they must reconcile continuously.
Modern Context: This is Data Mesh thinking. We define Domain Ownership. We say, "The Shipping Domain owns the Shipment table. Finance is just a consumer."
Different Requirements at Different Levels
The Concept: The "Corporate" level usually needs aggregated data (monthly totals) for strategy. The "Local" level needs granular data (individual receipts) for operations.
Engineering Impact: You don't force the Local team to use the Corporate data model, because it's too high-level. You allow them to build their own granular models, provided they can roll up to the Corporate aggregate.
Multiple Platforms for Common Detail Data
The Concept: Inmon acknowledges that you might use different technologies for the same data depending on its age or usage.
Example: Current detailed data lives on a high-speed SQL server. 5-year-old detailed data lives on a slower, cheaper tape drive.
The Engineering Challenge: The user shouldn't have to know where the data is.
Modern Context (Polyglot Persistence):
Hot Data: Snowflake/Databricks (Fast, Expensive).
Cold Data: S3 Glacier (Slow, Cheap).
The "Lakehouse": Tools like Trino (Presto) allow you to write one SQL query that joins data from the fast database and the slow archive transparently.
Summary of the Final Section
If you are asked about Governance in a Distributed Environment:
"Inmon teaches that while we can distribute the storage and compute of data (to local teams or regions), we cannot distribute the definition of data. We need a 'Corporate Data Model' (Data Contract) and a central 'Metadata Repository' (Data Catalog) to ensure that valid, distributed autonomy doesn't turn into anarchy."
Conclusion of the Guide
You have now successfully navigated the core technical chapters of Building the Data Warehouse:
Architecture: You understand the separation of Operational vs. Informational data.
Modeling: You know why 3NF is used for the warehouse and Star Schema for the marts.
Physics: You know how to estimate size, handle granularity, and manage partitions.
Topology: You understand how to manage distributed data across different locations and teams.
You have now completed the "Fast Track" reading plan for Building the Data Warehouse:
Ch 1: The "Why" (Operational vs. Informational).
Ch 2: The "What" (Subject-Oriented, Integrated, Time-Variant, Non-Volatile).
Ch 3: The "Design" (Normalized EDW vs. Denormalized Marts).
Ch 4: The "Physics" (Granularity and Partitioning).
Ch 6: The "Topology" (Distributed/Global vs. Local).
The "Big Bang" vs. "Phased" Migration Strategy
In Inmon’s view, this is not just a project management choice; it is a survival strategy.
In traditional software development (like building an iPhone app), you use the SDLC (System Development Life Cycle): Requirements -> Design -> Code -> Test -> Deploy. You ask the user what they want first, and then you build it.
Inmon argues that for Data Warehousing, this approach is fatal.
1. The "Big Bang" Failure Mode
The Approach: "Let's gather all requirements from Marketing, Sales, and Finance, design the perfect enterprise model, and launch the warehouse in 2 years."
Why Inmon Hates It:
The Unknown: Users don't know what questions they want to ask until they see the data. If you ask for requirements upfront, they will guess, and they will be wrong.
The Moving Target: By the time you finish the 2-year build, the business has changed. The warehouse launches and is immediately obsolete.
2. The "Phased" (Data-Driven) Approach
Inmon proposes reversing the SDLC into what he calls the CLDS (literally the SDLC reversed), a data-driven development approach.
The Strategy: Don't ask "What reports do you need?" Ask "What data do we have?"
The Loop:
Integrate Data: Pick ONE subject (e.g., "Customer") and integrate it into the warehouse.
Test/bias: Give users access to this data.
Gather Requirements: Watch what they do with it.
Iterate: Add the next subject (e.g., "Sales").
The "Spiral" Methodology: You build the warehouse in small, vertical slices (3-month cycles). You might start with just Sales. Once that is stable and providing value, you add Inventory. This matches the modern Agile/Scrum philosophy perfectly.
The "Inmon Philosophy" Diploma
The Concept
The Inmon Answer (Your Mental Model)
"Where do we start?"
The Spiral (CLDS): Don't Big Bang it. Pick one subject (e.g., Sales), integrate it, normalize it, release it, learn, and then add the next subject.
"How do we model?"
3NF First: The Core Warehouse must be normalized to preserve the "Single Version of the Truth." Data Marts can be Star Schemas, but they are outputs, not the foundation.
"How do we scale?"
Granularity & Partitioning: We survive "Day n" data volumes by splitting data into "Atomic" (Raw/Deep Storage) and "Lightly Summarized" (Fast/Aggregated) layers.
"How do we manage distributed teams?"
Local vs. Global: Allow local autonomy (Local Warehouses) for speed, but enforce a "Corporate Data Model" at the intersection where data moves to the Global scope.