Core concepts from the book "The Data Warehouse Toolkit, 3rd ed."
According to Ralph Kimball, the founder of the dimensional modeling approach, there are four key decisions (or steps) that must be made during the design of a dimensional model. These steps must be performed in this specific order to ensure the model accurately reflects business requirements.
The 4-Step Dimensional Design Process
1. Select the Business Process
The first step is to identify the specific operational activity you want to model. A business process is usually an event that your organization performs, such as "taking an order," "registering a student," or "processing a claim."
Key Note: Do not equate a business process with a business department (like Marketing or HR). Focus on the action/event.
2. Declare the Grain
This is widely considered the most critical step. You must define exactly what a single row in the fact table represents. This is the "atomic" level of your data.
Key Note: If you skip this step, you risk creating a model with mismatched statistics (e.g., trying to store annual budget data in the same table as daily sales transactions). Common examples of grain include "one row per line item on an invoice" or "one row per daily snapshot."
3. Identify the Dimensions
Once the grain is declared, you determine the "who, what, where, when, why, and how" of the business process. Dimensions provide the context for the data and are typically where the descriptive text attributes live (e.g., Customer Name, Product Category, Store Location, Transaction Date).
Key Note: These are the fields your users will use to filter and group their reports.
4. Identify the Facts
The final step is to identify the numeric measurements that result from the business process. These are the numbers you want to analyze, such as "Quantity Sold," "Extended Price," or "Transaction Duration."
Key Note: All facts must be consistent with the grain declared in Step 2. If a fact belongs to a different level of granularity (like a monthly target), it must go in a separate fact table.
Summary for Quick Reference
Step
Action
Question to Ask
1
Select Business Process
What is the operational event?
2
Declare the Grain
What does one row represent?
3
Identify Dimensions
How do we describe the data (Who, What, Where)?
4
Identify Facts
What are we measuring (Counts, Sums, Averages)?
Basic Fact Table Techniques
Here is the breakdown of the Basic Fact Table Techniques. These are the "building blocks" of dimensional modeling. If you master these definitions, you can handle almost any data scenario.
NOTE The designer’s dilemma of whether a numeric quantity is a fact or a dimension attribute is rarely a difficult decision. Continuously valued numeric observations are almost always facts; discrete numeric observations drawn from a small list are almost always dimension attributes.
1. Fact Table Structure
The physical layout of a fact table is very specific and lean. A fact table generally has only two types of columns:
Foreign Keys (FKs): These point to your Dimension tables (e.g.,
Product_ID,Date_ID,Store_ID).Facts (Measures): The numeric data you want to analyze (e.g.,
Sales_Amount,Quantity_Sold).The Rule: Fact tables contain quantitative data; Dimension tables contain descriptive data. Fact tables are deep (millions of rows) but narrow; Dimension tables are shallow (fewer rows) but wide.
2. Additive, Semi-Additive, Non-Additive Facts
This is a critical concept for preventing calculation errors in reports. You must know "when can I sum this column?"
Additive: Can be summed across any dimension.
Example:
Sales_Amount. (Sales today + Sales tomorrow = Total Sales. Sales in NY + Sales in London = Total Sales).
Semi-Additive: Can be summed across some dimensions, but not others (usually Time).
Example:
Account_BalanceorInventory_Level. You can sum all bank accounts to get the "Total Cash in Bank," but you cannot sumBalanceacross time (Balance on Jan 1st + Balance on Jan 2nd Total Balance).
Non-Additive: Cannot be summed at all.
Example:
Unit_PriceorMargin_Percentage. You can't sum prices; you have to average them.
3. Nulls in Fact Tables
Kimball has a very strict rule here: Avoid Nulls in Foreign Keys.
The Problem: If a transaction comes in without a Customer ID, and you leave it as
NULL, your SQL joins will fail or behave unpredictably (inner joins drop the row).The Fix: Replace the
NULLwith a "dummy" dimension key, such as-1or0, which points to a row in the Customer Dimension labeled "Unknown Customer" or "N/A". This ensures you never lose data in a report.
4. Conformed Facts
Just like "Conformed Dimensions," you must standardize your math.
Concept: If you have a measure called
Revenuein the "Sales Fact" table and a measure calledRevenuein the "Executive Dashboard" table, they must be calculated exactly the same way.Why: If the CEO sees "Revenue: $1M" on one report and "Revenue: $900k" on another because one included tax and the other didn't, trust in the data team is destroyed.
5. The Three Core Fact Tables
Transaction: One row per event (e.g., one beep at the checkout). High volume, most detailed.
Periodic Snapshot: One row per period (e.g., daily inventory summary). Good for trends.
Accumulating Snapshot: One row per workflow (e.g., an Order). Has multiple dates (
Order_Date,Ship_Date) in one row. Good for measuring lag/duration.
Three Fundamental Fact Table Types
1. Transaction Fact Tables (Read Chapter 3: Retail Sales)
What it is: Data that records an event at a specific point in time (e.g., a scan at a checkout counter).
Key Concept: This is the "Hello World" of data modeling. It introduces Denormalization (flattening many tables into one Dimension).
2. Periodic Snapshot Fact Tables (Read Chapter 4: Inventory)
What it is: Data that measures the "status" of something at regular intervals (e.g., "How many items were in the warehouse at the end of the day?").
Key Concept: Semi-additive facts. You can sum inventory across products, but you can't sum it across time (Inventory on Monday + Inventory on Tuesday Total Inventory).
3. Accumulating Snapshot Fact Tables (Read Chapter 5: Procurement)
What it is: Data that tracks a workflow with a start and end (e.g., An Order: Placed Shipped Delivered).
Key Concept: This table has multiple date columns (
OrderDate,ShipDate,DeliveryDate) in one row. This is crucial for calculating "lag" (how long did it take to ship?).
6. Factless Fact Tables
A "Factless" fact table is a table that has only Foreign Keys and no numeric measures.
Use Case 1: Event Tracking.
Scenario: A student attends a class. There is no "amount" to sum. You just need to record that Student A was in Class B on Date C.
The Table:
Student_ID,Class_ID,Date_ID. If a row exists, the event happened.
Use Case 2: Coverage/Eligibility.
Scenario: Which products were on promotion today? Even if nothing sold, you need to know the product was eligible for the promo.
The Table:
Product_ID,Promo_ID,Date_ID.
7. Aggregate Fact Tables (OLAP Cubes)
Concept: These are smaller, physically separate tables that store pre-calculated summaries to speed up queries.
Example: Instead of summing 1 billion rows of
Transaction_Factevery time someone asks for "Monthly Sales," you create aMonthly_Sales_Facttable that only has 1 row per month.Modern Note: As mentioned before, modern tools (dbt, materialized views) automate this, but the logic remains essential for performance.
8. Consolidated Fact Tables
Concept: Combining facts from different processes into one table for convenience.
Example: Actual vs. Budget.
Process A generates "Actual Sales" (Transaction Fact).
Process B generates "Budget/Forecast" (Periodic Snapshot Fact).
To compare them, you create a Consolidated Fact Table that aligns them on common dimensions (Year, Region, Product) so you can calculate
Variance = Actual - Budget.
Dimension table techniques
Basic Dimension Table Techniques. These are the rules for how to build the "context" around your data.
1. Dimension Table Structure
Dimension tables are "short and wide." They have relatively fewer rows than fact tables (thousands vs. millions), but many columns (attributes).
Purpose: To provide the "Who, What, Where, When, Why" context for the numbers in the fact table.
The Rule: Dimension tables should be denormalized. Do not try to save space by creating sub-tables (snowflaking). If you have a "Product" dimension, include the "Category Name" directly in that table as a string. Do not create a separate "Category" table.
2. Dimension Surrogate Keys
What it is: A simple integer (1, 2, 3...) created by the data warehouse to uniquely identify a row in a dimension table.
Why use it?
Independence: If the operational system changes its IDs (e.g., switches from integer IDs to UUIDs), your warehouse doesn't break.
Performance: Joining on integers is faster than joining on long strings.
History (SCDs): If a customer changes their name, you need two rows for that customer (one for the old name, one for the new). You cannot do this if you rely on the single "Customer ID" from the source system. You need your own unique Surrogate Key for each row.
3. Natural, Durable, and Supernatural Keys
Natural Key: The ID from the real world/source system (e.g., Social Security Number, SKU, Email Address).
Durable Key: A key that never changes for a business entity, even if the source system changes. Often, the Data Warehouse team has to maintain a mapping table to ensure "John Doe" is always "Person 123" even if he switches from system A to system B.
Supernatural Key: A whimsical term Kimball uses for keys generated solely within the warehouse to handle permanent identification when source systems are unreliable.
4. Drilling Down
Concept: This is the primary action a user takes in a BI tool. "Drilling down" means grouping by a more specific attribute in a dimension.
Example: You start by looking at Sales by Region (North, South). You see a spike in the North. You "drill down" by adding the State attribute (NY, NJ, PA) to see exactly where the spike came from.
Requirement: For drill-down to work, you need hierarchical attributes (Region State City Zip) stored in the same dimension table.
5. Degenerate Dimensions
What it is: A dimension key that sits in the Fact Table but does not join to a dimension table.
Common Use Case: Transaction Numbers (Invoice ID, Order ID).
Why: An Invoice ID is unique to the transaction. If you built a "Dimension Table" for it, it would have 1 billion rows (same as the fact table) and no other useful columns. So, we leave the ID "stranded" in the Fact Table. It's still useful for filtering ("Find Order #555"), but it doesn't need its own table.
6. Denormalized Flattened Dimensions
Concept: As mentioned in "Structure," this is the practice of collapsing hierarchies into one table.
Visual: instead of
Product TableBrand TableCategory Table, you just have one bigProduct Tablewith columns:Product_Name,Brand_Name,Category_Name.
7. Multiple Hierarchies in Dimensions
The Problem: Sometimes a dimension has two valid paths.
Calendar: Day Month Year
Fiscal: Day Fiscal Period Fiscal Year
The Fix: Just store them both as columns in the same Date Dimension table. Do not create separate tables. Let the user choose which hierarchy they want to drag onto the report.
8. Flags and Indicators as Textual Attributes
The Rule: Avoid cryptic Boolean flags like
0or1.The Fix: Translate them into meaningful text.
Instead of
Is_Active: 1, storeStatus: "Active".Instead of
Has_churned: 0, storeChurn_Status: "Retained".
Benefit: It makes dropdown filters in dashboards self-explanatory.
9. Null Attributes in Dimensions
The Rule: Just like in Fact Tables, avoid
NULLin dimension attributes.Why: If a user filters for "Category NOT EQUAL TO 'Hardware'", SQL logic will usually exclude NULL rows too, which might not be what the user intended.
The Fix: Replace
NULLwith a descriptive string like "Unknown", "Not Applicable", or "Pending".
10. Role-Playing Dimensions
Concept: Using the same physical dimension table multiple times in a single query for different purposes.
Example: A Fact Table has
Order_Date_Key,Ship_Date_Key, andDelivery_Date_Key. You do not build three date tables. You build oneDim_Datetable and join to it three times (aliasing it as "Order Date," "Ship Date," etc.).
11. Junk Dimensions
The Problem: You have 10 separate "flag" columns (e.g.,
Paid_Cash?,Is_Gift?,New_Customer?). If you leave them in the Fact Table, it gets wide and messy. If you make a dimension for each one, you have too many tiny tables.The Fix: Combine them all into a single "Junk Dimension" (often called
Dim_Transaction_Attributes). It contains every possible combination of those flags (e.g., "Cash, Gift, New").
12. Snowflaked Dimensions
AKA: "The Normalization Trap"
If you come from a traditional database background (3NF), your instinct is to avoid duplicate text.
The Impulse: "Why should I store the word 'Pepsi' 10,000 times in the Product table? That's wasteful! I should create a separate
Brandtable and just link to it with an ID."The Result: You turn your simple Star Schema into a complex "Snowflake" shape.
Fact_SalesDim_ProductDim_BrandDim_Manufacturer.
Kimball’s Verdict: 🛑 AVOID.
Kimball argues that saving disk space is irrelevant compared to the cost of User Experience.
Performance: Joining 4 tables is slower than reading 1.
Usability: A business user using a drag-and-drop tool shouldn't have to hunt through 4 different tables just to find the "Manufacturer Name." They expect to find it on the "Product" object.
The Rule: Flatten your hierarchies. Store "Manufacturer," "Brand," and "Category" directly in the Dim_Product table, even if it repeats data.
13. Outrigger Dimensions
AKA: "The Permissible Snowflake"
An Outrigger is a dimension table that is referenced by another dimension table, rather than by the Fact table. It is technically snowflaking, but Kimball allows it in specific cases to maintain consistency.
The Scenario:
You have a standard definition of "County" (Name, Population, Tax Rate). This definition is used by:
Dim_Store (Where is the store located?)
Dim_Customer (Where does the customer live?)
Dim_Warehouse (Where is the inventory?)
If you flatten "County" details into all three tables, you risk them getting out of sync (e.g., the Tax Rate updates in Dim_Store but not Dim_Customer).
The Solution:
Create a single Dim_County table.
Dim_Storehas a foreign key pointing toDim_County.Dim_Customerhas a foreign key pointing toDim_County.The
Dim_Countytable is the "Outrigger" attached to the main dimensions.
Kimball’s Verdict: ⚠️ USE SPARINGLY.
Only use this for data sets that are:
Reused across many different dimensions.
Large/Complex (too big to flatten comfortably).
Standardized (like Geography or Dates).
Pro Tip: The most common Outrigger is a Date Dimension referenced by a Dimension. For example,
Dim_Customermight have aFirst_Purchase_Date_Keythat points to your standardDim_Datetable.
Discuss Fact tables, dimension tables, composite keys(foreign keys) as part of fact tables, granularity of facts and their additivity(additive, semi-additive, non-additive).
Last updated