Core concepts from the book "The Data Warehouse Toolkit, 3rd ed."


According to Ralph Kimball, the founder of the dimensional modeling approach, there are four key decisions (or steps) that must be made during the design of a dimensional model. These steps must be performed in this specific order to ensure the model accurately reflects business requirements.

The 4-Step Dimensional Design Process

1. Select the Business Process

The first step is to identify the specific operational activity you want to model. A business process is usually an event that your organization performs, such as "taking an order," "registering a student," or "processing a claim."

  • Key Note: Do not equate a business process with a business department (like Marketing or HR). Focus on the action/event.

2. Declare the Grain

This is widely considered the most critical step. You must define exactly what a single row in the fact table represents. This is the "atomic" level of your data.

  • Key Note: If you skip this step, you risk creating a model with mismatched statistics (e.g., trying to store annual budget data in the same table as daily sales transactions). Common examples of grain include "one row per line item on an invoice" or "one row per daily snapshot."

3. Identify the Dimensions

Once the grain is declared, you determine the "who, what, where, when, why, and how" of the business process. Dimensions provide the context for the data and are typically where the descriptive text attributes live (e.g., Customer Name, Product Category, Store Location, Transaction Date).

  • Key Note: These are the fields your users will use to filter and group their reports.

4. Identify the Facts

The final step is to identify the numeric measurements that result from the business process. These are the numbers you want to analyze, such as "Quantity Sold," "Extended Price," or "Transaction Duration."

  • Key Note: All facts must be consistent with the grain declared in Step 2. If a fact belongs to a different level of granularity (like a monthly target), it must go in a separate fact table.

Summary for Quick Reference

Step

Action

Question to Ask

1

Select Business Process

What is the operational event?

2

Declare the Grain

What does one row represent?

3

Identify Dimensions

How do we describe the data (Who, What, Where)?

4

Identify Facts

What are we measuring (Counts, Sums, Averages)?


Basic Fact Table Techniques

Here is the breakdown of the Basic Fact Table Techniques. These are the "building blocks" of dimensional modeling. If you master these definitions, you can handle almost any data scenario.

NOTE The designer’s dilemma of whether a numeric quantity is a fact or a dimension attribute is rarely a difficult decision. Continuously valued numeric observations are almost always facts; discrete numeric observations drawn from a small list are almost always dimension attributes.

1. Fact Table Structure

The physical layout of a fact table is very specific and lean. A fact table generally has only two types of columns:

  • Foreign Keys (FKs): These point to your Dimension tables (e.g., Product_ID, Date_ID, Store_ID).

  • Facts (Measures): The numeric data you want to analyze (e.g., Sales_Amount, Quantity_Sold).

  • The Rule: Fact tables contain quantitative data; Dimension tables contain descriptive data. Fact tables are deep (millions of rows) but narrow; Dimension tables are shallow (fewer rows) but wide.

2. Additive, Semi-Additive, Non-Additive Facts

This is a critical concept for preventing calculation errors in reports. You must know "when can I sum this column?"

  • Additive: Can be summed across any dimension.

    • Example: Sales_Amount. (Sales today + Sales tomorrow = Total Sales. Sales in NY + Sales in London = Total Sales).

  • Semi-Additive: Can be summed across some dimensions, but not others (usually Time).

    • Example: Account_Balance or Inventory_Level. You can sum all bank accounts to get the "Total Cash in Bank," but you cannot sum Balance across time (Balance on Jan 1st + Balance on Jan 2nd Total Balance).

  • Non-Additive: Cannot be summed at all.

    • Example: Unit_Price or Margin_Percentage. You can't sum prices; you have to average them.

3. Nulls in Fact Tables

Kimball has a very strict rule here: Avoid Nulls in Foreign Keys.

  • The Problem: If a transaction comes in without a Customer ID, and you leave it as NULL, your SQL joins will fail or behave unpredictably (inner joins drop the row).

  • The Fix: Replace the NULL with a "dummy" dimension key, such as -1 or 0, which points to a row in the Customer Dimension labeled "Unknown Customer" or "N/A". This ensures you never lose data in a report.

4. Conformed Facts

Just like "Conformed Dimensions," you must standardize your math.

  • Concept: If you have a measure called Revenue in the "Sales Fact" table and a measure called Revenue in the "Executive Dashboard" table, they must be calculated exactly the same way.

  • Why: If the CEO sees "Revenue: $1M" on one report and "Revenue: $900k" on another because one included tax and the other didn't, trust in the data team is destroyed.

5. The Three Core Fact Tables

  • Transaction: One row per event (e.g., one beep at the checkout). High volume, most detailed.

  • Periodic Snapshot: One row per period (e.g., daily inventory summary). Good for trends.

  • Accumulating Snapshot: One row per workflow (e.g., an Order). Has multiple dates (Order_Date, Ship_Date) in one row. Good for measuring lag/duration.

chevron-rightThree Fundamental Fact Table Typeshashtag

1. Transaction Fact Tables (Read Chapter 3: Retail Sales)

  • What it is: Data that records an event at a specific point in time (e.g., a scan at a checkout counter).

  • Key Concept: This is the "Hello World" of data modeling. It introduces Denormalization (flattening many tables into one Dimension).

2. Periodic Snapshot Fact Tables (Read Chapter 4: Inventory)

  • What it is: Data that measures the "status" of something at regular intervals (e.g., "How many items were in the warehouse at the end of the day?").

  • Key Concept: Semi-additive facts. You can sum inventory across products, but you can't sum it across time (Inventory on Monday + Inventory on Tuesday Total Inventory).

3. Accumulating Snapshot Fact Tables (Read Chapter 5: Procurement)

  • What it is: Data that tracks a workflow with a start and end (e.g., An Order: Placed Shipped Delivered).

  • Key Concept: This table has multiple date columns (OrderDate, ShipDate, DeliveryDate) in one row. This is crucial for calculating "lag" (how long did it take to ship?).

6. Factless Fact Tables

A "Factless" fact table is a table that has only Foreign Keys and no numeric measures.

  • Use Case 1: Event Tracking.

    • Scenario: A student attends a class. There is no "amount" to sum. You just need to record that Student A was in Class B on Date C.

    • The Table: Student_ID, Class_ID, Date_ID. If a row exists, the event happened.

  • Use Case 2: Coverage/Eligibility.

    • Scenario: Which products were on promotion today? Even if nothing sold, you need to know the product was eligible for the promo.

    • The Table: Product_ID, Promo_ID, Date_ID.

7. Aggregate Fact Tables (OLAP Cubes)

  • Concept: These are smaller, physically separate tables that store pre-calculated summaries to speed up queries.

  • Example: Instead of summing 1 billion rows of Transaction_Fact every time someone asks for "Monthly Sales," you create a Monthly_Sales_Fact table that only has 1 row per month.

  • Modern Note: As mentioned before, modern tools (dbt, materialized views) automate this, but the logic remains essential for performance.

8. Consolidated Fact Tables

  • Concept: Combining facts from different processes into one table for convenience.

  • Example: Actual vs. Budget.

    • Process A generates "Actual Sales" (Transaction Fact).

    • Process B generates "Budget/Forecast" (Periodic Snapshot Fact).

    • To compare them, you create a Consolidated Fact Table that aligns them on common dimensions (Year, Region, Product) so you can calculate Variance = Actual - Budget.


Dimension table techniques

Basic Dimension Table Techniques. These are the rules for how to build the "context" around your data.

1. Dimension Table Structure

Dimension tables are "short and wide." They have relatively fewer rows than fact tables (thousands vs. millions), but many columns (attributes).

  • Purpose: To provide the "Who, What, Where, When, Why" context for the numbers in the fact table.

  • The Rule: Dimension tables should be denormalized. Do not try to save space by creating sub-tables (snowflaking). If you have a "Product" dimension, include the "Category Name" directly in that table as a string. Do not create a separate "Category" table.

2. Dimension Surrogate Keys

  • What it is: A simple integer (1, 2, 3...) created by the data warehouse to uniquely identify a row in a dimension table.

  • Why use it?

    • Independence: If the operational system changes its IDs (e.g., switches from integer IDs to UUIDs), your warehouse doesn't break.

    • Performance: Joining on integers is faster than joining on long strings.

    • History (SCDs): If a customer changes their name, you need two rows for that customer (one for the old name, one for the new). You cannot do this if you rely on the single "Customer ID" from the source system. You need your own unique Surrogate Key for each row.

3. Natural, Durable, and Supernatural Keys

  • Natural Key: The ID from the real world/source system (e.g., Social Security Number, SKU, Email Address).

  • Durable Key: A key that never changes for a business entity, even if the source system changes. Often, the Data Warehouse team has to maintain a mapping table to ensure "John Doe" is always "Person 123" even if he switches from system A to system B.

  • Supernatural Key: A whimsical term Kimball uses for keys generated solely within the warehouse to handle permanent identification when source systems are unreliable.

4. Drilling Down

  • Concept: This is the primary action a user takes in a BI tool. "Drilling down" means grouping by a more specific attribute in a dimension.

  • Example: You start by looking at Sales by Region (North, South). You see a spike in the North. You "drill down" by adding the State attribute (NY, NJ, PA) to see exactly where the spike came from.

  • Requirement: For drill-down to work, you need hierarchical attributes (Region State City Zip) stored in the same dimension table.

5. Degenerate Dimensions

  • What it is: A dimension key that sits in the Fact Table but does not join to a dimension table.

  • Common Use Case: Transaction Numbers (Invoice ID, Order ID).

  • Why: An Invoice ID is unique to the transaction. If you built a "Dimension Table" for it, it would have 1 billion rows (same as the fact table) and no other useful columns. So, we leave the ID "stranded" in the Fact Table. It's still useful for filtering ("Find Order #555"), but it doesn't need its own table.

6. Denormalized Flattened Dimensions

  • Concept: As mentioned in "Structure," this is the practice of collapsing hierarchies into one table.

  • Visual: instead of Product Table Brand Table Category Table, you just have one big Product Table with columns: Product_Name, Brand_Name, Category_Name.

7. Multiple Hierarchies in Dimensions

  • The Problem: Sometimes a dimension has two valid paths.

    • Calendar: Day Month Year

    • Fiscal: Day Fiscal Period Fiscal Year

  • The Fix: Just store them both as columns in the same Date Dimension table. Do not create separate tables. Let the user choose which hierarchy they want to drag onto the report.

8. Flags and Indicators as Textual Attributes

  • The Rule: Avoid cryptic Boolean flags like 0 or 1.

  • The Fix: Translate them into meaningful text.

    • Instead of Is_Active: 1, store Status: "Active".

    • Instead of Has_churned: 0, store Churn_Status: "Retained".

  • Benefit: It makes dropdown filters in dashboards self-explanatory.

9. Null Attributes in Dimensions

  • The Rule: Just like in Fact Tables, avoid NULL in dimension attributes.

  • Why: If a user filters for "Category NOT EQUAL TO 'Hardware'", SQL logic will usually exclude NULL rows too, which might not be what the user intended.

  • The Fix: Replace NULL with a descriptive string like "Unknown", "Not Applicable", or "Pending".

10. Role-Playing Dimensions

  • Concept: Using the same physical dimension table multiple times in a single query for different purposes.

  • Example: A Fact Table has Order_Date_Key, Ship_Date_Key, and Delivery_Date_Key. You do not build three date tables. You build one Dim_Date table and join to it three times (aliasing it as "Order Date," "Ship Date," etc.).

11. Junk Dimensions

  • The Problem: You have 10 separate "flag" columns (e.g., Paid_Cash?, Is_Gift?, New_Customer?). If you leave them in the Fact Table, it gets wide and messy. If you make a dimension for each one, you have too many tiny tables.

  • The Fix: Combine them all into a single "Junk Dimension" (often called Dim_Transaction_Attributes). It contains every possible combination of those flags (e.g., "Cash, Gift, New").

12. Snowflaked Dimensions

AKA: "The Normalization Trap"

If you come from a traditional database background (3NF), your instinct is to avoid duplicate text.

  • The Impulse: "Why should I store the word 'Pepsi' 10,000 times in the Product table? That's wasteful! I should create a separate Brand table and just link to it with an ID."

  • The Result: You turn your simple Star Schema into a complex "Snowflake" shape.

    • Fact_Sales Dim_Product Dim_Brand Dim_Manufacturer.

Kimball’s Verdict: 🛑 AVOID.

Kimball argues that saving disk space is irrelevant compared to the cost of User Experience.

  • Performance: Joining 4 tables is slower than reading 1.

  • Usability: A business user using a drag-and-drop tool shouldn't have to hunt through 4 different tables just to find the "Manufacturer Name." They expect to find it on the "Product" object.

The Rule: Flatten your hierarchies. Store "Manufacturer," "Brand," and "Category" directly in the Dim_Product table, even if it repeats data.


13. Outrigger Dimensions

AKA: "The Permissible Snowflake"

An Outrigger is a dimension table that is referenced by another dimension table, rather than by the Fact table. It is technically snowflaking, but Kimball allows it in specific cases to maintain consistency.

The Scenario:

You have a standard definition of "County" (Name, Population, Tax Rate). This definition is used by:

  1. Dim_Store (Where is the store located?)

  2. Dim_Customer (Where does the customer live?)

  3. Dim_Warehouse (Where is the inventory?)

If you flatten "County" details into all three tables, you risk them getting out of sync (e.g., the Tax Rate updates in Dim_Store but not Dim_Customer).

The Solution:

Create a single Dim_County table.

  • Dim_Store has a foreign key pointing to Dim_County.

  • Dim_Customer has a foreign key pointing to Dim_County.

  • The Dim_County table is the "Outrigger" attached to the main dimensions.

Kimball’s Verdict: ⚠️ USE SPARINGLY.

Only use this for data sets that are:

  1. Reused across many different dimensions.

  2. Large/Complex (too big to flatten comfortably).

  3. Standardized (like Geography or Dates).

Pro Tip: The most common Outrigger is a Date Dimension referenced by a Dimension. For example, Dim_Customer might have a First_Purchase_Date_Key that points to your standard Dim_Date table.



Discuss Fact tables, dimension tables, composite keys(foreign keys) as part of fact tables, granularity of facts and their additivity(additive, semi-additive, non-additive).


Last updated