# One Big Table (OBT)

***

[A short video about it](https://www.youtube.com/watch?v=1eB1Sy2rHtk)

***

### Wide Tables / One-Big-Table — What They Are

A **Wide Table** (also called **One Big Table**, **Single Table**, **Flattened Table**, or **Denormalized Table**) is a modeling strategy where **you put almost all fields into a single, very wide, denormalized table**, often with:

* Hundreds or thousands of columns
* Nested structures (JSON, arrays)
* Sparse data (many nulls)
* Very few joins — ideally none

The grain is usually something like:

* order
* customer-day
* session
* event
* product snapshot
* transaction

…and everything related to that grain is embedded directly into the same row.

This approach became popular in the era of cloud columnar databases.

***

### Why Wide Tables Exist (Modern Reasons)

#### Cheap storage

In cloud data warehouses:

* Storage is extremely cheap
* Compute is separate from storage
* It’s cheaper to store denormalized data than to engineer a perfectly normalized model

This removes much of the storage-cost pressure that historically motivated strict Kimball/Inmon modeling.

***

#### Columnar storage makes sparse data fast

Columnar engines (BigQuery, Snowflake, Redshift, Databricks) read only the columns used in a query.

If 95% of columns are null for most rows:

* null columns cost almost nothing
* column pruning makes scans fast
* you don’t pay to read unused fields

This makes wide schemas efficient — something that relational row-based engines struggled with.
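
As a rough illustration of column pruning, here is a minimal sketch using pyarrow against a hypothetical Parquet file; the path and column names are made up. Only the requested columns are read from disk, no matter how wide the table is.

```python
import pyarrow.parquet as pq

# Read only three columns from a (hypothetical) wide Parquet table.
# The remaining hundreds of columns, sparse or not, are never touched
# on disk, which is what makes the wide layout cheap to scan.
orders = pq.read_table(
    "warehouse/orders_wide.parquet",                       # hypothetical path
    columns=["OrderDate", "CampaignSource", "TotalAmount"],
)

# Aggregate over just the columns that were actually materialized.
revenue_by_day = orders.group_by("OrderDate").aggregate([("TotalAmount", "sum")])
print(revenue_by_day)
```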

***

#### Flexible and evolving schemas

Modern systems produce:

* JSON
* events
* semi-structured logs
* rapidly evolving fields

Adding new attributes constantly is common.

In columnar storage:

* adding a column = metadata change
* data for that column is stored only when written

This enables **schema evolution** with almost no overhead.
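
A minimal sketch of what that looks like in practice, assuming a Delta Lake table written from PySpark; the paths and the new `loyalty_tier` column are hypothetical. Appending with `mergeSchema` adds the column via a metadata change, and existing rows simply read back as null for it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A new batch arrives carrying an attribute the table has never seen.
new_batch = (
    spark.read.json("landing/orders/2024-06-01/")   # hypothetical path
    .withColumn("loyalty_tier", F.lit("gold"))      # hypothetical new column
)

# Appending with schema merge registers the new column in the table's
# metadata; older files are not rewritten, and the column is null there.
(new_batch.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("warehouse/orders_wide"))
```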

***

#### Fewer joins = faster queries

Joins are expensive:

* shuffle in massive distributed systems
* potential for skew
* SQL complexity
* operational overhead

A wide table removes joins from query time entirely; the joins are paid once, when the table is built.

Analytics teams often prefer:

> “Give me everything in one place and let me query.”

This often runs *faster* than joining 10–15 tables.
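
To make the contrast concrete, here is a hypothetical pair of queries over illustrative table and column names: the dimensional version joins before aggregating, while the wide-table version is a single scan of a few pruned columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dimensional style: the aggregate needs joins (and therefore shuffles).
star = spark.sql("""
    SELECT d.order_month, c.segment, SUM(f.total_amount) AS revenue
    FROM fact_orders f
    JOIN dim_customer c ON f.customer_key   = c.customer_key
    JOIN dim_date     d ON f.order_date_key = d.date_key
    GROUP BY d.order_month, c.segment
""")

# Wide-table style: the same answer from one table, no join at query time.
wide = spark.sql("""
    SELECT date_trunc('month', OrderDate) AS order_month,
           CustomerSegment,
           SUM(TotalAmount) AS revenue
    FROM orders_wide
    GROUP BY date_trunc('month', OrderDate), CustomerSegment
""")
```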

***

### How Wide Tables Are Built

Typically you ingest raw data and then:

* flatten nested JSON
* join related entities (facts + dimensions)
* explode arrays if necessary
* produce a wide table keyed by a business grain

Over time, engineers append more fields based on new requirements:

| Step | Action                              |
| ---- | ----------------------------------- |
| 1    | Extract raw data                    |
| 2    | Add new columns as new data appears |
| 3    | Denormalize relationships           |
| 4    | Keep building horizontally          |
| 5    | End up with 200–2000 columns        |

This is **incremental and emergent**, not pre-modeled.
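
A minimal PySpark sketch of the steps above, assuming raw order events arrive as JSON with a nested `items` array and that a cleaned customer table already exists; every name and path here is illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Step 1: extract raw data (hypothetical paths).
orders_raw = spark.read.json("raw/order_events/")
customers  = spark.read.parquet("silver/customers/")

# Steps 2-3: flatten nested JSON, explode the items array, and
# denormalize the customer attributes onto every row.
order_items = (
    orders_raw
    .withColumn("item", F.explode("items"))              # one row per item
    .select(
        "order_id", "customer_id", "order_date",
        F.col("item.sku").alias("sku"),
        F.col("item.quantity").alias("quantity"),
        F.col("payment.method").alias("payment_method"),
    )
)

# Steps 4-5: keep joining / appending columns over time and persist
# the result as a columnar, query-ready wide table.
orders_wide = order_items.join(customers, on="customer_id", how="left")
orders_wide.write.format("delta").mode("overwrite").save("obt/orders_wide")
```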

***

### Example: Wide Table for Orders

Imagine a table with grain = **(OrderID, CustomerID, OrderDate)**.

It may contain:

* Customer name, email, demographics
* Product info
* Shipping address
* Payment details
* Discounts
* Channel info
* Loyalty tier
* Derived metrics
* Arrays of purchased items
* Nested JSON from events
* Flags, booleans, nulls
* Metadata fields
* Dozens of timestamps

You might end up with a table like:

```
OrderID | CustomerID | OrderDate | TotalAmount | CustomerName | Email |
Phone | CampaignSource | Items (array) | AddressJSON | ... (hundreds of fields)
```

This is perfectly valid in a cloud columnar system.
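
If the schema is declared explicitly, a small slice of it might look like the following Spark schema; only a handful of the hypothetical columns are shown, including a nested array of item structs and raw JSON kept as a plain string.

```python
from pyspark.sql.types import (
    ArrayType, DoubleType, StringType, StructField, StructType, TimestampType,
)

# A few of the hundreds of columns in the hypothetical wide orders table.
orders_wide_schema = StructType([
    StructField("OrderID", StringType()),
    StructField("CustomerID", StringType()),
    StructField("OrderDate", TimestampType()),
    StructField("TotalAmount", DoubleType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("CampaignSource", StringType()),
    StructField("Items", ArrayType(StructType([        # nested array of structs
        StructField("sku", StringType()),
        StructField("quantity", DoubleType()),
        StructField("unit_price", DoubleType()),
    ]))),
    StructField("AddressJSON", StringType()),           # nested JSON kept raw
    # ... hundreds more columns appended over time
])
```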

***

### Strengths of Wide Tables

#### Extreme simplicity

* Analysts can query one place
* No complex joins
* No dimensional modeling
* No conformed dimensions
* No integration headaches

***

#### Great performance for typical BI workloads

Analytical queries typically:

* touch only 5–20 columns
* aggregate by date, customer, or product

Columnar pruning makes this extremely efficient.

***

#### Perfect for:

* event-driven systems
* clickstream data
* ecommerce order pipelines
* log analysis
* real-time pipelines
* machine learning feature store–style queries
* denormalized snapshots

***

### Weaknesses and Criticisms

#### Loss of business logic

This is the big one.

Kimball insists on:

* conformed dimensions
* shared business definitions
* carefully modeled facts

Wide tables ignore this entirely.

This leads to:

* duplicated logic
* inconsistent metric definitions
* unclear lineage
* hard-to-govern schemas
* ambiguous meaning of fields

***

#### Update performance

Updating nested fields or arrays can be slow:

* replacing a nested array may rewrite a whole block
* partial updates are often inefficient

***

#### Harder long-term maintenance

As columns grow:

* documentation becomes unwieldy
* quality varies
* implicit relationships vanish
* governance tools struggle

The table becomes a “junk drawer” for everything.

***

#### Hard to ensure correctness

Without modeling:

* you don’t enforce integrity
* business rules become scattered
* analysts may interpret fields differently

This can cause metric drift.

***

### When Wide Tables Make Sense

Use them when:

* Speed > correctness
* Agility > governance
* Schema evolves rapidly
* Data is semi-structured
* You’re building prototypes or exploratory analytics
* You need very fast BI queries
* The organization is small or doesn’t have formal modeling resources
* Data arrives from streaming sources or events

In many modern orgs, wide tables serve as the **source for downstream data marts**, not the final form.

***

### When NOT to Use Them

Avoid them when:

* You need conformed metrics
* Integrity and governance matter
* Many teams depend on consistent definitions
* You have many changing business rules
* You need traceability over time
* Data is used for finance or regulatory reporting

For these, Kimball, Inmon, or Data Vault 2.0 feeding star schemas is the safer choice.

***

### Wide Table as Part of a Modern Stack

Typical pipeline:

**Raw → Wide Table (flattened) → BI Views / Aggregates / ML Features**

or

**Data Vault → Wide Table (Consumption Layer)**

or

**Bronze → Silver (wide tables) → Gold (dimensional models)**\
(in Lakehouse style)

Wide tables often serve as the **middle / flexible zone**.

***

### Summary

Wide tables are:

* denormalized
* massive
* sparse
* columnar-friendly
* schema-evolving
* fast for analytics
* low-governance
* high-flexibility
* low-logic

They represent the **relaxed, cloud-era alternative** to strict modeling.

They're incredibly useful — **in the right contexts** — but cannot replace rigorous modeling where consistency and governance matter.

***

### Concrete Build Pipeline Example (Bronze → Silver → OBT → Gold)

| Layer            | Purpose                                                | Storage / Example                         |
| ---------------- | ------------------------------------------------------ | ----------------------------------------- |
| Bronze           | Raw ingestion, no modeling, immutable                  | Raw JSON/CSV in cloud storage, event logs |
| Silver           | Cleaned, flattened, deduplicated data                  | Flattened Parquet/Delta tables            |
| OBT / Wide Table | Denormalized, wide, query-ready table                  | Columnar table with 100–1000s of columns  |
| Gold             | Business-facing aggregates, BI dashboards, ML features | Star schema, aggregated views, dashboards |

**Step-by-Step Example:**

1. **Bronze:**
   * Raw ingestion from multiple sources.
   * Keep every field and nested structure.
   * Example: raw order events, JSON payloads.
2. **Silver:**
   * Flatten JSON arrays, clean nulls, normalize timestamps.
   * Deduplicate events if necessary.
   * Example: customer order + item details flattened into one row per item.
3. **Wide Table / OBT:**
   * Denormalize facts + dimensions into one huge table.
   * Include customer info, order info, product info, campaign info, etc.
   * Columnar storage allows sparse columns and fast analytics.
4. **Gold:**
   * Aggregate key metrics for dashboards.
   * Build star schemas or ML-ready feature tables.
   * Example: Sales by region, top products per customer segment.
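
A condensed PySpark sketch of those four layers, expanding the earlier build sketch into explicit layers and assuming Delta storage; all paths, columns, and the revenue metric are made up for illustration, and in practice each layer would be its own job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw order events untouched.
bronze = spark.read.json("landing/order_events/")
bronze.write.format("delta").mode("append").save("bronze/order_events")

# Silver: flatten, clean, deduplicate; one row per order item.
silver = (
    spark.read.format("delta").load("bronze/order_events")
    .withColumn("item", F.explode("items"))
    .select(
        "order_id", "customer_id",
        F.to_timestamp("order_date").alias("order_ts"),
        F.col("item.sku").alias("sku"),
        F.col("item.quantity").alias("quantity"),
        F.col("item.unit_price").alias("unit_price"),
    )
    .dropDuplicates(["order_id", "sku"])
)
silver.write.format("delta").mode("overwrite").save("silver/order_items")

# OBT: denormalize the customer dimension onto the item grain.
customers = spark.read.format("delta").load("silver/customers")
obt = silver.join(customers, on="customer_id", how="left")
obt.write.format("delta").mode("overwrite").save("obt/orders_wide")

# Gold: business-facing aggregate, e.g. revenue by region and month.
gold = (
    obt.groupBy(F.date_trunc("month", "order_ts").alias("month"), "region")
       .agg(F.sum(F.col("quantity") * F.col("unit_price")).alias("revenue"))
)
gold.write.format("delta").mode("overwrite").save("gold/revenue_by_region_month")
```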

***
