Wide column stores

Interactive demo of wide column stores:

https://claude.ai/public/artifacts/1d1cd6df-4184-49ef-8f8d-dc8532765365

Wide Column Stores

Wide Column Stores—also known as column-family databases—are NoSQL systems that model data using a flexible, sparse, and highly scalable structure. They originated from Google Bigtable, which formally defines its data model as a sparse, distributed, persistent, multidimensional sorted map.

This means data is indexed and retrieved based on multiple hierarchical dimensions—not only a single key. This data model is the foundation for systems like HBase, Cassandra, ScyllaDB, and Bigtable itself.

Wide Column Stores ≠ Columnar Databases

This distinction is critical:

Columnar systems (e.g., the ClickHouse database or the Parquet file format) physically store data column by column to optimize analytics and vectorized scans.
Wide column stores are “wide” because each row can hold many, varying columns, often sparsely populated; within a column family, cells are still laid out by row key.

Wide column stores are operational NoSQL systems, not analytical column stores.

Key Characteristics

1. Flexible Schema

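Rows do not need to have the same columns. A column that a row does not use is simply not stored, so no NULL placeholders are needed, and new column qualifiers can be introduced at write time in systems such as Bigtable and HBase.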

Example: User A can have 3 attributes, user B can have 200.
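As a rough illustration of sparse rows, the sketch below models a table as a plain Python dictionary in which each row holds only the columns it actually uses. The row keys, column names, and the `get_cell` helper are made up for this example and do not correspond to any particular database API.

```python
# Toy model: each row stores only the columns it actually has.
rows = {
    "user:a": {"name": "Ada", "email": "ada@example.com", "city": "London"},
    "user:b": {f"attr_{i}": i for i in range(200)},
}

def get_cell(rows, row_key, column, default=None):
    """Return one column value, or `default` if the row never stored that column."""
    return rows.get(row_key, {}).get(column, default)

# Number of stored columns per row; nothing is reserved for absent columns.
column_counts = {key: len(cols) for key, cols in rows.items()}
```

Reading a column that a row never wrote returns the default instead of a stored NULL, which is the behavior the text describes.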

2. Column Families

Columns are grouped into column families, which serve two purposes:

Logical grouping of related columns
Physical separation on disk (each column family, or CF, is stored independently)

A column is referenced as:

column_family:qualifier

The qualifier acts like a dynamically added column name.
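The column_family:qualifier naming can be sketched with two small helpers. Both function names are hypothetical, and splitting at the first colon is an assumption made for this illustration (it lets the qualifier itself contain colons).

```python
from collections import defaultdict

def split_column(name):
    """Split 'family:qualifier' at the first colon; the qualifier may contain colons."""
    family, sep, qualifier = name.partition(":")
    if not sep or not family:
        raise ValueError(f"expected 'family:qualifier', got {name!r}")
    return family, qualifier

def group_by_family(cells):
    """Group a flat {column_name: value} row into one dict per column family."""
    families = defaultdict(dict)
    for name, value in cells.items():
        family, qualifier = split_column(name)
        families[family][qualifier] = value
    return dict(families)

# Example row with two families; the names are illustrative only.
row = {
    "profile:name": "Ada",
    "profile:email": "ada@example.com",
    "anchor:cnnsi.com": "CNN",
}
grouped = group_by_family(row)
```

Grouping by family mirrors, very loosely, how a store can keep each family in its own set of files.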

3. Timestamps (Cell Versioning)

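Bigtable and HBase can keep multiple versions of the same cell, each identified by a timestamp. (Cassandra also timestamps every write, but uses the timestamp to resolve conflicting writes rather than to expose older versions.) Cell versioning is ideal for: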

time-series
auditing
"latest vs historical" reads
log-style ingestion
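A minimal sketch of a versioned cell follows. The `VersionedCell` class, its `max_versions` limit, and the `as_of` parameter are assumptions for illustration; real systems configure version retention per column family and expose it through their own APIs.

```python
import time

class VersionedCell:
    """Keeps up to `max_versions` (timestamp, value) pairs, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = []

    def put(self, value, timestamp=None):
        ts = time.time_ns() if timestamp is None else timestamp
        # A write with an existing timestamp overwrites that version.
        self._versions = [(t, v) for t, v in self._versions if t != ts]
        self._versions.append((ts, value))
        self._versions.sort(key=lambda tv: tv[0], reverse=True)
        # Discard versions beyond the retention limit (oldest go first).
        del self._versions[self.max_versions:]

    def get(self, as_of=None):
        """Return the newest value, or the newest value at or before `as_of`."""
        for ts, value in self._versions:
            if as_of is None or ts <= as_of:
                return value
        return None

    def history(self):
        return list(self._versions)
```

Calling `get()` gives the "latest" read, while `get(as_of=...)` gives the "historical" read mentioned in the list above.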

4. High Write Throughput

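The storage engine is optimized for write-heavy workloads using:

memtables (in-memory buffers that absorb incoming writes)
SSTables (immutable, sorted files that memtables are flushed to)
log-structured merge (LSM) trees, which tie the two together through background compaction

Consistency is a separate design choice rather than a storage-engine feature: Cassandra offers tunable consistency and is eventually consistent by default, while HBase and Bigtable provide strong consistency for single-row operations within a cluster.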
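To make the interplay of memtables and SSTables more concrete, here is a deliberately tiny sketch. `TinyLSM` and `memtable_limit` are invented names, and the model omits the commit log, compaction, and deletes that real engines need.

```python
import bisect

class TinyLSM:
    """Toy LSM store: writes hit an in-memory dict, which is flushed to sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        # Writes never touch existing SSTables, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into an immutable sorted run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Check the newest data first: memtable, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```

The trade-off is visible even in this toy: writes are cheap, while a read may have to consult several sorted runs, which is why real systems compact SSTables in the background.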

The Bigtable Data Model: Multidimensional Sorted Map

Bigtable, and systems based on it, store data as a map from a composite key to a value:

(row_key, column_family, column_qualifier, timestamp) → value

The key is a four-part coordinate:

Row key (sorted lexicographically)
Column family
Column qualifier (sorted lexicographically within CF)
Timestamp (sorted descending: newest first)

Why is this important?

It means data is hierarchically indexed and sorted on multiple dimensions, enabling:

efficient range scans
sparse storage
time-based retrieval
flexible per-row schemas
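The sorted-map idea can be sketched with a sorted list of composite keys. `SortedCellMap` and `scan_rows` are hypothetical names, and Python string ordering stands in for the byte-wise lexicographic ordering that real systems use; storing the negated timestamp is one simple way to make newer versions sort first.

```python
from bisect import bisect_left, insort

class SortedCellMap:
    """Toy sorted map keyed by (row, family, qualifier, -timestamp)."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = {}  # composite key -> value

    def put(self, row, family, qualifier, timestamp, value):
        key = (row, family, qualifier, -timestamp)
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def scan_rows(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            key = self._keys[i]
            row, family, qualifier, neg_ts = key
            yield row, family, qualifier, -neg_ts, self._values[key]
            i += 1
```

Because the keys stay sorted, a range scan is a binary search followed by a sequential walk, and within one cell the newest version is always encountered first.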

Example: Reversed URLs

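In the Webtable example from the Bigtable paper, crawled pages are stored using:

the URL with its hostname reversed as the row key (e.g., com.cnn.www), so pages from the same domain sort next to each other
anchors and page contents stored under column qualifiers
multiple versions of a page stored under timestamps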

This is the classic example of how Bigtable exploits multidimensional indexing.
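A small sketch of how such a row key could be derived is shown below. The function name `row_key_for` is an assumption, and the exact key format (reversed hostname followed by the path) is modeled on the com.google.maps/index.html style of key rather than taken from any production system.

```python
from urllib.parse import urlsplit

def row_key_for(url):
    """Build a row key by reversing the hostname labels and appending the path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path

urls = [
    "http://www.cnn.com",
    "https://maps.google.com/index.html",
    "http://money.cnn.com",
]
# Sorting the keys places pages from the same domain next to each other.
sorted_keys = sorted(row_key_for(u) for u in urls)
```

With plain hostnames, www.cnn.com and money.cnn.com would sort far apart; after reversal they share the com.cnn prefix and can be read with one contiguous range scan.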

Why Wide Column Stores Excel at Certain Use Cases

1. Time-Series Workloads

Rows keyed by device, user, or metric
Columns keyed by timestamps
Multiple versions per cell if desired

Ideal for logs, metrics, IoT, monitoring.
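One possible way to lay out a time series along these lines is sketched below. The `series_cell` helper, the `#` separator, and the choice to bucket rows by UTC day are all assumptions for illustration; the right bucket size depends on the workload and on the system's row-size limits.

```python
from datetime import datetime, timezone

def series_cell(device_id, ts):
    """Map a reading to (row_key, column_qualifier).

    Assumed layout: one row per device per UTC day, one column per second.
    `ts` should be timezone-aware; naive datetimes are treated as local time.
    """
    ts = ts.astimezone(timezone.utc)
    row_key = f"{device_id}#{ts:%Y%m%d}"
    qualifier = f"{ts:%H%M%S}"
    return row_key, qualifier

row_key, qualifier = series_cell(
    "sensor-7", datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone.utc)
)
```

Keeping the device identifier first in the row key means that all readings for one device and day sit in one row, with columns already ordered by time.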

2. Sparse or Heterogeneous Data

Product catalogs, user profiles, event data—where attributes vary widely.

3. High Write Throughput and Horizontal Scale

LSM-tree storage and peer-to-peer replication (Cassandra) allow:

fast ingestion
near-linear horizontal scalability
high availability

Popular Systems

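| System | Notes |
| --- | --- |
| Google Bigtable | The original; used by Google for services such as Search, Gmail, and Analytics |
| HBase | Bigtable-like; favors consistency over availability (CP); runs on top of Hadoop HDFS |
| Cassandra | Favors availability (AP) with tunable consistency; scales horizontally; used by Netflix and Apple |
| ScyllaDB | C++ reimplementation of Cassandra, designed for higher throughput and lower latency |
| DynamoDB | Managed AWS service; draws on Dynamo and Bigtable ideas, though it is usually classed as a key-value/document store |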


