Wide column stores


Interactive demo about wide column stores:

https://claude.ai/public/artifacts/1d1cd6df-4184-49ef-8f8d-dc8532765365



Wide Column Stores—also known as column-family databases—are NoSQL systems that model data using a flexible, sparse, and highly scalable structure. They originated from Google Bigtable, which formally defines its data model as a sparse, distributed, persistent, multidimensional sorted map.

This means data is indexed and retrieved based on multiple hierarchical dimensions—not only a single key. This data model is the foundation for systems like HBase, Cassandra, ScyllaDB, and Bigtable itself.


Wide Column Stores ≠ Columnar Databases

This distinction is critical:

  • Columnar systems (e.g., the Parquet file format, the ClickHouse database) store data physically column-by-column to optimize analytics and vectorized scans.

  • Wide column stores are “wide” because each row can have many, varying columns, often sparsely populated.

They are operational NoSQL systems, not analytical column stores.
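To make the contrast concrete, here is a minimal sketch of the two layouts using plain Python containers. The names `records`, `columnar`, and `wide` and all sample values are made up for illustration; real engines use far more elaborate on-disk formats.

```python
# Hypothetical sample data, used only to contrast the two layouts.
records = [
    {"id": 1, "price": 9.5, "qty": 2},
    {"id": 2, "price": 4.0, "qty": 1},
    {"id": 3, "price": 7.25, "qty": 5},
]

# Columnar layout: one contiguous array per column, and every row has every column.
columnar = {name: [r[name] for r in records] for name in ("id", "price", "qty")}
total_qty = sum(columnar["qty"])  # an aggregate touches a single array
print(total_qty)  # 8

# Wide column layout: data is grouped by row key, and each row holds only its own columns.
wide = {
    "user_a": {"name": "Ada", "plan": "pro"},
    "user_b": {"name": "Bo", "last_login": "2024-01-01", "theme": "dark"},
}
row = wide["user_b"]  # a point lookup by row key returns that row's columns
print(sorted(row))  # ['last_login', 'name', 'theme']
```

The first layout favors scanning one attribute across many rows; the second favors reading or writing one row whose set of columns is not known in advance.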


Key Characteristics

1. Flexible Schema

Rows do not need to have the same columns. This removes the need for NULLs and allows schemas to evolve dynamically.

Example: User A can have 3 attributes, user B can have 200.
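A sparse row can be sketched as a dictionary that stores only the columns a row actually has. This is an illustrative model, not any particular database's API; `rows`, `get_cell`, and the attribute names are hypothetical.

```python
# Each row stores only the columns it has; there are no NULL placeholders.
rows = {
    "user_a": {"name": "Ada", "email": "ada@example.com", "city": "London"},
    "user_b": {f"pref_{i}": str(i) for i in range(200)},
}

def get_cell(row_key, column, default=None):
    """Return the cell value, or `default` when the row lacks that column."""
    return rows.get(row_key, {}).get(column, default)

column_counts = {key: len(cols) for key, cols in rows.items()}
print(column_counts)                  # {'user_a': 3, 'user_b': 200}
print(get_cell("user_a", "pref_7"))   # None: the column is absent, not stored as NULL
```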

2. Column Families

Columns are grouped into column families, which serve two purposes:

  • Logical grouping of related columns

  • Physical separation on disk (each CF is stored independently)

A column is referenced as:

column_family:qualifier

The qualifier acts like a dynamically added column name.
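The sketch below assumes the `family:qualifier` naming convention used by Bigtable and HBase, and models each column family as its own independent store. The family names `profile` and `activity`, and the helpers `split_column` and `put`, are hypothetical.

```python
def split_column(column):
    """Split 'family:qualifier' at the first colon; qualifiers may contain colons."""
    family, sep, qualifier = column.partition(":")
    if not sep or not family:
        raise ValueError(f"expected 'family:qualifier', got {column!r}")
    return family, qualifier

# Assumption: families are declared up front, qualifiers appear on first write.
# One dict per family stands in for "each CF is stored independently".
families = {"profile": {}, "activity": {}}

def put(row_key, column, value):
    family, qualifier = split_column(column)
    if family not in families:
        raise KeyError(f"unknown column family: {family}")
    families[family].setdefault(row_key, {})[qualifier] = value

put("user_a", "profile:email", "ada@example.com")
put("user_a", "activity:last_login", "2024-01-01")
put("user_a", "activity:url:home", "3")  # qualifier is 'url:home'

print(families["activity"]["user_a"])
# {'last_login': '2024-01-01', 'url:home': '3'}
```

Reading only `profile` data never touches the `activity` store, which is the point of the physical separation.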

3. Timestamps (Cell Versioning)

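Bigtable and HBase can retain multiple versions of the same cell, each identified by a timestamp. (Cassandra also timestamps every write, but uses the timestamp to resolve conflicts and keeps only the latest value.) Cell versioning is ideal for: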

  • time-series

  • auditing

  • "latest vs historical" reads

  • log-style ingestion
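A versioned cell can be sketched as a sorted list of timestamps with a retention limit. `VersionedCell`, `max_versions`, and `as_of` are illustrative names, not the API of Bigtable or HBase.

```python
import bisect

class VersionedCell:
    """Keeps (timestamp, value) pairs sorted by ascending timestamp."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.timestamps = []
        self.values = []

    def put(self, timestamp, value):
        i = bisect.bisect_left(self.timestamps, timestamp)
        if i < len(self.timestamps) and self.timestamps[i] == timestamp:
            self.values[i] = value  # same timestamp overwrites in place
        else:
            self.timestamps.insert(i, timestamp)
            self.values.insert(i, value)
        # Drop the oldest versions beyond the retention limit.
        excess = len(self.timestamps) - self.max_versions
        if excess > 0:
            del self.timestamps[:excess]
            del self.values[:excess]

    def get(self, as_of=None):
        """Latest value, or the newest value with timestamp <= as_of."""
        if as_of is None:
            i = len(self.timestamps)
        else:
            i = bisect.bisect_right(self.timestamps, as_of)
        return self.values[i - 1] if i > 0 else None

cell = VersionedCell(max_versions=2)
cell.put(10, "v1")
cell.put(30, "v3")
cell.put(20, "v2")          # out-of-order write; version 10 is then evicted
print(cell.get())           # v3  (latest read)
print(cell.get(as_of=25))   # v2  (historical read)
```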

4. High Write Throughput

The storage engine is optimized for write-heavy workloads using:

  • memtables

  • SSTables

  • log-structured merge (LSM) trees

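Consistency is a separate design choice rather than part of the storage engine: Cassandra favors eventual (tunable) consistency, while HBase and Bigtable offer stronger consistency.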
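The write path built from memtables and SSTables can be sketched as a toy LSM store. This is a simplified model under stated assumptions: `TinyLSM` and `memtable_limit` are made-up names, and the sketch omits the commit log, deletes (tombstones), and Bloom filters that real engines rely on.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # in-memory buffer that absorbs writes
        self.sstables = []   # newest first; each is an immutable sorted tuple of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new sorted, immutable table."""
        if self.memtable:
            self.sstables.insert(0, tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then tables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            for k, v in table:  # linear scan keeps the sketch short
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all tables into one, keeping the newest value per key."""
        merged = {}
        for table in reversed(self.sstables):  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full: flushed to the first table
db.put("a", 3)
db.put("c", 4)   # flushed to a second, newer table
print(db.get("a"))   # 3
db.compact()
print(db.sstables)   # [(('a', 3), ('b', 2), ('c', 4))]
```

Writes only ever append to memory or create new files, which is why this design sustains high ingest rates; the cost is paid later during reads and compaction.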


The Bigtable Data Model: Multidimensional Sorted Map

Bigtable—and systems based on it—store data as a:

map row_key → column_family → column_qualifier → timestamp → value

This forms a 4-dimensional coordinate:

  1. Row key (sorted lexicographically)

  2. Column family

  3. Column qualifier (sorted lexicographically within CF)

  4. Timestamp (sorted descending: newest first)
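As a rough illustration, the four-part coordinate can be modelled with nested dictionaries that are sorted at read time. The helper names `put`, `get`, and `cells` are hypothetical, the sample values are invented, and iterating column families in sorted order is an assumption of this sketch.

```python
# row_key -> column_family -> column_qualifier -> timestamp -> value
table = {}

def put(row_key, family, qualifier, timestamp, value):
    versions = (table.setdefault(row_key, {})
                     .setdefault(family, {})
                     .setdefault(qualifier, {}))
    versions[timestamp] = value

def get(row_key, family, qualifier):
    """Return all versions of one cell as (timestamp, value), newest first."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, {})
    return sorted(versions.items(), reverse=True)

def cells():
    """Yield every cell: keys ascending, timestamps descending."""
    for row_key in sorted(table):
        for family in sorted(table[row_key]):
            for qualifier in sorted(table[row_key][family]):
                versions = table[row_key][family][qualifier]
                for ts, value in sorted(versions.items(), reverse=True):
                    yield (row_key, family, qualifier, ts, value)

put("com.cnn.www", "contents", "html", 3, "<html>v1")
put("com.cnn.www", "contents", "html", 6, "<html>v2")
put("com.cnn.www", "anchor", "cnnsi.com", 9, "CNN")

print(get("com.cnn.www", "contents", "html"))
# [(6, '<html>v2'), (3, '<html>v1')]
print(next(cells()))
# ('com.cnn.www', 'anchor', 'cnnsi.com', 9, 'CNN')
```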

Why is this important?

It means data is hierarchically indexed and sorted on multiple dimensions, enabling:

  • efficient range scans

  • sparse storage

  • time-based retrieval

  • flexible per-row schemas
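Because row keys are kept in sorted order, a range scan reduces to two binary searches followed by a contiguous read. The sketch below uses Python's `bisect` on an in-memory list; the `user#`/`video#` key scheme and the helpers `scan` and `scan_prefix` are invented for illustration.

```python
import bisect

# Hypothetical row keys, kept sorted as the store would keep them.
row_keys = sorted([
    "user#alice", "user#bob", "user#carol", "video#001", "video#002",
])

def scan(start, stop):
    """Return row keys in the half-open range [start, stop)."""
    lo = bisect.bisect_left(row_keys, start)
    hi = bisect.bisect_left(row_keys, stop)
    return row_keys[lo:hi]

def scan_prefix(prefix):
    """Scan all keys with a non-empty prefix by bumping its last character."""
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return scan(prefix, stop)

print(scan("user#b", "user#d"))   # ['user#bob', 'user#carol']
print(scan_prefix("video#"))      # ['video#001', 'video#002']
```

Rows that should be read together need keys that sort next to each other, which is why row key design matters so much in these systems.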

Example: Reversed URLs

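The Bigtable paper's Webtable example stores web crawl data using:

  • reversed URL as the row key (groups pages from the same domain together)

  • page contents and anchor text stored in separate column families, with anchor qualifiers naming the referring site

  • multiple versions of a page stored under timestamps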

This is the classic example of how Bigtable exploits multidimensional indexing.
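A reversed-URL row key can be sketched with the standard library's `urlsplit`. The function name `reversed_url_key` and the sample URLs are illustrative; the exact key format Google uses beyond reversing the hostname is not specified here.

```python
from urllib.parse import urlsplit

def reversed_url_key(url):
    """Reverse the hostname components and append the path."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"URL has no hostname: {url!r}")
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + (parts.path or "/")

urls = [
    "https://www.cnn.com/",
    "https://maps.google.com/index.html",
    "https://money.cnn.com/markets",
    "https://www.google.com/",
]
keys = sorted(reversed_url_key(u) for u in urls)
for key in keys:
    print(key)
# com.cnn.money/markets
# com.cnn.www/
# com.google.maps/index.html
# com.google.www/
```

After sorting, all pages of one domain are adjacent, so a single range scan over the prefix `com.cnn.` reads every CNN subdomain.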


Why Wide Column Stores Excel at Certain Use Cases

1. Time-Series Workloads

  • Rows keyed by device, user, or metric

  • Columns keyed by timestamps

  • Multiple versions per cell if desired

Ideal for logs, metrics, IoT, monitoring.
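One way to picture this layout: each device gets one row, and every reading becomes a column whose qualifier is the reading's timestamp. The sketch below is an in-memory model only; `SeriesRow`, `record`, the device id, and the readings are hypothetical.

```python
import bisect

class SeriesRow:
    """One row per device; each column qualifier is a reading's timestamp."""

    def __init__(self):
        self.timestamps = []  # column qualifiers, kept sorted
        self.values = {}

    def add(self, ts, value):
        if ts not in self.values:
            bisect.insort(self.timestamps, ts)
        self.values[ts] = value

    def window(self, start, end):
        """Readings with start <= timestamp < end, oldest first."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_left(self.timestamps, end)
        return [(ts, self.values[ts]) for ts in self.timestamps[lo:hi]]

rows = {}

def record(device_id, ts, value):
    rows.setdefault(device_id, SeriesRow()).add(ts, value)

record("sensor-1", 160, 21.5)
record("sensor-1", 100, 21.0)
record("sensor-1", 220, 22.0)

print(rows["sensor-1"].window(100, 200))  # [(100, 21.0), (160, 21.5)]
```

In practice very long series are usually split across rows (for example by adding a day bucket to the row key), an assumption not covered by this sketch.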

2. Sparse or Heterogeneous Data

Product catalogs, user profiles, event data—where attributes vary widely.

3. High Write Throughput and Horizontal Scale

LSM-tree storage and peer-to-peer replication (Cassandra) allow:

  • fast ingestion

  • near-linear scalability as nodes are added

  • high availability


System            Notes

Google Bigtable   The original; powers Search, Gmail, Analytics
HBase             Bigtable-like, CP (strong consistency), runs on Hadoop (HDFS)
Cassandra         AP (high availability), scalable, used by Netflix/Apple
ScyllaDB          C++ reimplementation of Cassandra; designed for higher throughput and lower latency
DynamoDB          Managed service; inspired by Dynamo + Bigtable ideas

