Wide column stores
Interactive demo of wide column stores:
https://claude.ai/public/artifacts/1d1cd6df-4184-49ef-8f8d-dc8532765365
Wide Column Stores
Wide Column Stores—also known as column-family databases—are NoSQL systems that model data using a flexible, sparse, and highly scalable structure. They originated from Google Bigtable, which formally defines its data model as a sparse, distributed, persistent, multidimensional sorted map.
This means data is indexed and retrieved based on multiple hierarchical dimensions—not only a single key. This data model is the foundation for systems like HBase, Cassandra, ScyllaDB, and Bigtable itself.
Wide Column Stores ≠ Columnar Databases
This distinction is critical:
Columnar systems (e.g., the ClickHouse database or the Parquet file format) store data physically column-by-column to optimize analytical scans and vectorized execution.
Wide column stores are “wide” because each row can have many, varying columns, often sparsely populated.
They are operational NoSQL systems, not analytical column stores.
Key Characteristics
1. Flexible Schema
Rows do not need to have the same columns. This removes the need for NULLs and allows schemas to evolve dynamically.
Example: User A can have 3 attributes, user B can have 200.
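The sparse, per-row schema can be pictured with a minimal in-memory sketch. The `rows` dictionary, the column names, and the helper functions below are hypothetical stand-ins, not any particular database's API; the point is only that each row stores the columns it actually has.

```python
# Hypothetical in-memory stand-in for a sparse table: each row stores only
# the columns it actually has, so no NULL placeholders are needed.
rows = {
    "user_a": {"name": "Ada", "email": "ada@example.com", "plan": "free"},
    "user_b": {f"pref_{i}": str(i) for i in range(200)},
}


def get_cell(table, row_key, column, default=None):
    """Return the cell value, or `default` if the row or column is absent."""
    return table.get(row_key, {}).get(column, default)


def column_count(table, row_key):
    """Number of columns physically present in a row."""
    return len(table.get(row_key, {}))
```

Reading a column that a row never wrote simply returns the default; nothing was stored for it.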
2. Column Families
Columns are grouped into column families, which serve two purposes:
Logical grouping of related columns
Physical separation on disk (each CF is stored independently)
A column is referenced as column_family:qualifier.
The qualifier acts like a dynamically added column name.
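A rough sketch of this grouping, assuming the `family:qualifier` naming used by Bigtable and HBase. The `FamilyStore` class and the family names are hypothetical; a separate dictionary per family stands in for the separate on-disk storage, and qualifiers are created on first write.

```python
from collections import defaultdict


class FamilyStore:
    """Toy table that keeps one independent store per column family."""

    def __init__(self, families):
        # Families are declared up front; qualifiers are not.
        self.families = {name: defaultdict(dict) for name in families}

    def put(self, row_key, column, value):
        # Split "family:qualifier" at the first colon.
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.families[family][row_key][qualifier] = value

    def get_family(self, row_key, family):
        """Return all qualifiers of one family for a row, touching no other family."""
        return dict(self.families[family].get(row_key, {}))
```

Writing to a new qualifier such as `profile:nickname` works without any schema change, while writing to an undeclared family is rejected.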
3. Timestamps (Cell Versioning)
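Bigtable and HBase can keep multiple versions of the same cell, each identified by a timestamp. (Cassandra also timestamps every cell, but uses the timestamp to resolve conflicting writes and keeps only the latest value.) Cell versioning is ideal for: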
time-series
auditing
"latest vs historical" reads
log-style ingestion
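Cell versioning can be sketched as a small class that maps timestamps to values. `VersionedCell` and its `max_versions` parameter are hypothetical names for illustration; real systems expose similar retention settings, but this is not their API.

```python
class VersionedCell:
    """Toy cell that keeps up to `max_versions` timestamped values."""

    def __init__(self, max_versions=3):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        self._versions[timestamp] = value
        # Drop the oldest versions once the retention limit is exceeded.
        for ts in sorted(self._versions)[:-self.max_versions]:
            del self._versions[ts]

    def latest(self):
        """Value with the highest timestamp, or None if the cell is empty."""
        if not self._versions:
            return None
        return self._versions[max(self._versions)]

    def as_of(self, timestamp):
        """Newest value written at or before `timestamp`, or None."""
        candidates = [ts for ts in self._versions if ts <= timestamp]
        return self._versions[max(candidates)] if candidates else None

    def history(self):
        """All retained versions, newest first."""
        return sorted(self._versions.items(), reverse=True)
```

`latest()` serves the common "current value" read, while `as_of()` and `history()` cover the historical and auditing reads listed above.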
4. High Write Throughput
The storage engine is optimized for write-heavy workloads using:
memtables
SSTables
log-structured merge (LSM) trees
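Consistency is a separate design choice from the storage engine: Cassandra is tunable and eventually consistent by default, while HBase and Bigtable offer stronger consistency.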
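The write path behind this throughput can be sketched in a few lines. `TinyLSM` and `memtable_limit` are hypothetical names, and the sketch deliberately omits the commit log, compaction, and tombstones that real engines need. Writes go to an in-memory memtable; when it fills up, it is flushed as an immutable sorted run (the SSTable); reads check the memtable first, then the runs from newest to oldest.

```python
import bisect


class TinyLSM:
    """Toy LSM store: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # oldest first; each is (sorted_keys, values)

    def put(self, key, value):
        # Writes only touch memory, which is what makes them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable run."""
        if not self.memtable:
            return
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.sstables.append((keys, values))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newer runs shadow older ones, so search newest first.
        for keys, values in reversed(self.sstables):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None
```

Because flushed runs are never modified, an update to an existing key is just another write; the newest copy wins at read time.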
The Bigtable Data Model: Multidimensional Sorted Map
Bigtable—and systems based on it—store data as a:
map row_key → column_family → column_qualifier → timestamp → value
This forms a 4-dimensional coordinate:
Row key (sorted lexicographically)
Column family
Column qualifier (sorted lexicographically within CF)
Timestamp (sorted descending: newest first)
Why is this important?
It means data is hierarchically indexed and sorted on multiple dimensions, enabling:
efficient range scans
sparse storage
time-based retrieval
flexible per-row schemas
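A minimal sketch of the sorted map, assuming string keys and integer timestamps. `SortedTable` is a hypothetical name, and the sample row keys are illustrative. Each cell is stored under the tuple (row, family, qualifier, -timestamp), so ordinary tuple ordering gives lexicographic rows and qualifiers with the newest version first.

```python
import bisect


class SortedTable:
    """Toy multidimensional sorted map over (row, family, qualifier, timestamp)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, family, qualifier, -timestamp)
        self._values = {}  # same tuple -> value

    def put(self, row, family, qualifier, timestamp, value):
        # Negating the timestamp makes newer versions sort first.
        key = (row, family, qualifier, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_rows(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in sorted order."""
        # A 1-tuple sorts before every full key that shares its row.
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            key = self._keys[i]
            row, family, qualifier, neg_ts = key
            yield row, family, qualifier, -neg_ts, self._values[key]
            i += 1
```

A range scan jumps to the first matching row with a binary search and then reads forward, which is why choosing row keys that keep related data adjacent matters so much in these systems.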
Example: Reversed URLs
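The Webtable example from the Bigtable paper stores web crawl data using:
the reversed URL as the row key, e.g. com.cnn.www (keeps pages from the same domain next to each other)
page contents and anchor text stored in separate column families, with anchor qualifiers naming the referring site
multiple crawled versions of a page stored under timestamps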
This is the classic example of how Bigtable exploits multidimensional indexing.
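The key transformation itself is simple. The helper below is a hypothetical sketch, not Google's code: it reverses the hostname components and appends the path, so that sorting the keys places pages from the same domain side by side.

```python
from urllib.parse import urlsplit


def reversed_host_key(url):
    """Build a row key with the hostname components reversed, e.g. com.cnn.www/path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path
```

For instance, `maps.google.com` and `mail.google.com` become `com.google.maps` and `com.google.mail`, which sort next to each other, whereas the original hostnames would be separated by every other site starting with "mai" through "map".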
Why Wide Column Stores Excel at Certain Use Cases
1. Time-Series Workloads
Rows keyed by device, user, or metric
Columns keyed by timestamps
Multiple versions per cell if desired
Ideal for logs, metrics, IoT, monitoring.
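One way to picture this layout: the row key is the device, and each reading becomes a column whose qualifier is the timestamp. The function names and the zero-padded, 10-digit epoch-seconds format are assumptions made for this sketch; padding keeps lexicographic order identical to chronological order.

```python
def ts_qualifier(epoch_seconds):
    """Zero-pad so that string order matches numeric order."""
    return f"{epoch_seconds:010d}"


def record(table, device_id, epoch_seconds, reading):
    """Store one reading as a new column in the device's row."""
    table.setdefault(device_id, {})[ts_qualifier(epoch_seconds)] = reading


def read_range(table, device_id, start, stop):
    """Return (timestamp, reading) pairs with start <= timestamp < stop, oldest first."""
    row = table.get(device_id, {})
    lo, hi = ts_qualifier(start), ts_qualifier(stop)
    return [(int(q), row[q]) for q in sorted(row) if lo <= q < hi]
```

In a real store the qualifiers are already kept sorted, so the time-window read is a contiguous slice of one row rather than the full sort performed here.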
2. Sparse or Heterogeneous Data
Product catalogs, user profiles, event data—where attributes vary widely.
3. High Write Throughput and Horizontal Scale
LSM-tree storage and peer-to-peer replication (Cassandra) allow:
fast ingestion
linear scalability
high availability
Popular Systems
Google Bigtable
The original; powers Search, Gmail, Analytics
HBase
Bigtable-like, CP (strong consistency), runs on Hadoop
Cassandra
AP (high availability), scalable, used by Netflix/Apple
ScyllaDB
C++ reimplementation of Cassandra; designed for higher per-node throughput and lower latency
DynamoDB
Managed service; inspired by Dynamo + Bigtable ideas