OLTP/OLAP systems


Storage Layout: Row vs. Column

The physical orientation of data on disk determines which access patterns are "cheap" and which are "expensive."

Row-Oriented (Horizontal Partitioning)

  • Physical Layout: Fields of the same record are stored contiguously in a single page.

    • Example: [ID1, Name1, Age1], [ID2, Name2, Age2]

  • Best For: Point Lookups (Accessing a specific record by ID) and Heavy Writes (Appending a full record).

  • Advantages: * One I/O seek retrieves the entire record.

    • Simple indexing (Primary Keys point to a specific file offset).

  • Disadvantages: * "I/O Bloat": To calculate AVG(age), the DB must read Name and ID into memory even though they aren't needed.

    • Lower compression due to mixed data types in a single block.

Column-Oriented (Vertical Partitioning)

  • Physical Layout: Values of the same column are stored together across pages.

    • Example: Names: [Alice, Bob]; Ages: [30, 25]

  • Best For: Aggregations (SUM, AVG) and Broad Scans on a subset of columns.

  • Advantages:

    • High Compression: Storing same-type data allows for Run-Length Encoding (RLE) or Dictionary Encoding.

    • Vectorized Processing: Modern CPUs can process arrays of values (SIMD) much faster than mixed-type rows.

    • I/O Efficiency: Only the specific columns requested in the SELECT clause are read from disk.

  • Disadvantages:

    • Tuple Reconstruction: Combining columns to show a full record is computationally expensive.

    • Write Penalty: Inserting one record requires updating multiple separate column files/pages.


System Archetypes: OLTP vs. OLAP

These storage methods serve two fundamentally different business needs.

OLTP (Online Transaction Processing)

  • Storage: Row-oriented.

  • Workload: Thousands of small, fast updates/reads (e.g., "Change Order Status," "Get User Profile").

  • Schema: Normalized (3NF) to ensure data integrity and minimize redundancy.

  • Focus: Low latency and high concurrency (ACID is non-negotiable).

OLAP (Online Analytical Processing)

  • Storage: Column-oriented.

  • Workload: Few, massive queries scanning millions of rows (e.g., "Year-over-year revenue by region").

  • Schema: Denormalized (Star/Snowflake) to reduce the number of joins.

  • Focus: Throughput and complex query performance.


Data Modeling for Analytics

When moving data from OLTP to OLAP, we reshape it for easier querying.

  • Star Schema: A central Fact Table (quantitative metrics) surrounded by Dimension Tables (descriptive attributes). This is the gold standard for performance.

  • Snowflake Schema: A variation where dimensions are normalized. It saves space but increases join complexity.

  • HTAP (Hybrid Transactional/Analytical Processing): Modern systems (like TiDB or SingleStore) that maintain both a row-store and a column-store of the same data simultaneously, allowing "real-time analytics" without an ETL delay.


Summary Table for Quick Reference

Feature

OLTP / Row-Store

OLAP / Column-Store

Primary Goal

Execution of transactions

Business Intelligence / Insights

Data Update

Frequent, small writes

Batch loads, rare updates

Query Pattern

Simple (JOINs on PKs)

Complex (Aggregations, Window Functions)

Compression

Moderate

Exceptional (10x - 100x)

Bottleneck

Disk Latency / Contention

Disk I/O Bandwidth / CPU


Last updated