Types of data and their formats

General overview of data storage formats

A comparison table of data serialization formats from Wikipedia.

Article about advantages of Apache Avro for schema management: https://www.oreilly.com/content/the-problem-of-managing-schemas/

Comparison of schema evolution between Avro, Protobuf, and Thrift: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html


Understanding Data Formats: A Comprehensive Guide

When working with data, it's important to distinguish between encoding (how data is represented as bytes) and storage formats (how those bytes are organized on disk). Encoding is also called serialization or marshalling—the process of converting in-memory data structures into byte sequences. The reverse is decoding, deserialization, or unmarshalling.
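
For concreteness, here is a minimal Python sketch of that round trip using the standard json module; the record is purely illustrative:

```python
import json

# In-memory data structure (a Python dict)
record = {"id": 123, "name": "Alice", "active": True}

# Encoding / serialization: in-memory structure -> byte sequence
encoded = json.dumps(record).encode("utf-8")

# Decoding / deserialization: byte sequence -> in-memory structure
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record
```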

A Note on Terminology

The term "serialization" appears in two completely different contexts. In the data world, it means converting data structures to byte sequences. In the database world, it refers to transaction isolation guarantees. To avoid confusion, many prefer the terms "encoding" and "decoding" when discussing data formats.

Data Structure Classification

Data can be classified into three fundamental categories:

Unstructured Data:

  • No predefined schema or organization: images, videos, audio, PDFs, raw text

  • Requires ML/NLP to extract meaning

  • Stored as raw binary files in object storage (S3, blob storage)

Semi-Structured Data:

  • Has organizational structure but flexible schema: JSON, XML, CSV

  • Self-describing with field names embedded in data

  • Schema can vary between records

  • Human-readable and somewhat queryable

Structured Data:

  • Rigid, predefined schema with enforced types: relational databases, Parquet, Avro

  • All records conform to same structure

  • Strong typing enforced

  • Optimized for queries and analytics

Most data engineering pipelines follow an evolution: Semi-Structured (ingestion) → Structured (storage/analytics), trading flexibility for efficiency.

Human-Readable Formats (Semi-Structured)

Text-based formats can be opened in any text editor and are excellent for debugging and version control. They're larger and slower to parse than binary formats but much easier to work with.

  • JSON - ubiquitous for web APIs and configuration files, nested structures, flexible schema

  • XML - robust schema validation, hierarchical, common in enterprise systems

  • YAML - human-friendly with minimal syntax, popular for configuration

  • CSV - simple tabular data, implicit schema

Schema characteristics:

  • Schema-less or implicit—no enforced structure

  • Field names repeated in every record (storage overhead)

  • Data types are text strings requiring parsing: "123" could be string or number

  • No validation until read time

  • Maximum flexibility for evolving schemas

Compression:

  • Can be compressed with GZIP/Zstandard (typically 5x compression)

  • Text overhead remains: numbers and dates stored as text, not binary

  • Mixed data types prevent optimal compression

  • Common pattern: .json.gz, .csv.gz for temporary storage

Use cases:

  • API responses and requests

  • Configuration files

  • Ingestion/landing zones in data lakes

  • Debugging and development

  • Small datasets where simplicity matters
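
As a sketch of the landing-zone pattern above (.json.gz for temporary storage), the following snippet writes newline-delimited JSON compressed with GZIP and reads it back; the file name and records are illustrative:

```python
import gzip
import json

records = [{"id": i, "name": f"user_{i}"} for i in range(1000)]

# Write newline-delimited JSON, GZIP-compressed (the .json.gz pattern)
with gzip.open("events.json.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back; field names and types are only known once each line is parsed
with gzip.open("events.json.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```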

Binary Formats (Structured/Semi-Structured)

Binary formats are not human-readable, but they are compact and fast to parse. They require specialized tools or libraries to inspect and work with.

General Purpose Serialization

Protocol Buffers (Protobuf):

  • Structure: Structured (strongly typed, schema-enforced)

  • Orientation: Row-oriented (single messages)

  • Schema: External .proto files, field numbering for evolution

  • Use case: RPC/gRPC, microservices communication, event streaming (Kafka)

  • Strengths: Extremely compact, fast serialization, backward/forward compatibility

  • Limitations: Not optimized for analytical queries, no built-in indexing

  • Compression: Already compact; additional compression optional

  • Pipeline role: Transport and interchange—ingestion from applications, message passing

Apache Avro:

  • Structure: Structured (schema-enforced but with evolution support)

  • Orientation: Row-oriented

  • Schema: Embedded in file or external registry (JSON-defined)

  • Use case: Streaming pipelines (Kafka), evolving schemas, data serialization

  • Strengths: Excellent schema evolution (forward/backward compatible), compact, self-describing

  • Compression: Supports Snappy, GZIP, deflate; compresses well due to schema separation

  • Pipeline role: Streaming layer—handles schema changes gracefully over time
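
A minimal sketch of Avro's self-describing files, assuming the fastavro package is installed; the schema, records, and file name are illustrative:

```python
from fastavro import parse_schema, reader, writer

# Avro schemas are defined in JSON (here as a Python dict)
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# The schema is written into the file header, so the file is self-describing
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):  # schema is read back from the file itself
        print(record)
```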

MessagePack:

  • Structure: Semi-structured (flexible like JSON)

  • Orientation: Row-oriented

  • Schema: Schema-less (like JSON but binary)

  • Use case: Binary alternative to JSON for APIs, faster serialization

  • Strengths: Drop-in JSON replacement, faster and smaller

  • Compression: More compact than JSON; benefits from additional compression
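
A quick sketch of MessagePack as a drop-in binary alternative to JSON, assuming the msgpack package; the size difference shown is only indicative:

```python
import json

import msgpack  # pip install msgpack

record = {"id": 123, "name": "Alice", "scores": [98, 87, 91]}

packed = msgpack.packb(record)      # schema-less binary, like JSON but smaller
unpacked = msgpack.unpackb(packed)  # back to a plain Python dict

as_json = json.dumps(record).encode("utf-8")
print(len(packed), "bytes as MessagePack vs", len(as_json), "bytes as JSON")
```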

BSON:

  • Structure: Semi-structured (JSON-like with types)

  • Orientation: Row-oriented (documents)

  • Schema: Schema-less but type-aware

  • Use case: MongoDB, document databases

  • Strengths: Additional data types (dates, binary), traversable without parsing entire document

  • Compression: Less compact than MessagePack; used for queryability not storage efficiency

Columnar Formats (Analytics-Optimized)

Apache Parquet:

  • Structure: Structured (strict schema, strong typing)

  • Orientation: Columnar—stores each column's values together

  • Schema: Embedded in file metadata with column statistics (min/max, null counts)

  • Use case: Data lakes, analytical storage, Spark/Hive workloads, long-term storage

  • Strengths:

    • Excellent compression (10-100x): columnar layout + encoding (dictionary, RLE)

    • Query optimization: column pruning, predicate pushdown, statistics-based skipping

    • Only read columns you need

    • Industry standard for analytics

  • Compression: Snappy (fast), GZIP (high ratio), LZ4, Zstandard

  • Access pattern: Read few columns from many rows (analytical queries)

  • Pipeline role: Storage layer—optimized for fast analytical queries
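
A short sketch of writing and reading Parquet with pyarrow, showing column pruning and filter pushdown; the table contents and file name are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "department": ["eng", "sales", "eng"],
    "salary": [100_000, 80_000, 95_000],
})

# Columnar storage with Snappy compression
pq.write_table(table, "employees.parquet", compression="snappy")

# Column pruning: only the requested columns are read from disk
subset = pq.read_table("employees.parquet", columns=["department", "salary"])

# Predicate pushdown: row groups can be skipped using stored min/max statistics
filtered = pq.read_table("employees.parquet", filters=[("salary", ">", 90_000)])
```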

ORC (Optimized Row Columnar):

  • Structure: Structured (strict schema)

  • Orientation: Columnar with lightweight indexes

  • Schema: Embedded with rich type system

  • Use case: Hive-heavy environments, maximum compression needs

  • Strengths: Similar to Parquet but often better compression, built-in bloom filters

  • Compression: ZLIB, Snappy, LZ4, Zstandard

  • Pipeline role: Alternative to Parquet, especially in Hadoop ecosystems

Apache Arrow:

  • Structure: Structured (typed columnar)

  • Orientation: Columnar in-memory format

  • Schema: Standardized memory layout

  • Use case: Zero-copy data sharing between processes (Pandas, Spark, databases)

  • Strengths:

    • No serialization overhead between compatible systems

    • Fast in-memory analytics

    • Language-agnostic standard

  • Compression: Not primarily a storage format, but supports compression in flight

  • Pipeline role: In-memory interchange—fast data transfer between tools without conversion
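
A minimal sketch of Arrow as the in-memory interchange layer, assuming pandas and pyarrow; the DataFrame is illustrative:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})

# pandas -> Arrow: a standardized, language-agnostic columnar memory layout
table = pa.Table.from_pandas(df)
print(table.schema)

# Arrow-aware engines (Spark, DuckDB, Polars, ...) can consume this table
# without re-serializing it; converting back to pandas is also cheap
round_tripped = table.to_pandas()
```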

Specialized Binary Formats

Media formats:

  • Images: JPEG (lossy), PNG (lossless), WebP

  • Audio: MP3/AAC (lossy), FLAC (lossless)

  • Video: MP4, MKV containers with H.264/H.265 codecs (lossy)

Archives:

  • ZIP, TAR, GZIP, 7z: Container formats with compression

Row-Oriented vs. Column-Oriented: The Critical Distinction

This is the fundamental divide in data engineering formats:

Row-Oriented (JSON, CSV, Avro, Protobuf):

  • Store complete records together: [id, name, age, salary], [id, name, age, salary], ...

  • Best for: Transactional workloads (OLTP), writing individual records, reading entire rows

  • Use case: Application logs, event streams, APIs where you need all fields

  • Example: Insert user record, stream events from Kafka

Column-Oriented (Parquet, ORC, Arrow):

  • Store each column separately: [all ids], [all names], [all ages], [all salaries]

  • Best for: Analytical workloads (OLAP), aggregations, reading subset of columns

  • Use case: Data warehouses, analytics on wide tables

  • Example: "Calculate average salary by department"—touches 2 of 50 columns

Performance impact:

  • Scenario: 100 columns, 1 billion rows, query needs 2 columns

  • Row format: Read all 100 columns × 1B rows, discard 98%

  • Column format: Read only 2 columns × 1B rows = 50x less I/O
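
The difference is easiest to see in plain Python, where a row layout is a list of dicts and a column layout is a dict of lists (a toy sketch, not a real storage engine):

```python
# Row-oriented: each record kept together (like CSV, JSON lines, Avro)
rows = [
    {"id": 1, "name": "Alice", "age": 34, "salary": 100_000},
    {"id": 2, "name": "Bob", "age": 29, "salary": 80_000},
]

# Column-oriented: each column's values stored together (like Parquet, ORC)
columns = {
    "id": [1, 2],
    "name": ["Alice", "Bob"],
    "age": [34, 29],
    "salary": [100_000, 80_000],
}

# An aggregation touches only the columns it needs
avg_salary = sum(columns["salary"]) / len(columns["salary"])
```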

The Role of Schema

Schema is the foundation of efficiency differences between formats:

Schema-less/Implicit (JSON, CSV):

  • Field names repeated in every record: {"name":"Alice"} millions of times

  • Numbers stored as text: "12345" is 5 bytes instead of 4

  • No type enforcement or validation until read time

  • Maximum flexibility but storage/query overhead

Schema-enforced (Parquet, Avro, Protobuf):

  • Schema stored once (file header or registry)

  • Field names replaced by positions/IDs

  • Binary type encoding: integers as 4 bytes, not text

  • Query engines use schema for optimization (column pruning, predicate pushdown)

  • Validation at write time

Schema evolution:

  • Avro: Forward/backward compatible, fields can be added/removed safely

  • Protobuf: Field numbering allows deprecation without breaking compatibility

  • Parquet: Can add columns without rewriting existing data

  • JSON/CSV: No formal evolution, just handle inconsistencies
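
A small sketch of that evolution in Avro terms, assuming fastavro: a newer reader schema adds a defaulted field and still decodes a file written with the older schema.

```python
from fastavro import parse_schema, reader, writer

writer_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"}],
})

# The reader's schema adds an optional field with a default,
# so files written with the old schema still decode (backward compatible)
reader_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

with open("users_v1.avro", "wb") as out:
    writer(out, writer_schema, [{"id": 1}])

with open("users_v1.avro", "rb") as fo:
    for record in reader(fo, reader_schema=reader_schema):
        print(record)  # {'id': 1, 'email': None}
```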

Compression Deep Dive

Compression reduces storage, speeds up I/O, and lowers network costs. The choice depends on your access patterns.

Lossless Compression

Preserves every bit of original data:

  • GZIP - excellent compression ratios (5-10x), slower, standard for text

  • Snappy - moderate compression (2-4x), very fast, popular in Parquet

  • LZ4 - lower compression (2-3x), extremely fast decompression

  • Zstandard (zstd) - modern algorithm balancing speed and ratio, increasingly popular

When to use:

  • High ratio (GZIP, zstd): Compress once, read many times (archival, cold storage)

  • Fast (Snappy, LZ4): Frequent writes and reads (hot data, real-time pipelines)

Columnar advantage:

  • Similar data compresses better: column of dates compresses far better than mixed row data

  • Dictionary encoding: 1M records with 100 unique names → store mapping once + indices

  • Run-length encoding: repeated values stored as count + value
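
A toy illustration of dictionary and run-length encoding in plain Python (real columnar writers implement these natively and far more efficiently):

```python
# Dictionary encoding: store each distinct value once, then small integer indices
names = ["Alice", "Bob", "Alice", "Alice", "Bob"]
dictionary = sorted(set(names))                 # ['Alice', 'Bob']
indices = [dictionary.index(n) for n in names]  # [0, 1, 0, 0, 1]

# Run-length encoding: repeated values stored as [value, count] pairs
years = [2024, 2024, 2024, 2025, 2025]
rle = []
for y in years:
    if rle and rle[-1][0] == y:
        rle[-1][1] += 1
    else:
        rle.append([y, 1])
# rle == [[2024, 3], [2025, 2]]
```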

Lossy Compression

Discards less important information for dramatic size reduction:

  • JPEG - removes visual details humans can't perceive

  • MP3/AAC - removes inaudible frequencies

  • H.264/H.265 - video codecs with temporal/spatial compression

Only acceptable for media formats where some quality loss is tolerable.

Compression Comparison

For 1 million records with 10 columns:

  • CSV uncompressed: ~500 MB

  • CSV + GZIP: ~100 MB (5x)

  • JSON uncompressed: ~800 MB

  • JSON + GZIP: ~150 MB (5x)

  • Parquet + Snappy: ~50 MB (10x)

  • Parquet + GZIP: ~30 MB (16x)

Why Parquet compresses better:

  1. Binary encoding (no text overhead)

  2. Columnar layout (similar data together)

  3. Schema separation (no repeated field names)

  4. Specialized encodings (dictionary, RLE)
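
To check ratios on your own data, a rough measurement sketch with pandas (Parquet output requires pyarrow); the actual numbers vary widely with the data:

```python
import os

import pandas as pd

# A small synthetic table; real-world ratios depend heavily on the data
df = pd.DataFrame({
    "id": range(100_000),
    "category": ["a", "b", "c", "d"] * 25_000,
    "value": [x * 0.5 for x in range(100_000)],
})

df.to_csv("data.csv", index=False)
df.to_csv("data.csv.gz", index=False, compression="gzip")
df.to_json("data.json", orient="records", lines=True)
df.to_parquet("data.parquet", compression="snappy")

for path in ["data.csv", "data.csv.gz", "data.json", "data.parquet"]:
    print(path, os.path.getsize(path), "bytes")
```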

Format Selection in Data Engineering Pipelines

Typical Pipeline Flow

Ingestion (JSON, CSV, Avro) → Streaming (Avro, Protobuf) → Processing (Arrow in-memory) → Storage (Parquet, ORC) → Analytics (SQL on Parquet/ORC)

Decision Framework

Ingestion Layer (Landing Zone):

  • Use: JSON, CSV, Avro

  • Why: Flexibility for diverse sources, easy ingestion, schema evolution

  • Compression: GZIP for temporary storage

  • Pattern: Accept data as-is, preserve raw format

Streaming Layer:

  • Use: Avro, Protobuf

  • Why: Schema enforcement, compact, handles evolution, fast serialization

  • Platform: Kafka, Pub/Sub with schema registry

  • Pattern: Typed messages with versioned schemas

Processing Layer:

  • Use: Arrow in-memory

  • Why: Zero-copy between Spark/Pandas/Dask, fast transformations

  • Pattern: Minimize serialization overhead during multi-step processing

Storage Layer (Data Lake/Warehouse):

  • Use: Parquet, ORC

  • Why: Columnar compression, query optimization, long-term efficiency

  • Compression: Snappy for hot data, GZIP for cold storage

  • Pattern: Partitioned by date/category for query pruning
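
A sketch of the partitioning pattern with pyarrow; the lake/ directory and columns are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "category": ["click", "view", "click"],
    "value": [1, 2, 3],
})

# Hive-style partitioning: one directory per date, e.g. lake/event_date=2024-01-01/
# Queries filtered on event_date only touch the matching directories
pq.write_to_dataset(table, root_path="lake", partition_cols=["event_date"])
```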

Analytics Layer:

  • Query: SQL on Parquet/ORC

  • Benefits: Column pruning, predicate pushdown, statistics-based skipping

  • Performance: 50-100x faster than parsing JSON
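
A sketch of querying Parquet directly with SQL, assuming DuckDB and the hypothetical lake/ layout from the previous sketch:

```python
import duckdb  # pip install duckdb

# DuckDB reads Parquet in place, pruning columns and skipping row groups
# via the embedded statistics
result = duckdb.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM 'lake/**/*.parquet'
    GROUP BY category
""").fetchall()
print(result)
```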

Specific Use Cases

Real-time event streaming:

  • Protobuf: Compact messages for transport

  • Parquet: Analytics-optimized storage

Batch ETL with schema evolution:

  • JSON: Flexible ingestion

  • Avro: Handle schema changes during transformation

  • Parquet: Final analytical storage

Multi-tool analytics:

  • Parquet: Efficient storage

  • Arrow: Zero-copy sharing between languages

Long-term archival:

  • Avro: Self-describing, schema evolution support

  • GZIP: Maximum compression for rarely accessed data

Key Takeaways

  1. Structure matters: Semi-structured (JSON) for flexibility, structured (Parquet) for efficiency

  2. Orientation matters: Row-oriented for transactions, column-oriented for analytics

  3. Schema is critical: Enables compression, validation, and query optimization

  4. Compression strategy: Balance ratio vs. speed based on access patterns

  5. Pipeline evolution: Semi-structured → Structured as data moves from ingestion to analytics

  6. Specialized formats exist for a reason: Parquet/ORC for analytics, Avro/Protobuf for streaming, Arrow for in-memory

The "right" format depends on your specific use case: access patterns, query types, data volume, schema stability, and position in the pipeline. Modern data platforms often use multiple formats, each optimized for its role in the overall architecture.
