Types of data and their formats
General overview of data storage formats
A comparison table of data serialization formats from Wikipedia.
Article about advantages of Apache Avro for schema management: https://www.oreilly.com/content/the-problem-of-managing-schemas/
Comparison of schema evolution between Avro, Protobuf, and Thrift: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Understanding Data Formats: A Comprehensive Guide
When working with data, it's important to distinguish between encoding (how data is represented as bytes) and storage formats (how those bytes are organized on disk). Encoding is also called serialization or marshalling—the process of converting in-memory data structures into byte sequences. The reverse is decoding, deserialization, or unmarshalling.
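As a minimal sketch of encoding and decoding in Python, using only the standard json module (the record contents here are just illustrative):

```python
import json

# In-memory data structure (a Python dict)
user = {"id": 123, "name": "Alice", "active": True}

# Encoding (serialization): in-memory structure -> byte sequence
encoded: bytes = json.dumps(user).encode("utf-8")
print(encoded)  # b'{"id": 123, "name": "Alice", "active": true}'

# Decoding (deserialization): byte sequence -> in-memory structure
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == user
```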
A Note on Terminology
The term "serialization" appears in two completely different contexts. In the data world, it means converting data structures to byte sequences. In the database world, it refers to transaction isolation guarantees. To avoid confusion, many prefer the terms "encoding" and "decoding" when discussing data formats.
Data Structure Classification
Data can be classified into three fundamental categories:
Unstructured Data:
No predefined schema or organization: images, videos, audio, PDFs, raw text
Typically requires ML/NLP techniques to extract meaning
Stored as raw binary files in object storage (S3, blob storage)
Semi-Structured Data:
Has organizational structure but flexible schema: JSON, XML, CSV
Self-describing with field names embedded in data
Schema can vary between records
Human-readable and somewhat queryable
Structured Data:
Rigid, predefined schema with enforced types: relational databases, Parquet, Avro
All records conform to same structure
Strong typing enforced
Optimized for queries and analytics
Most data engineering pipelines follow an evolution: Semi-Structured (ingestion) → Structured (storage/analytics), trading flexibility for efficiency.
Human-Readable Formats (Semi-Structured)
Text-based formats that can be opened in any text editor, excellent for debugging and version control. They're larger and slower to parse than binary formats but much easier to work with.
JSON - ubiquitous for web APIs and configuration files, nested structures, flexible schema
XML - robust schema validation, hierarchical, common in enterprise systems
YAML - human-friendly with minimal syntax, popular for configuration
CSV - simple tabular data, implicit schema
Schema characteristics:
Schema-less or implicit—no enforced structure
Field names repeated in every record (storage overhead)
Data types are text strings requiring parsing: "123" could be a string or a number
No validation until read time
Maximum flexibility for evolving schemas
Compression:
Can be compressed with GZIP/Zstandard (typically 5x compression)
Text overhead remains: numbers and dates stored as text, not binary
Mixed data types prevent optimal compression
Common pattern:
.json.gz or .csv.gz for temporary storage
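A small sketch of that .json.gz pattern (newline-delimited JSON), using only the standard gzip and json modules; the file name and event fields are hypothetical:

```python
import gzip
import json

records = [
    {"event": "click", "user_id": 1, "ts": "2024-01-01T12:00:00Z"},
    {"event": "view", "user_id": 2, "ts": "2024-01-01T12:00:05Z"},
]

# Write newline-delimited JSON, compressed with GZIP (.json.gz)
with gzip.open("events.json.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back for debugging or downstream processing
with gzip.open("events.json.gz", "rt", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]
```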
Use cases:
API responses and requests
Configuration files
Ingestion/landing zones in data lakes
Debugging and development
Small datasets where simplicity matters
Binary Formats (Structured/Semi-Structured)
Not human-readable but compact and fast to parse. Require specialized tools or libraries to inspect and work with.
General Purpose Serialization
Protocol Buffers (Protobuf):
Structure: Structured (strongly typed, schema-enforced)
Orientation: Row-oriented (single messages)
Schema: External .proto files, field numbering for evolution
Use case: RPC/gRPC, microservices communication, event streaming (Kafka)
Strengths: Extremely compact, fast serialization, backward/forward compatibility
Limitations: Not optimized for analytical queries, no built-in indexing
Compression: Already compact; additional compression optional
Pipeline role: Transport and interchange—ingestion from applications, message passing
Apache Avro:
Structure: Structured (schema-enforced but with evolution support)
Orientation: Row-oriented
Schema: Embedded in file or external registry (JSON-defined)
Use case: Streaming pipelines (Kafka), evolving schemas, data serialization
Strengths: Excellent schema evolution (forward/backward compatible), compact, self-describing
Compression: Supports Snappy, GZIP, deflate; compresses well due to schema separation
Pipeline role: Streaming layer—handles schema changes gracefully over time
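To make the Avro points above concrete, here is a minimal sketch assuming the third-party fastavro library; the schema and records are hypothetical:

```python
from fastavro import parse_schema, reader, writer

# Avro schemas are defined in JSON; this one describes a hypothetical user record
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # A default value makes this field safe to add or remove later (schema evolution)
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"id": 1, "name": "Alice", "email": None},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Write an Avro container file: the schema is embedded once, rows are stored compactly
with open("users.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Read back: the embedded schema makes the file self-describing
with open("users.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```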
MessagePack:
Structure: Semi-structured (flexible like JSON)
Orientation: Row-oriented
Schema: Schema-less (like JSON but binary)
Use case: Binary alternative to JSON for APIs, faster serialization
Strengths: Drop-in JSON replacement, faster and smaller
Compression: More compact than JSON; benefits from additional compression
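A minimal sketch of the "drop-in JSON replacement" idea, assuming the third-party msgpack package is installed:

```python
import json
import msgpack

payload = {"id": 123, "name": "Alice", "scores": [9.5, 8.0, 7.25]}

as_json = json.dumps(payload).encode("utf-8")
as_msgpack = msgpack.packb(payload)      # encode to a binary byte string
restored = msgpack.unpackb(as_msgpack)   # decode back to a dict

print(len(as_json), len(as_msgpack))     # the MessagePack payload is typically smaller
assert restored == payload
```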
BSON:
Structure: Semi-structured (JSON-like with types)
Orientation: Row-oriented (documents)
Schema: Schema-less but type-aware
Use case: MongoDB, document databases
Strengths: Additional data types (dates, binary), traversable without parsing entire document
Compression: Less compact than MessagePack; used for queryability, not storage efficiency
Columnar Formats (Analytics-Optimized)
Apache Parquet:
Structure: Structured (strict schema, strong typing)
Orientation: Columnar—stores each column's values together
Schema: Embedded in file metadata with column statistics (min/max, null counts)
Use case: Data lakes, analytical storage, Spark/Hive workloads, long-term storage
Strengths:
Excellent compression (often 10x or more versus raw text): columnar layout + encodings (dictionary, RLE)
Query optimization: column pruning, predicate pushdown, statistics-based skipping
Only read columns you need
Industry standard for analytics
Compression: Snappy (fast), GZIP (high ratio), LZ4, Zstandard
Access pattern: Read few columns from many rows (analytical queries)
Pipeline role: Storage layer—optimized for fast analytical queries
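A short sketch of writing Parquet and reading only the needed columns, assuming pandas and pyarrow are installed; the table and column names are hypothetical:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "id": range(1000),
    "department": ["eng", "sales"] * 500,
    "salary": [50000 + i for i in range(1000)],
})

# Write columnar, compressed storage; schema and column statistics go into the file footer
table = pa.Table.from_pandas(df)
pq.write_table(table, "employees.parquet", compression="snappy")

# Column pruning: read only the columns the query needs
subset = pq.read_table("employees.parquet", columns=["department", "salary"])
print(subset.to_pandas().groupby("department")["salary"].mean())
```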
ORC (Optimized Row Columnar):
Structure: Structured (strict schema)
Orientation: Columnar with lightweight indexes
Schema: Embedded with rich type system
Use case: Hive-heavy environments, maximum compression needs
Strengths: Similar to Parquet but often better compression, built-in bloom filters
Compression: ZLIB, Snappy, LZ4, Zstandard
Pipeline role: Alternative to Parquet, especially in Hadoop ecosystems
Apache Arrow:
Structure: Structured (typed columnar)
Orientation: Columnar in-memory format
Schema: Typed schema over a standardized, language-independent memory layout
Use case: Zero-copy data sharing between processes (Pandas, Spark, databases)
Strengths:
No serialization overhead between compatible systems
Fast in-memory analytics
Language-agnostic standard
Compression: Not primarily a storage format, but supports compression in flight
Pipeline role: In-memory interchange—fast data transfer between tools without conversion
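A minimal sketch of Arrow as an in-memory interchange layer, assuming pyarrow and pandas; the data and file name are hypothetical:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Convert to an Arrow table: a typed, columnar, language-agnostic in-memory layout
table = pa.Table.from_pandas(df)
print(table.schema)

# Hand the same columnar buffers back to pandas; for many types this avoids copies
round_tripped = table.to_pandas()

# Arrow IPC (Feather v2) ships the same layout between processes with minimal overhead
feather.write_feather(table, "interchange.arrow")
reloaded = feather.read_table("interchange.arrow")
```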
Specialized Binary Formats
Media formats:
Images: JPEG (lossy), PNG (lossless), WebP
Audio: MP3/AAC (lossy), FLAC (lossless)
Video: MP4, MKV containers with H.264/H.265 codecs (lossy)
Archives:
ZIP, TAR, GZIP, 7z: Container formats with compression
Row-Oriented vs. Column-Oriented: The Critical Distinction
This is the fundamental divide in data engineering formats:
Row-Oriented (JSON, CSV, Avro, Protobuf):
Store complete records together: [id, name, age, salary], [id, name, age, salary], ...
Best for: Transactional workloads (OLTP), writing individual records, reading entire rows
Use case: Application logs, event streams, APIs where you need all fields
Example: Insert user record, stream events from Kafka
Column-Oriented (Parquet, ORC, Arrow):
Store each column separately: [all ids], [all names], [all ages], [all salaries]
Best for: Analytical workloads (OLAP), aggregations, reading a subset of columns
Use case: Data warehouses, analytics on wide tables
Example: "Calculate average salary by department"—touches 2 of 50 columns
Performance impact:
Scenario: 100 columns, 1 billion rows, query needs 2 columns
Row format: Read all 100 columns × 1B rows, discard 98%
Column format: Read only 2 columns × 1B rows = 50x less I/O
The Role of Schema
Schema is the foundation of efficiency differences between formats:
Schema-less/Implicit (JSON, CSV):
Field names repeated in every record: {"name":"Alice"} millions of times
Numbers stored as text: "12345" is 5 bytes instead of 4
No type enforcement or validation until read time
Maximum flexibility but storage/query overhead
Schema-enforced (Parquet, Avro, Protobuf):
Schema stored once (file header or registry)
Field names replaced by positions/IDs
Binary type encoding: integers as 4 bytes, not text
Query engines use schema for optimization (column pruning, predicate pushdown)
Validation at write time
Schema evolution:
Avro: Forward/backward compatible, fields can be added/removed safely
Protobuf: Field numbering allows deprecation without breaking compatibility
Parquet: Can add columns without rewriting existing data
JSON/CSV: No formal evolution; consumers must handle inconsistencies themselves
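As an illustration of the Avro evolution point above, here is a hedged sketch using the third-party fastavro library: data written with an old schema is read with a newer reader schema that adds a defaulted field (both schemas are hypothetical):

```python
from fastavro import parse_schema, reader, writer

# Writer (v1) schema: the schema the data was originally written with
writer_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
})

# Reader (v2) schema: adds an optional field with a default, a compatible change
reader_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"},
               {"name": "email", "type": ["null", "string"], "default": None}],
})

with open("users_v1.avro", "wb") as out:
    writer(out, writer_schema, [{"id": 1, "name": "Alice"}])

# Old data is resolved against the new schema; the missing field gets its default
with open("users_v1.avro", "rb") as fo:
    for rec in reader(fo, reader_schema):
        print(rec)  # {'id': 1, 'name': 'Alice', 'email': None}
```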
Compression Deep Dive
Compression reduces storage, speeds up I/O, and lowers network costs. The choice depends on your access patterns.
Lossless Compression
Preserves every bit of original data:
GZIP - excellent compression ratios (5-10x), slower, standard for text
Snappy - moderate compression (2-4x), very fast, popular in Parquet
LZ4 - lower compression (2-3x), extremely fast decompression
Zstandard (zstd) - modern algorithm balancing speed and ratio, increasingly popular
When to use:
High ratio (GZIP, zstd): Compress once, read many times (archival, cold storage)
Fast (Snappy, LZ4): Frequent writes and reads (hot data, real-time pipelines)
Columnar advantage:
Similar data compresses better: column of dates compresses far better than mixed row data
Dictionary encoding: 1M records with 100 unique names → store mapping once + indices
Run-length encoding: repeated values stored as count + value
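A tiny sketch of the dictionary-encoding point above, assuming pyarrow; the sizes printed are in-memory buffer sizes rather than on-disk sizes, so treat them as a rough illustration:

```python
import pyarrow as pa

# 1 million values drawn from only 100 unique names
names = [f"name_{i % 100}" for i in range(1_000_000)]

plain = pa.array(names)                   # every string stored in full
dict_encoded = plain.dictionary_encode()  # 100 unique strings + small integer indices

print("plain:     ", plain.nbytes, "bytes")
print("dictionary:", dict_encoded.nbytes, "bytes")  # typically much smaller
```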
Lossy Compression
Discards less important information for dramatic size reduction:
JPEG - removes visual details humans can't perceive
MP3/AAC - removes inaudible frequencies
H.264/H.265 - video codecs with temporal/spatial compression
Only acceptable for media formats where some quality loss is tolerable.
Compression Comparison
For 1 million records with 10 columns (illustrative sizes; actual results depend on the data):
CSV uncompressed: ~500 MB
CSV + GZIP: ~100 MB (5x)
JSON uncompressed: ~800 MB
JSON + GZIP: ~150 MB (5x)
Parquet + Snappy: ~50 MB (10x)
Parquet + GZIP: ~30 MB (16x)
Why Parquet compresses better:
Binary encoding (no text overhead)
Columnar layout (similar data together)
Schema separation (no repeated field names)
Specialized encodings (dictionary, RLE)
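The sizes above are illustrative; a sketch like the following (assuming pandas, numpy, and pyarrow are installed, with synthetic data) can be used to measure real ratios on your own data:

```python
import os
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n),
    "category": rng.choice(["a", "b", "c", "d"], size=n),
    "value": rng.normal(size=n),
    "date": pd.Timestamp("2024-01-01")
            + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
})

df.to_csv("data.csv", index=False)
df.to_csv("data.csv.gz", index=False, compression="gzip")
df.to_json("data.json", orient="records", lines=True)
df.to_parquet("data_snappy.parquet", compression="snappy")
df.to_parquet("data_gzip.parquet", compression="gzip")

for path in ["data.csv", "data.csv.gz", "data.json",
             "data_snappy.parquet", "data_gzip.parquet"]:
    print(path, round(os.path.getsize(path) / 1_000_000, 1), "MB")
```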
Format Selection in Data Engineering Pipelines
Typical Pipeline Flow
Sources (JSON/CSV/APIs) → Streaming (Avro/Protobuf) → Processing (Arrow in-memory) → Storage (Parquet/ORC) → Analytics (SQL on columnar files)
Decision Framework
Ingestion Layer (Landing Zone):
Use: JSON, CSV, Avro
Why: Flexibility for diverse sources, easy ingestion, schema evolution
Compression: GZIP for temporary storage
Pattern: Accept data as-is, preserve raw format
Streaming Layer:
Use: Avro, Protobuf
Why: Schema enforcement, compact, handles evolution, fast serialization
Platform: Kafka, Pub/Sub with schema registry
Pattern: Typed messages with versioned schemas
Processing Layer:
Use: Arrow in-memory
Why: Zero-copy between Spark/Pandas/Dask, fast transformations
Pattern: Minimize serialization overhead during multi-step processing
Storage Layer (Data Lake/Warehouse):
Use: Parquet, ORC
Why: Columnar compression, query optimization, long-term efficiency
Compression: Snappy for hot data, GZIP for cold storage
Pattern: Partitioned by date/category for query pruning
Analytics Layer:
Query: SQL on Parquet/ORC
Benefits: Column pruning, predicate pushdown, statistics-based skipping
Performance: Typically far faster (often 50-100x) than parsing raw JSON for analytical scans
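A short sketch combining the storage- and analytics-layer points above, assuming pyarrow; the partition column, paths, and filter values are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Storage layer: write a dataset partitioned by date for query pruning
pq.write_to_dataset(table, root_path="events_lake", partition_cols=["event_date"])

# Analytics layer: column pruning plus partition/predicate filtering skip irrelevant data
result = pq.read_table(
    "events_lake",
    columns=["user_id", "amount"],
    filters=[("event_date", "=", "2024-01-01")],
)
print(result.to_pandas())
```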
Specific Use Cases
Real-time event streaming:
Protobuf: Compact messages for transport
Parquet: Analytics-optimized storage
Batch ETL with schema evolution:
JSON: Flexible ingestion
Avro: Handle schema changes during transformation
Parquet: Final analytical storage
Multi-tool analytics:
Parquet: Efficient storage
Arrow: Zero-copy sharing between languages
Long-term archival:
Avro: Self-describing, schema evolution support
GZIP: Maximum compression for rarely accessed data
Key Takeaways
Structure matters: Semi-structured (JSON) for flexibility, structured (Parquet) for efficiency
Orientation matters: Row-oriented for transactions, column-oriented for analytics
Schema is critical: Enables compression, validation, and query optimization
Compression strategy: Balance ratio vs. speed based on access patterns
Pipeline evolution: Semi-structured → Structured as data moves from ingestion to analytics
Specialized formats exist for a reason: Parquet/ORC for analytics, Avro/Protobuf for streaming, Arrow for in-memory
The "right" format depends on your specific use case: access patterns, query types, data volume, schema stability, and position in the pipeline. Modern data platforms often use multiple formats, each optimized for its role in the overall architecture.