Types of data and their formats

General overview of data storage formats

A comparison table of data serialization formats from Wikipedia.

Article about advantages of Apache Avro for schema management: https://www.oreilly.com/content/the-problem-of-managing-schemas/

Comparison of schema evolution between Avro, Protobuf, and Thrift: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html


Understanding Data Formats: A Comprehensive Guide

When working with data, it's important to distinguish between encoding (how data is represented as bytes) and storage formats (how those bytes are organized on disk). Encoding is also called serialization or marshalling—the process of converting in-memory data structures into byte sequences. The reverse is decoding, deserialization, or unmarshalling.
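
For concreteness, here is a minimal Python sketch of that round trip using the standard json module; the record is purely illustrative:

```python
import json

# In-memory data structure (a Python dict)
record = {"id": 123, "name": "Alice", "active": True}

# Encoding / serialization: in-memory structure -> byte sequence
encoded = json.dumps(record).encode("utf-8")

# Decoding / deserialization: byte sequence -> in-memory structure
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record
```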

A Note on Terminology

The term "serialization" appears in two completely different contexts. In the data world, it means converting data structures to byte sequences. In the database world, it refers to transaction isolation guarantees. To avoid confusion, many prefer the terms "encoding" and "decoding" when discussing data formats.

Data Structure Classification

Data can be classified into three fundamental categories:

Unstructured Data:

  • No predefined schema or organization: images, videos, audio, PDFs, raw text

  • Requires ML/NLP to extract meaning

  • Stored as raw binary files in object storage (S3, blob storage)

Semi-Structured Data:

  • Has organizational structure but flexible schema: JSON, XML, CSV

  • Self-describing with field names embedded in data

  • Schema can vary between records

  • Human-readable and somewhat queryable

Structured Data:

  • Rigid, predefined schema with enforced types: relational databases, Parquet, Avro

  • All records conform to same structure

  • Strong typing enforced

  • Optimized for queries and analytics

Most data engineering pipelines follow an evolution: Semi-Structured (ingestion) → Structured (storage/analytics), trading flexibility for efficiency.

Human-Readable Formats (Semi-Structured)

Text-based formats can be opened in any text editor and are excellent for debugging and version control. They're larger and slower to parse than binary formats but much easier to work with.

  • JSON - ubiquitous for web APIs and configuration files, nested structures, flexible schema

  • XML - robust schema validation, hierarchical, common in enterprise systems

  • YAML - human-friendly with minimal syntax, popular for configuration

  • CSV - simple tabular data, implicit schema

Schema characteristics:

  • Schema-less or implicit—no enforced structure

  • Field names repeated in every record (storage overhead)

  • Data types are text strings requiring parsing: "123" could be string or number

  • No validation until read time

  • Maximum flexibility for evolving schemas

Compression:

  • Can be compressed with GZIP/Zstandard (typically 5x compression)

  • Text overhead remains: numbers and dates stored as text, not binary

  • Mixed data types prevent optimal compression

  • Common pattern: .json.gz, .csv.gz for temporary storage

Use cases:

  • API responses and requests

  • Configuration files

  • Ingestion/landing zones in data lakes

  • Debugging and development

  • Small datasets where simplicity matters
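
As a sketch of the landing-zone pattern above (.json.gz for temporary storage), the following snippet writes newline-delimited JSON compressed with GZIP and reads it back; the file name and records are illustrative:

```python
import gzip
import json

records = [{"id": i, "name": f"user_{i}"} for i in range(1000)]

# Write newline-delimited JSON, GZIP-compressed (the .json.gz pattern)
with gzip.open("events.json.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back; field names and types are only known once each line is parsed
with gzip.open("events.json.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```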

Binary Formats (Structured/Semi-Structured)

Binary formats are not human-readable, but they are compact and fast to parse. They require specialized tools or libraries to inspect and work with.

General Purpose Serialization

Protocol Buffers (Protobuf):

  • Structure: Structured (strongly typed, schema-enforced)

  • Orientation: Row-oriented (single messages)

  • Schema: External .proto files, field numbering for evolution

  • Use case: RPC/gRPC, microservices communication, event streaming (Kafka)

  • Strengths: Extremely compact, fast serialization, backward/forward compatibility

  • Limitations: Not optimized for analytical queries, no built-in indexing

  • Compression: Already compact; additional compression optional

  • Pipeline role: Transport and interchange—ingestion from applications, message passing

Apache Avro:

  • Structure: Structured (schema-enforced but with evolution support)

  • Orientation: Row-oriented

  • Schema: Embedded in file or external registry (JSON-defined)

  • Use case: Streaming pipelines (Kafka), evolving schemas, data serialization

  • Strengths: Excellent schema evolution (forward/backward compatible), compact, self-describing

  • Compression: Supports Snappy, GZIP, deflate; compresses well due to schema separation

  • Pipeline role: Streaming layer—handles schema changes gracefully over time
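
A minimal sketch of Avro's self-describing files, assuming the fastavro package is installed; the schema, records, and file name are illustrative:

```python
from fastavro import parse_schema, reader, writer

# Avro schemas are defined in JSON (here as a Python dict)
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# The schema is written into the file header, so the file is self-describing
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):  # schema is read back from the file itself
        print(record)
```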

MessagePack:

  • Structure: Semi-structured (flexible like JSON)

  • Orientation: Row-oriented

  • Schema: Schema-less (like JSON but binary)

  • Use case: Binary alternative to JSON for APIs, faster serialization

  • Strengths: Drop-in JSON replacement, faster and smaller

  • Compression: More compact than JSON; benefits from additional compression
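
A quick sketch of MessagePack as a drop-in binary alternative to JSON, assuming the msgpack package; the size difference shown is only indicative:

```python
import json

import msgpack  # pip install msgpack

record = {"id": 123, "name": "Alice", "scores": [98, 87, 91]}

packed = msgpack.packb(record)      # schema-less binary, like JSON but smaller
unpacked = msgpack.unpackb(packed)  # back to a plain Python dict

as_json = json.dumps(record).encode("utf-8")
print(len(packed), "bytes as MessagePack vs", len(as_json), "bytes as JSON")
```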

BSON:

  • Structure: Semi-structured (JSON-like with types)

  • Orientation: Row-oriented (documents)

  • Schema: Schema-less but type-aware

  • Use case: MongoDB, document databases

  • Strengths: Additional data types (dates, binary), traversable without parsing entire document

  • Compression: Less compact than MessagePack; used for queryability not storage efficiency

Columnar Formats (Analytics-Optimized)

Apache Parquet:

  • Structure: Structured (strict schema, strong typing)

  • Orientation: Columnar—stores each column's values together

  • Schema: Embedded in file metadata with column statistics (min/max, null counts)

  • Use case: Data lakes, analytical storage, Spark/Hive workloads, long-term storage

  • Strengths:

    • Excellent compression (10-100x): columnar layout + encoding (dictionary, RLE)

    • Query optimization: column pruning, predicate pushdown, statistics-based skipping

    • Only read columns you need

    • Industry standard for analytics

  • Compression: Snappy (fast), GZIP (high ratio), LZ4, Zstandard

  • Access pattern: Read few columns from many rows (analytical queries)

  • Pipeline role: Storage layer—optimized for fast analytical queries
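
A short sketch of writing and reading Parquet with pyarrow, showing column pruning and filter pushdown; the table contents and file name are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "department": ["eng", "sales", "eng"],
    "salary": [100_000, 80_000, 95_000],
})

# Columnar storage with Snappy compression
pq.write_table(table, "employees.parquet", compression="snappy")

# Column pruning: only the requested columns are read from disk
subset = pq.read_table("employees.parquet", columns=["department", "salary"])

# Predicate pushdown: row groups can be skipped using stored min/max statistics
filtered = pq.read_table("employees.parquet", filters=[("salary", ">", 90_000)])
```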

ORC (Optimized Row Columnar):

  • Structure: Structured (strict schema)

  • Orientation: Columnar with lightweight indexes

  • Schema: Embedded with rich type system

  • Use case: Hive-heavy environments, maximum compression needs

  • Strengths: Similar to Parquet but often better compression, built-in bloom filters

  • Compression: ZLIB, Snappy, LZ4, Zstandard

  • Pipeline role: Alternative to Parquet, especially in Hadoop ecosystems

Apache Arrow:

  • Structure: Structured (typed columnar)

  • Orientation: Columnar in-memory format

  • Schema: Standardized memory layout

  • Use case: Zero-copy data sharing between processes (Pandas, Spark, databases)

  • Strengths:

    • No serialization overhead between compatible systems

    • Fast in-memory analytics

    • Language-agnostic standard

  • Compression: Not primarily a storage format, but supports compression in flight

  • Pipeline role: In-memory interchange—fast data transfer between tools without conversion
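
A minimal sketch of Arrow as the in-memory interchange layer, assuming pandas and pyarrow; the DataFrame is illustrative:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})

# pandas -> Arrow: a standardized, language-agnostic columnar memory layout
table = pa.Table.from_pandas(df)
print(table.schema)

# Arrow-aware engines (Spark, DuckDB, Polars, ...) can consume this table
# without re-serializing it; converting back to pandas is also cheap
round_tripped = table.to_pandas()
```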

Specialized Binary Formats

Media formats:

  • Images: JPEG (lossy), PNG (lossless), WebP

  • Audio: MP3/AAC (lossy), FLAC (lossless)

  • Video: MP4, MKV containers with H.264/H.265 codecs (lossy)

Archives:

  • ZIP, TAR, GZIP, 7z: Container formats with compression

Row-Oriented vs. Column-Oriented: The Critical Distinction

This is the fundamental divide in data engineering formats:

Row-Oriented (JSON, CSV, Avro, Protobuf):

  • Store complete records together: [id, name, age, salary], [id, name, age, salary], ...

  • Best for: Transactional workloads (OLTP), writing individual records, reading entire rows

  • Use case: Application logs, event streams, APIs where you need all fields

  • Example: Insert user record, stream events from Kafka

Column-Oriented (Parquet, ORC, Arrow):

  • Store each column separately: [all ids], [all names], [all ages], [all salaries]

  • Best for: Analytical workloads (OLAP), aggregations, reading subset of columns

  • Use case: Data warehouses, analytics on wide tables

  • Example: "Calculate average salary by department"—touches 2 of 50 columns

Performance impact:

  • Scenario: 100 columns, 1 billion rows, query needs 2 columns

  • Row format: Read all 100 columns × 1B rows, discard 98%

  • Column format: Read only 2 columns × 1B rows = 50x less I/O
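
The difference is easiest to see in plain Python, where a row layout is a list of dicts and a column layout is a dict of lists (a toy sketch, not a real storage engine):

```python
# Row-oriented: each record kept together (like CSV, JSON lines, Avro)
rows = [
    {"id": 1, "name": "Alice", "age": 34, "salary": 100_000},
    {"id": 2, "name": "Bob", "age": 29, "salary": 80_000},
]

# Column-oriented: each column's values stored together (like Parquet, ORC)
columns = {
    "id": [1, 2],
    "name": ["Alice", "Bob"],
    "age": [34, 29],
    "salary": [100_000, 80_000],
}

# An aggregation touches only the columns it needs
avg_salary = sum(columns["salary"]) / len(columns["salary"])
```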

The Role of Schema

Schema is the foundation of efficiency differences between formats:

Schema-less/Implicit (JSON, CSV):

  • Field names repeated in every record: {"name":"Alice"} millions of times

  • Numbers stored as text: "12345" is 5 bytes instead of 4

  • No type enforcement or validation until read time

  • Maximum flexibility but storage/query overhead

Schema-enforced (Parquet, Avro, Protobuf):

  • Schema stored once (file header or registry)

  • Field names replaced by positions/IDs

  • Binary type encoding: integers as 4 bytes, not text

  • Query engines use schema for optimization (column pruning, predicate pushdown)

  • Validation at write time

Schema evolution:

  • Avro: Forward/backward compatible, fields can be added/removed safely

  • Protobuf: Field numbering allows deprecation without breaking compatibility

  • Parquet: Can add columns without rewriting existing data

  • JSON/CSV: No formal evolution, just handle inconsistencies
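
A small sketch of that evolution in Avro terms, assuming fastavro: a newer reader schema adds a defaulted field and still decodes a file written with the older schema.

```python
from fastavro import parse_schema, reader, writer

writer_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"}],
})

# The reader's schema adds an optional field with a default,
# so files written with the old schema still decode (backward compatible)
reader_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

with open("users_v1.avro", "wb") as out:
    writer(out, writer_schema, [{"id": 1}])

with open("users_v1.avro", "rb") as fo:
    for record in reader(fo, reader_schema=reader_schema):
        print(record)  # {'id': 1, 'email': None}
```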

Compression Deep Dive

Compression reduces storage, speeds up I/O, and lowers network costs. The choice depends on your access patterns.

Lossless Compression

Preserves every bit of original data:

  • GZIP - excellent compression ratios (5-10x), slower, standard for text

  • Snappy - moderate compression (2-4x), very fast, popular in Parquet

  • LZ4 - lower compression (2-3x), extremely fast decompression

  • Zstandard (zstd) - modern algorithm balancing speed and ratio, increasingly popular

When to use:

  • High ratio (GZIP, zstd): Compress once, read many times (archival, cold storage)

  • Fast (Snappy, LZ4): Frequent writes and reads (hot data, real-time pipelines)

Columnar advantage:

  • Similar data compresses better: column of dates compresses far better than mixed row data

  • Dictionary encoding: 1M records with 100 unique names → store mapping once + indices

  • Run-length encoding: repeated values stored as count + value
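
A toy illustration of dictionary and run-length encoding in plain Python (real columnar writers implement these natively and far more efficiently):

```python
# Dictionary encoding: store each distinct value once, then small integer indices
names = ["Alice", "Bob", "Alice", "Alice", "Bob"]
dictionary = sorted(set(names))                 # ['Alice', 'Bob']
indices = [dictionary.index(n) for n in names]  # [0, 1, 0, 0, 1]

# Run-length encoding: repeated values stored as [value, count] pairs
years = [2024, 2024, 2024, 2025, 2025]
rle = []
for y in years:
    if rle and rle[-1][0] == y:
        rle[-1][1] += 1
    else:
        rle.append([y, 1])
# rle == [[2024, 3], [2025, 2]]
```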

Lossy Compression

Discards less important information for dramatic size reduction:

  • JPEG - removes visual details humans can't perceive

  • MP3/AAC - removes inaudible frequencies

  • H.264/H.265 - video codecs with temporal/spatial compression

Only acceptable for media formats where some quality loss is tolerable.

Compression Comparison

For 1 million records with 10 columns:

  • CSV uncompressed: ~500 MB

  • CSV + GZIP: ~100 MB (5x)

  • JSON uncompressed: ~800 MB

  • JSON + GZIP: ~150 MB (5x)

  • Parquet + Snappy: ~50 MB (10x)

  • Parquet + GZIP: ~30 MB (16x)

Why Parquet compresses better:

  1. Binary encoding (no text overhead)

  2. Columnar layout (similar data together)

  3. Schema separation (no repeated field names)

  4. Specialized encodings (dictionary, RLE)
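
To check ratios on your own data, a rough measurement sketch with pandas (Parquet output requires pyarrow); the actual numbers vary widely with the data:

```python
import os

import pandas as pd

# A small synthetic table; real-world ratios depend heavily on the data
df = pd.DataFrame({
    "id": range(100_000),
    "category": ["a", "b", "c", "d"] * 25_000,
    "value": [x * 0.5 for x in range(100_000)],
})

df.to_csv("data.csv", index=False)
df.to_csv("data.csv.gz", index=False, compression="gzip")
df.to_json("data.json", orient="records", lines=True)
df.to_parquet("data.parquet", compression="snappy")

for path in ["data.csv", "data.csv.gz", "data.json", "data.parquet"]:
    print(path, os.path.getsize(path), "bytes")
```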

Format Selection in Data Engineering Pipelines

Typical Pipeline Flow

Ingestion (JSON, CSV, Avro) → Streaming (Avro, Protobuf) → Processing (Arrow in-memory) → Storage (Parquet, ORC) → Analytics (SQL on Parquet/ORC)

Decision Framework

Ingestion Layer (Landing Zone):

  • Use: JSON, CSV, Avro

  • Why: Flexibility for diverse sources, easy ingestion, schema evolution

  • Compression: GZIP for temporary storage

  • Pattern: Accept data as-is, preserve raw format

Streaming Layer:

  • Use: Avro, Protobuf

  • Why: Schema enforcement, compact, handles evolution, fast serialization

  • Platform: Kafka, Pub/Sub with schema registry

  • Pattern: Typed messages with versioned schemas

Processing Layer:

  • Use: Arrow in-memory

  • Why: Zero-copy between Spark/Pandas/Dask, fast transformations

  • Pattern: Minimize serialization overhead during multi-step processing

Storage Layer (Data Lake/Warehouse):

  • Use: Parquet, ORC

  • Why: Columnar compression, query optimization, long-term efficiency

  • Compression: Snappy for hot data, GZIP for cold storage

  • Pattern: Partitioned by date/category for query pruning
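
A sketch of the partitioning pattern with pyarrow; the lake/ directory and columns are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "category": ["click", "view", "click"],
    "value": [1, 2, 3],
})

# Hive-style partitioning: one directory per date, e.g. lake/event_date=2024-01-01/
# Queries filtered on event_date only touch the matching directories
pq.write_to_dataset(table, root_path="lake", partition_cols=["event_date"])
```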

Analytics Layer:

  • Query: SQL on Parquet/ORC

  • Benefits: Column pruning, predicate pushdown, statistics-based skipping

  • Performance: 50-100x faster than parsing JSON
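
A sketch of querying Parquet directly with SQL, assuming DuckDB and the hypothetical lake/ layout from the previous sketch:

```python
import duckdb  # pip install duckdb

# DuckDB reads Parquet in place, pruning columns and skipping row groups
# via the embedded statistics
result = duckdb.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM 'lake/**/*.parquet'
    GROUP BY category
""").fetchall()
print(result)
```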

Specific Use Cases

Real-time event streaming:

  • Protobuf: Compact messages for transport

  • Parquet: Analytics-optimized storage

Batch ETL with schema evolution:

  • JSON: Flexible ingestion

  • Avro: Handle schema changes during transformation

  • Parquet: Final analytical storage

Multi-tool analytics:

  • Parquet: Efficient storage

  • Arrow: Zero-copy sharing between languages

Long-term archival:

  • Avro: Self-describing, schema evolution support

  • GZIP: Maximum compression for rarely accessed data

Key Takeaways

  1. Structure matters: Semi-structured (JSON) for flexibility, structured (Parquet) for efficiency

  2. Orientation matters: Row-oriented for transactions, column-oriented for analytics

  3. Schema is critical: Enables compression, validation, and query optimization

  4. Compression strategy: Balance ratio vs. speed based on access patterns

  5. Pipeline evolution: Semi-structured → Structured as data moves from ingestion to analytics

  6. Specialized formats exist for a reason: Parquet/ORC for analytics, Avro/Protobuf for streaming, Arrow for in-memory

The "right" format depends on your specific use case: access patterns, query types, data volume, schema stability, and position in the pipeline. Modern data platforms often use multiple formats, each optimized for its role in the overall architecture.
