Types of storage


The Physical Foundation: Disk Blocks

Beneath all these abstractions lies the physical storage medium, organized into disk blocks (also called sectors or physical blocks). These represent the actual hardware-level units that the storage device can read or write in a single operation.

Modern hard drives typically use 4KB physical blocks, while SSDs may use different page sizes (often 4KB, 8KB, or 16KB). These physical blocks are the atomic unit of I/O at the hardware level: the device cannot transfer anything smaller, so updating even a single byte means reading, modifying, and rewriting an entire block.
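
A minimal Python sketch of block-addressed access, using an ordinary file to stand in for a raw device: a block's byte offset is simply its number times the block size, and a device-level read always transfers a whole block. The file path and helper name here are illustrative.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # a typical physical block size

# Scratch file standing in for a raw device: four blocks of known bytes.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(bytes(i % 256 for i in range(BLOCK_SIZE * 4)))

def read_block(f, block_number, block_size=BLOCK_SIZE):
    """Read one whole block: the smallest unit the device transfers."""
    f.seek(block_number * block_size)
    return f.read(block_size)

with open(path, "rb") as f:
    block2 = read_block(f, 2)  # block 2 starts at byte offset 8192

os.remove(path)
print(len(block2))  # 4096, always a full block
```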

The Abstraction Hierarchy

The relationship between these levels forms a layered architecture:

  1. Physical disk blocks: The hardware's native addressable units

  2. Block storage: Logical blocks exposed to the operating system (may align with or abstract physical blocks)

  3. File systems: Built on block storage to provide file and directory abstractions

  4. Object storage: Often built on file systems or block storage, adding its own object-oriented layer

Each layer introduces overhead but provides valuable functionality. Block storage adds minimal overhead while enabling flexibility. File systems add metadata management and hierarchical organization. Object storage adds scalability, rich metadata, and simplified distribution at the cost of access pattern constraints (primarily read-entire-object operations).
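
The hierarchy above can be sketched in a few lines of Python: a flat byte array plays the role of the physical disk, block read/write functions form the block layer, and a toy file table maps a name to the blocks holding its contents. All names and sizes here are illustrative, not any real file system's layout.

```python
BLOCK_SIZE = 8  # tiny blocks so the example fits on screen

# Layer 1: the "physical disk", a flat byte array addressed in blocks.
disk = bytearray(BLOCK_SIZE * 16)

def write_block(n, data):
    assert len(data) == BLOCK_SIZE  # the device only moves whole blocks
    disk[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE] = data

def read_block(n):
    return bytes(disk[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE])

# Layer 2: a toy "file system", mapping a name to an ordered list of blocks.
file_table = {"notes.txt": [3, 7]}  # this file occupies blocks 3 and 7

def write_file(name, data):
    for i, n in enumerate(file_table[name]):
        chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        write_block(n, chunk.ljust(BLOCK_SIZE, b"\x00"))  # pad the tail block

def read_file(name):
    return b"".join(read_block(n) for n in file_table[name])

write_file("notes.txt", b"hello, blocks!")
print(read_file("notes.txt").rstrip(b"\x00"))  # b'hello, blocks!'
```

Notice that the file layer never touches bytes directly; it only composes block operations, which is exactly the layering the list above describes.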

| Feature | Object Storage | Block Storage | File Storage |
| --- | --- | --- | --- |
| Structure | Flat pool of objects | Grid of blocks | Hierarchical folders |
| Access | API (PUT, GET) | OS/driver (read/write blocks) | OS/network (read/write files) |
| Metadata | Rich, customizable | None | Fixed, limited |
| Best For | Unstructured data, backup, archive, web apps | Databases, high-performance computing | Shared files, user directories |

Block Storage

Block storage operates at the most granular level, presenting storage as a collection of fixed-size blocks that can be individually read or written. Each block has a unique address, and the storage system doesn't impose any structure on how data is organized within those blocks.

Characteristics:

  • Provides raw storage volumes that appear as local disks to the operating system

  • Requires formatting with a file system (ext4, XFS, NTFS) before use

  • Supports both read and write operations at the block level

  • Offers low-latency access since it's typically directly attached or on high-speed networks

  • Allows incremental updates—you can modify small portions without rewriting entire files
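
The last point is worth demonstrating: with block-level access you can patch a few bytes in the middle of a large file in place, without rewriting the rest. A small sketch using an ordinary local file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Write a 1 MB file, then patch 4 bytes in the middle in place.
with open(path, "wb") as f:
    f.write(b"\x00" * 1_000_000)

with open(path, "r+b") as f:  # open for in-place update, no truncation
    f.seek(500_000)           # jump straight to the region to modify
    f.write(b"DATA")          # rewrite only these 4 bytes

with open(path, "rb") as f:
    f.seek(500_000)
    patched = f.read(4)

size = os.path.getsize(path)  # still 1,000,000 bytes; nothing else was rewritten
os.remove(path)
print(patched)  # b'DATA'
```

Object storage has no equivalent of this seek-and-write: changing those 4 bytes there would mean uploading a new copy of the whole object.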

AWS Example: Amazon EBS (Elastic Block Store)

  • Attaches to a single EC2 instance at a time (though some volume types support multi-attach)

  • Persists independently of instance lifecycle

  • Offers different performance tiers (gp3, io2, st1, sc1) based on IOPS and throughput needs

  • Supports snapshots for backup and recovery

  • Volume sizes from 1 GiB up to 16 TiB for most volume types (64 TiB for io2 Block Express)

Use Cases for Data Engineering:

  • Database storage (PostgreSQL, MySQL, MongoDB) where you need consistent low-latency access

  • Transactional workloads requiring frequent random reads and writes

  • Running analytics databases like ClickHouse or TimescaleDB on EC2

  • Storing Kafka logs or other streaming data that requires fast sequential writes

  • Running Spark or Hadoop worker nodes where intermediate shuffle data needs fast local storage

Limitations:

  • Single availability zone—if the AZ fails, the volume is unavailable until recovery

  • Typically attached to one instance, limiting concurrent access

  • Costs scale linearly with provisioned capacity, regardless of actual usage

  • Manual management of capacity—you must provision and resize volumes explicitly


File Storage

File storage adds a hierarchical namespace on top of block storage, organizing data into directories and files. It provides shared access through network protocols, allowing multiple clients to read and write simultaneously.

Characteristics:

  • Presents a POSIX-compliant file system interface

  • Supports concurrent access from multiple compute instances

  • Handles file locking, permissions, and metadata automatically

  • Scales elastically—capacity grows and shrinks based on actual usage

  • Maintains file system semantics (directories, symbolic links, permissions)
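
File locking can be seen in a few lines of Python on any POSIX system: two handles to the same file act as two clients, and the second client's non-blocking exclusive lock is refused while the first holds the lock (flock treats separately opened descriptors independently, even within one process). This is a local sketch of the coordination a shared file system provides across machines.

```python
import fcntl
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Two handles to the same file, as two clients of a shared file system would have.
writer = open(path, "w")
other = open(path, "w")

fcntl.flock(writer, fcntl.LOCK_EX)  # first client takes an exclusive lock

got_lock = True
try:
    # Second client attempts a non-blocking exclusive lock and is refused.
    fcntl.flock(other, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    got_lock = False

fcntl.flock(writer, fcntl.LOCK_UN)  # release the lock
writer.close()
other.close()
os.remove(path)
print(got_lock)  # False: the lock was held by the first client
```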

AWS Example: Amazon EFS (Elastic File System)

  • Automatically scales from gigabytes to petabytes

  • Accessible from multiple EC2 instances simultaneously across multiple AZs

  • Supports the NFSv4.0 and NFSv4.1 protocols

  • Offers different storage classes (Standard, Infrequent Access)

  • Provides lifecycle management to move files between storage classes

Use Cases for Data Engineering:

  • Shared configuration files and scripts across a cluster of processing nodes

  • Home directories for user workspaces in notebook environments (JupyterHub, Zeppelin)

  • Shared libraries and dependencies for containerized applications

  • Content management systems where multiple applications need access to the same files

  • Machine learning training data that needs to be accessed by multiple training jobs simultaneously

  • Centralized logging where multiple services write to shared log directories

Limitations:

  • Higher latency compared to block storage due to network overhead

  • More expensive per GB than object storage

  • Performance can be inconsistent with highly concurrent workloads

  • Not ideal for extremely high-throughput sequential reads/writes

  • POSIX semantics can create bottlenecks at scale (directory listing operations, for example)


Object Storage

Object storage treats data as discrete, self-contained objects rather than files or blocks. Each object consists of data, metadata, and a unique identifier. The architecture is designed for massive scalability and distributed access.

Each piece of data you store is called an Object. As shown in the diagram below, an object is made up of three key components:

  • Data (Payload): The actual file content you are storing (e.g., an image, a document, a log file).

  • Metadata: Customizable labels and tags that describe the object (e.g., author=jane, project=website, retention=1year). This makes it easy to search and manage your data.

  • Unique ID (Key): A unique identifier, similar to a URL, that you use to retrieve the object (e.g., my-bucket/photos/vacation/beach.jpg).

You interact with this storage pool over the internet using simple API commands like PUT (to upload), GET (to download), and DELETE.
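
A toy in-memory sketch of this model (the class and method names are illustrative, not any real SDK's API): each object is a key pointing at a payload plus a metadata dictionary, manipulated only through put/get/delete.

```python
# Minimal in-memory object store illustrating the PUT/GET/DELETE model.
# Real services (Amazon S3 and similar) expose these verbs over HTTP.
class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (payload bytes, metadata dict)

    def put(self, key, data, metadata=None):
        self._objects[key] = (data, metadata or {})

    def get(self, key):
        return self._objects[key]  # whole object: no partial reads here

    def delete(self, key):
        self._objects.pop(key, None)

store = ObjectStore()
store.put("photos/vacation/beach.jpg", b"<jpeg bytes>",
          metadata={"author": "jane", "project": "website"})

data, meta = store.get("photos/vacation/beach.jpg")
print(meta["author"])  # jane
```

Note that the key "photos/vacation/beach.jpg" only looks like a path; the store itself sees a single flat string.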

Characteristics:

  • Flat namespace—objects are identified by keys rather than hierarchical paths (though keys can simulate paths)

  • Objects are immutable—modifications require uploading a new version

  • Stores unlimited amounts of data with virtually unlimited objects

  • Highly durable (11 nines of durability for S3)

  • Historically eventually consistent for overwrites and deletes, though S3 has provided strong read-after-write consistency since December 2020

  • Rich metadata support—arbitrary key-value pairs per object

  • Accessible via HTTP/HTTPS APIs rather than file system protocols

AWS Example: Amazon S3 (Simple Storage Service)

  • Organizes objects into buckets (top-level containers)

  • Supports versioning, lifecycle policies, and event notifications

  • Multiple storage classes (Standard, Intelligent-Tiering, Glacier, Deep Archive)

  • Integrates with virtually every AWS service

  • Offers features like S3 Select for querying data without downloading

Use Cases for Data Engineering:

  • Data lakes—storing raw data in various formats (CSV, JSON, Parquet, Avro)

  • Long-term archival and backup storage

  • Storing results from batch processing jobs (Spark output, ETL results)

  • Serving as source/destination for data pipelines (Airflow, Glue, Step Functions)

  • Hosting static datasets for analytics and machine learning

  • Storing intermediate results between pipeline stages

  • Log aggregation and retention

  • Serving as a staging area for data warehouse loading (Redshift, Snowflake)

  • Immutable audit trails and compliance data

Advanced S3 Features for Data Engineers:

  • S3 Select: Query CSV/JSON/Parquet files using SQL without downloading entire objects

  • Partitioning: Organize data using key prefixes (e.g., s3://bucket/year=2025/month=01/day=15/data.parquet)

  • Storage Classes: Automatically transition data between tiers based on access patterns

  • Batch Operations: Perform operations on billions of objects (copying, tagging, restoring)

  • Event Notifications: Trigger Lambda functions or SQS messages when objects are created/deleted

  • Transfer Acceleration: Speed up uploads using CloudFront edge locations

  • Requester Pays: Let data consumers pay for transfer costs
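
The partitioning convention above is plain string construction. A small sketch that builds the Hive-style year=/month=/day= prefix (the function name is illustrative):

```python
from datetime import date

def partition_key(bucket, d, filename):
    """Build a Hive-style partition prefix for an object key.
    Query engines such as Athena and Spark can prune reads using these prefixes."""
    return (f"s3://{bucket}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/{filename}")

key = partition_key("my-bucket", date(2025, 1, 15), "data.parquet")
print(key)  # s3://my-bucket/year=2025/month=01/day=15/data.parquet
```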

Limitations:

  • Higher latency than block/file storage—typically 100-200ms for first byte

  • Must read/write entire objects—no partial updates (though multipart upload helps)

  • LIST operations can be slow with millions of objects

  • Not suitable for transactional workloads or databases

  • Limited query capabilities without additional tools (Athena, Presto, Spark)

  • Network bandwidth becomes a bottleneck for frequent access patterns
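
The first-byte latency compounds quickly across many small objects, which is why object stores reward fewer, larger objects and parallel requests. A back-of-the-envelope sketch, assuming the 100 ms low end of the range above and ignoring bandwidth:

```python
# Assumed figures: 100 ms time-to-first-byte, 10,000 small objects to fetch.
FIRST_BYTE_S = 0.1
NUM_OBJECTS = 10_000

serial_s = NUM_OBJECTS * FIRST_BYTE_S  # one request at a time
parallel_s = serial_s / 100            # 100 concurrent requests

print(serial_s)    # 1000.0 seconds, roughly 17 minutes
print(parallel_s)  # 10.0 seconds
```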

Object Storage Versioning

One of the most powerful features of object storage is versioning. When you enable versioning on a bucket, the system keeps a history of every object.

When you update an object (by uploading a new file with the same key), it doesn't overwrite the old data. Instead, it creates a new version and makes it the current one. The old version is preserved, and you can still access it by specifying its unique version ID.

This is crucial for data protection. It allows you to recover from accidental deletions or overwrites. The diagram below illustrates this concept, showing a single logical object (image.jpg) with a history of three distinct versions.

In this diagram, a GET request for image.jpg returns the current version (Version 3). However, a request for image.jpg?versionId=v1 would retrieve the original file (Version 1). The diagram also shows that even after a 'DELETE' operation, the deleted version is still stored and can be recovered if needed.
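
The behavior can be sketched in a few lines (an illustrative toy, not S3's actual API): each PUT appends a version, a plain GET returns the newest one, a versioned GET reaches back into history, and DELETE writes a marker instead of erasing anything.

```python
import itertools

class VersionedBucket:
    def __init__(self):
        self._versions = {}  # key -> list of (version_id, payload)
        self._ids = itertools.count(1)

    def put(self, key, data):
        vid = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((vid, data))
        return vid

    def get(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            _, data = versions[-1]            # newest version
        else:
            data = dict(versions)[version_id]  # reach back into history
        if data is None:
            raise KeyError("delete marker")    # object looks deleted
        return data

    def delete(self, key):
        # A delete hides the object but keeps every prior version recoverable.
        self.put(key, None)

bucket = VersionedBucket()
v1 = bucket.put("image.jpg", b"original")
bucket.put("image.jpg", b"edited")
bucket.put("image.jpg", b"final")

print(bucket.get("image.jpg"))                 # b'final'
print(bucket.get("image.jpg", version_id=v1))  # b'original'
```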


Hybrid and Specialized Storage Options

Amazon FSx Family:

  • FSx for Lustre: High-performance parallel file system for HPC and ML training workloads. Integrates with S3 for seamless data access. Offers throughput up to hundreds of GB/s.

  • FSx for Windows File Server: Fully managed Windows file servers with SMB protocol support

  • FSx for NetApp ONTAP: Enterprise NAS with advanced data management features

  • FSx for OpenZFS: File storage built on OpenZFS with snapshots and cloning

Use case: When EFS doesn't provide enough performance, FSx for Lustre is ideal for ML training, genomics processing, or financial modeling requiring massive parallel throughput.

AWS Storage Gateway:

  • Hybrid cloud storage connecting on-premises environments to AWS

  • File Gateway: Presents S3 as NFS/SMB file shares

  • Volume Gateway: Presents S3 as iSCSI block storage

  • Tape Gateway: Virtual tape library for backup applications

Use case: Gradually migrating on-premises data workloads to cloud while maintaining local access patterns.

Amazon S3 on Outposts:

  • Run S3 object storage on-premises with the same APIs

  • Useful for data residency requirements or low-latency local access


Performance Comparison

Here's the performance comparison table:

| Storage Type | Latency | Throughput | IOPS | Concurrency | Cost (relative) |
| --- | --- | --- | --- | --- | --- |
| Block (EBS io2) | <1 ms | 1,000+ MB/s | 64,000+ | Single instance | High |
| Block (EBS gp3) | Single-digit ms | 125-1,000 MB/s | 3,000-16,000 | Single instance | Medium |
| File (EFS) | Low ms | Bursts to 10+ GB/s | Varies | Multiple instances | Medium-High |
| Object (S3) | 100-200 ms | Scales with parallel requests | N/A | Unlimited | Low |
| FSx for Lustre | Sub-ms | Hundreds of GB/s | Millions | Multiple instances | High |

Key Notes:

  • Latency: Time to access first byte of data

  • Throughput: Maximum data transfer rate (sequential operations)

  • IOPS: Input/Output Operations Per Second (random access operations)

  • Concurrency: Number of clients/instances that can access simultaneously

  • Cost: Relative pricing per GB stored and per operation

For Data Engineering Context:

  • Use Block (io2) for high-performance databases requiring consistent low latency

  • Use Block (gp3) for general-purpose workloads with balanced performance/cost

  • Use File (EFS) when multiple workers need shared access to the same datasets

  • Use Object (S3) for data lakes and long-term storage where cost optimization matters

  • Use FSx Lustre for ML training or HPC workloads requiring extreme parallel throughput


Choosing the Right Storage

Use Block Storage (EBS) when:

  • Running databases that need low-latency random access

  • Operating systems and boot volumes

  • Real-time analytics requiring fast local storage

  • Single-instance applications with high IOPS requirements

Use File Storage (EFS/FSx) when:

  • Multiple instances need shared access to the same data

  • You need POSIX file system semantics

  • Running containerized applications with shared configuration

  • ML training jobs accessing shared datasets

Use Object Storage (S3) when:

  • Building data lakes with diverse data types

  • Long-term retention and archival

  • Integrating with AWS analytics services (Athena, EMR, Glue)

  • Storing immutable data like logs, backups, or compliance records

  • Cost optimization is critical and access patterns allow higher latency
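
These rules of thumb can be condensed into a hypothetical decision helper; real decisions also weigh throughput, durability, and service integration, so treat this as a sketch rather than a prescription.

```python
def choose_storage(shared_access, low_latency_random_io, posix_needed):
    """Toy decision rule condensing the guidance above (illustrative only)."""
    if low_latency_random_io and not shared_access:
        return "block (EBS)"       # databases, single-instance high IOPS
    if shared_access and posix_needed:
        return "file (EFS/FSx)"    # shared POSIX access across instances
    return "object (S3)"           # default to the cheapest, most scalable tier

print(choose_storage(shared_access=False, low_latency_random_io=True,
                     posix_needed=False))  # block (EBS)
print(choose_storage(shared_access=True, low_latency_random_io=False,
                     posix_needed=True))   # file (EFS/FSx)
print(choose_storage(shared_access=False, low_latency_random_io=False,
                     posix_needed=False))  # object (S3)
```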

Data Engineering Pattern: The Three-Tier Approach

Many data engineering architectures use all three:

  1. S3 for raw data ingestion, data lake storage, and long-term retention

  2. EBS for database storage (Postgres, Airflow metadata) and temporary processing

  3. EFS for shared resources (DAGs, scripts, libraries) across compute clusters

For example, a typical Spark pipeline might:

  • Read source data from S3 (cost-effective, scalable)

  • Use EBS volumes for local shuffle data during processing (fast random I/O)

  • Share UDF libraries and configuration via EFS (accessible to all workers)

  • Write results back to S3 in optimized formats like Parquet

This layered approach optimizes for both cost and performance across different stages of the data lifecycle.

