Types of storage
The Physical Foundation: Disk Blocks
Beneath all these abstractions lies the physical storage medium, organized into disk blocks (also called sectors or physical blocks). These represent the actual hardware-level units that the storage device can read or write in a single operation.
Modern hard drives typically use 4 KB physical sectors (older drives used 512 bytes), while SSDs read and write in pages of varying sizes (often 4KB, 8KB, or 16KB). These physical blocks represent the atomic unit of I/O at the hardware level—the device cannot read or write anything smaller without transferring the entire block.
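You can observe the block size your operating system reports for a mounted filesystem with Python's standard library. This is a small sketch: it queries the filesystem's preferred I/O block size (which usually, but not necessarily, matches the device's physical sector size), and the `/` mount point assumes a Unix-like system.

```python
import os

# Query filesystem statistics for the root mount point (Unix-like systems).
# f_bsize is the filesystem's preferred I/O block size; on most modern
# Linux systems this is 4096 bytes, matching common 4 KB physical sectors.
st = os.statvfs("/")
block_size = st.f_bsize
print(f"Reported block size: {block_size} bytes")
```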
The Abstraction Hierarchy
The relationship between these levels forms a layered architecture:
Physical disk blocks: The hardware's native addressable units
Block storage: Logical blocks exposed to the operating system (may align with or abstract physical blocks)
File systems: Built on block storage to provide file and directory abstractions
Object storage: Often built on file systems or block storage, adding its own object-oriented layer
Each layer introduces overhead but provides valuable functionality. Block storage adds minimal overhead while enabling flexibility. File systems add metadata management and hierarchical organization. Object storage adds scalability, rich metadata, and simplified distribution at the cost of access pattern constraints (primarily read-entire-object operations).
| Feature | Object Storage | Block Storage | File Storage |
| --- | --- | --- | --- |
| Structure | Flat pool of objects | Grid of blocks | Hierarchical folders |
| Access | API (PUT, GET) | OS/Driver (read/write blocks) | OS/Network (read/write files) |
| Metadata | Rich, customizable | None | Fixed, limited |
| Best For | Unstructured data, backup, archive, web apps | Databases, high-performance computing | Shared files, user directories |
Block Storage
Block storage operates at the most granular level, presenting storage as a collection of fixed-size blocks that can be individually read or written. Each block has a unique address, and the storage system doesn't impose any structure on how data is organized within those blocks.

Characteristics:
Provides raw storage volumes that appear as local disks to the operating system
Requires formatting with a file system (ext4, XFS, NTFS) before use
Supports both read and write operations at the block level
Offers low-latency access since it's typically directly attached or on high-speed networks
Allows incremental updates—you can modify small portions without rewriting entire files
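Because blocks are addressed independently by offset, a small region can be rewritten in place without touching the rest of the volume. The sketch below simulates this against an ordinary file standing in for a raw volume; the file name and the 4 KB block size are illustrative.

```python
import os

BLOCK_SIZE = 4096  # illustrative; matches a common physical block size

# Create a file standing in for a small raw volume of 4 blocks.
with open("volume.bin", "wb") as f:
    f.write(b"\x00" * BLOCK_SIZE * 4)

# Rewrite only block 2 in place: no need to rewrite the whole "volume".
with open("volume.bin", "r+b") as f:
    f.seek(2 * BLOCK_SIZE)          # blocks are addressed by offset
    f.write(b"X" * BLOCK_SIZE)

# Read block 2 back to confirm the targeted update.
with open("volume.bin", "rb") as f:
    f.seek(2 * BLOCK_SIZE)
    block = f.read(BLOCK_SIZE)

os.remove("volume.bin")
```

This is exactly the access pattern databases rely on: updating a single page of a table file without rewriting the file.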
AWS Example: Amazon EBS (Elastic Block Store)
Attaches to a single EC2 instance at a time (though some volume types support multi-attach)
Persists independently of instance lifecycle
Offers different performance tiers (gp3, io2, st1, sc1) based on IOPS and throughput needs
Supports snapshots for backup and recovery
Volume sizes from 1 GiB up to 64 TiB, depending on volume type
Use Cases for Data Engineering:
Database storage (PostgreSQL, MySQL, MongoDB) where you need consistent low-latency access
Transactional workloads requiring frequent random reads and writes
Running analytics databases like ClickHouse or TimescaleDB on EC2
Storing Kafka logs or other streaming data that requires fast sequential writes
Running Spark or Hadoop worker nodes where intermediate shuffle data needs fast local storage
Limitations:
Single availability zone—if the AZ fails, the volume is unavailable until recovery
Typically attached to one instance, limiting concurrent access
Costs scale linearly with provisioned capacity, regardless of actual usage
Manual management of capacity—you must provision and resize volumes explicitly
File Storage
File storage adds a hierarchical namespace on top of block storage, organizing data into directories and files. It provides shared access through network protocols, allowing multiple clients to read and write simultaneously.

Characteristics:
Presents a POSIX-compliant file system interface
Supports concurrent access from multiple compute instances
Handles file locking, permissions, and metadata automatically
Scales elastically—capacity grows and shrinks based on actual usage
Maintains file system semantics (directories, symbolic links, permissions)
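The file-locking behavior mentioned above can be demonstrated with advisory locks, the mechanism network file systems such as EFS expose (via NFSv4 locking). This is a Unix-specific sketch using Python's `fcntl` module; the file name is illustrative.

```python
import fcntl

# Acquire an exclusive advisory lock before writing to a shared file,
# so that concurrent writers on other instances are serialized.
with open("shared_state.txt", "a+") as f:
    fcntl.flock(f, fcntl.LOCK_EX)   # blocks until we hold the lock
    f.write("updated by one worker at a time\n")
    f.flush()
    fcntl.flock(f, fcntl.LOCK_UN)   # release so other clients can proceed
```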
AWS Example: Amazon EFS (Elastic File System)
Automatically scales from gigabytes to petabytes
Accessible from multiple EC2 instances simultaneously across multiple AZs
Supports the NFSv4.0 and NFSv4.1 protocols
Offers different storage classes (Standard, Infrequent Access)
Provides lifecycle management to move files between storage classes
Use Cases for Data Engineering:
Shared configuration files and scripts across a cluster of processing nodes
Home directories for user workspaces in notebook environments (JupyterHub, Zeppelin)
Shared libraries and dependencies for containerized applications
Content management systems where multiple applications need access to the same files
Machine learning training data that needs to be accessed by multiple training jobs simultaneously
Centralized logging where multiple services write to shared log directories
Limitations:
Higher latency compared to block storage due to network overhead
More expensive per GB than object storage
Performance can be inconsistent with highly concurrent workloads
Not ideal for extremely high-throughput sequential reads/writes
POSIX semantics can create bottlenecks at scale (directory listing operations, for example)
Object Storage
Object storage treats data as discrete, self-contained objects rather than files or blocks. Each object consists of data, metadata, and a unique identifier. The architecture is designed for massive scalability and distributed access.
Each piece of data you store is called an Object. As shown in the diagram below, an object is made up of three key components:
Data (Payload): The actual file content you are storing (e.g., an image, a document, a log file).
Metadata: Customizable labels and tags that describe the object (e.g., author=jane, project=website, retention=1year). This makes it easy to search and manage your data.
Unique ID (Key): A unique identifier, similar to a URL, that you use to retrieve the object (e.g., my-bucket/photos/vacation/beach.jpg).
You interact with this storage pool over the internet using simple API commands like PUT (to upload), GET (to download), and DELETE.
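The PUT/GET/DELETE model can be sketched as a minimal in-memory object store. This is a toy model to make the interface concrete, not an S3 client; all names are illustrative.

```python
class ObjectStore:
    """Toy flat-namespace object store: key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Objects are written whole; a repeated PUT replaces the object.
        self._objects[key] = (data, metadata or {})

    def get(self, key):
        data, _ = self._objects[key]
        return data

    def head(self, key):
        # Metadata can be inspected without downloading the payload.
        _, metadata = self._objects[key]
        return metadata

    def delete(self, key):
        self._objects.pop(key, None)


store = ObjectStore()
store.put("my-bucket/photos/vacation/beach.jpg", b"<jpeg bytes>",
          {"author": "jane", "project": "website"})
```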

Characteristics:
Flat namespace—objects are identified by keys rather than hierarchical paths (though keys can simulate paths)
Objects are immutable—modifications require uploading a new version
Stores unlimited amounts of data with virtually unlimited objects
Highly durable (11 nines of durability for S3)
Historically eventually consistent for overwrites and deletes; S3 now provides strong read-after-write consistency
Rich metadata support—arbitrary key-value pairs per object
Accessible via HTTP/HTTPS APIs rather than file system protocols
AWS Example: Amazon S3 (Simple Storage Service)
Organizes objects into buckets (top-level containers)
Supports versioning, lifecycle policies, and event notifications
Multiple storage classes (Standard, Intelligent-Tiering, Glacier, Deep Archive)
Integrates with virtually every AWS service
Offers features like S3 Select for querying data without downloading
Use Cases for Data Engineering:
Data lakes—storing raw data in various formats (CSV, JSON, Parquet, Avro)
Long-term archival and backup storage
Storing results from batch processing jobs (Spark output, ETL results)
Serving as source/destination for data pipelines (Airflow, Glue, Step Functions)
Hosting static datasets for analytics and machine learning
Storing intermediate results between pipeline stages
Log aggregation and retention
Serving as a staging area for data warehouse loading (Redshift, Snowflake)
Immutable audit trails and compliance data
Advanced S3 Features for Data Engineers:
S3 Select: Query CSV/JSON/Parquet files using SQL without downloading entire objects
Partitioning: Organize data using key prefixes (e.g., s3://bucket/year=2025/month=01/day=15/data.parquet)
Storage Classes: Automatically transition data between tiers based on access patterns
Batch Operations: Perform operations on billions of objects (copying, tagging, restoring)
Event Notifications: Trigger Lambda functions or SQS messages when objects are created/deleted
Transfer Acceleration: Speed up uploads using CloudFront edge locations
Requester Pays: Let data consumers pay for transfer costs
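Hive-style partition prefixes like the example above are typically generated programmatically so that keys stay consistent across pipeline runs. A minimal sketch (the bucket and file names are placeholders):

```python
from datetime import date

def partition_key(bucket: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=) for a date."""
    return (f"s3://{bucket}/year={day.year}"
            f"/month={day.month:02d}/day={day.day:02d}/{filename}")

key = partition_key("bucket", date(2025, 1, 15), "data.parquet")
```

Zero-padding the month and day keeps prefixes lexicographically sortable, which matters for prefix-based listing and for engines like Athena or Spark that prune partitions by prefix.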
Limitations:
Higher latency than block/file storage—typically 100-200ms for first byte
Must read/write entire objects—no partial updates (though multipart upload helps)
LIST operations can be slow with millions of objects
Not suitable for transactional workloads or databases
Limited query capabilities without additional tools (Athena, Presto, Spark)
Network bandwidth becomes a bottleneck for frequent access patterns
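Because partial updates aren't possible, large uploads are split into parts on the client side and uploaded independently (the idea behind multipart upload). The splitting step can be sketched as follows; the toy sizes are illustrative (S3's actual minimum part size is 5 MB, except for the last part):

```python
def split_into_parts(data: bytes, part_size: int) -> list[bytes]:
    """Split a payload into fixed-size parts; the last part may be shorter."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

payload = b"x" * 1000
parts = split_into_parts(payload, 300)   # toy sizes for illustration
```

Each part can then be uploaded (and retried) independently before the object is assembled server-side, which also enables parallel uploads.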
Object Storage Versioning
One of the most powerful features of object storage is versioning. When you enable versioning on a bucket, the system keeps a history of every object.
When you update an object (by uploading a new file with the same key), it doesn't overwrite the old data. Instead, it creates a new version and makes it the current one. The old version is preserved, and you can still access it by specifying its unique version ID.
This is crucial for data protection. It allows you to recover from accidental deletions or overwrites. The diagram below illustrates this concept, showing a single logical object (image.jpg) with a history of three distinct versions.

In this diagram, a GET request for image.jpg returns the current version (Version 3). However, a request for image.jpg?versionId=v1 would retrieve the original file (Version 1). The diagram also shows that even after a 'DELETE' operation, the deleted version is still stored and can be recovered if needed.
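The behavior described above can be modeled with a small sketch: each key keeps its full version history, a repeated PUT appends a new current version, and a DELETE adds a marker instead of erasing data. This is a toy model; the version IDs are illustrative.

```python
class VersionedStore:
    """Toy versioned object store: each key keeps its version history."""

    def __init__(self):
        self._versions = {}   # key -> list of (version_id, data)

    def put(self, key, data):
        history = self._versions.setdefault(key, [])
        version_id = f"v{len(history) + 1}"
        history.append((version_id, data))   # old versions are preserved
        return version_id

    def get(self, key, version_id=None):
        history = self._versions[key]
        if version_id is None:               # default: the current version
            version_id, data = history[-1]
            if data is None:
                raise KeyError(f"{key}: a delete marker is current")
            return data
        for vid, data in history:            # explicit version lookup
            if vid == version_id:
                return data
        raise KeyError(version_id)

    def delete(self, key):
        # A delete adds a marker; older versions remain recoverable.
        self.put(key, None)


store = VersionedStore()
store.put("image.jpg", b"original")      # v1
store.put("image.jpg", b"edited")        # v2
store.delete("image.jpg")                # v3 is a delete marker
```

A plain GET now fails (the delete marker is current), but `get("image.jpg", "v1")` still recovers the original, mirroring the recovery path in the diagram.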
Hybrid and Specialized Storage Options
Amazon FSx Family:
FSx for Lustre: High-performance parallel file system for HPC and ML training workloads. Integrates with S3 for seamless data access. Offers throughput up to hundreds of GB/s.
FSx for Windows File Server: Fully managed Windows file servers with SMB protocol support
FSx for NetApp ONTAP: Enterprise NAS with advanced data management features
FSx for OpenZFS: File storage built on OpenZFS with snapshots and cloning
Use case: When EFS doesn't provide enough performance, FSx for Lustre is ideal for ML training, genomics processing, or financial modeling requiring massive parallel throughput.
AWS Storage Gateway:
Hybrid cloud storage connecting on-premises environments to AWS
File Gateway: Presents S3 as NFS/SMB file shares
Volume Gateway: Presents S3 as iSCSI block storage
Tape Gateway: Virtual tape library for backup applications
Use case: Gradually migrating on-premises data workloads to cloud while maintaining local access patterns.
Amazon S3 on Outposts:
Run S3 object storage on-premises with the same APIs
Useful for data residency requirements or low-latency local access
Performance Comparison
Here's the performance comparison table:

| Storage Type | Latency | Throughput | IOPS | Concurrency | Cost |
| --- | --- | --- | --- | --- | --- |
| Block (EBS io2) | <1 ms | 1,000+ MB/s | 64,000+ | Single instance | High |
| Block (EBS gp3) | Single-digit ms | 125-1,000 MB/s | 3,000-16,000 | Single instance | Medium |
| File (EFS) | Low ms | Bursts to 10+ GB/s | Varies | Multiple instances | Medium-High |
| Object (S3) | 100-200 ms | Parallel scaling | N/A | Unlimited | Low |
| FSx Lustre | Sub-ms | Hundreds of GB/s | Millions | Multiple instances | High |
Key Notes:
Latency: Time to access first byte of data
Throughput: Maximum data transfer rate (sequential operations)
IOPS: Input/Output Operations Per Second (random access operations)
Concurrency: Number of clients/instances that can access simultaneously
Cost: Relative pricing per GB stored and per operation
For Data Engineering Context:
Use Block (io2) for high-performance databases requiring consistent low latency
Use Block (gp3) for general-purpose workloads with balanced performance/cost
Use File (EFS) when multiple workers need shared access to the same datasets
Use Object (S3) for data lakes and long-term storage where cost optimization matters
Use FSx Lustre for ML training or HPC workloads requiring extreme parallel throughput
Choosing the Right Storage
Use Block Storage (EBS) when:
Running databases that need low-latency random access
Operating systems and boot volumes
Real-time analytics requiring fast local storage
Single-instance applications with high IOPS requirements
Use File Storage (EFS/FSx) when:
Multiple instances need shared access to the same data
You need POSIX file system semantics
Running containerized applications with shared configuration
ML training jobs accessing shared datasets
Use Object Storage (S3) when:
Building data lakes with diverse data types
Long-term retention and archival
Integrating with AWS analytics services (Athena, EMR, Glue)
Storing immutable data like logs, backups, or compliance records
Cost optimization is critical and access patterns allow higher latency
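The guidelines above can be condensed into a rough selection helper. This is an illustrative heuristic, not an official decision procedure; real choices also weigh cost, durability, and integration requirements.

```python
def choose_storage(shared_access: bool, needs_posix: bool,
                   latency_sensitive: bool) -> str:
    """Rough heuristic mirroring the guidelines above."""
    if shared_access or needs_posix:
        return "EFS/FSx"   # multiple instances or POSIX semantics required
    if latency_sensitive:
        return "EBS"       # single-instance, low-latency random access
    return "S3"            # scalable, cost-optimized, tolerates higher latency
```

For example, a single-node Postgres volume maps to EBS, a shared notebook home directory to EFS/FSx, and a data lake to S3.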
Data Engineering Pattern: The Three-Tier Approach
Many data engineering architectures use all three:
S3 for raw data ingestion, data lake storage, and long-term retention
EBS for database storage (Postgres, Airflow metadata) and temporary processing
EFS for shared resources (DAGs, scripts, libraries) across compute clusters
For example, a typical Spark pipeline might:
Read source data from S3 (cost-effective, scalable)
Use EBS volumes for local shuffle data during processing (fast random I/O)
Share UDF libraries and configuration via EFS (accessible to all workers)
Write results back to S3 in optimized formats like Parquet
This layered approach optimizes for both cost and performance across different stages of the data lifecycle.