# Types of storage

***

### **The Physical Foundation: Disk Blocks**

Beneath all the storage abstractions discussed below lies the physical storage medium, organized into disk blocks (also called sectors or physical blocks). These are the actual hardware-level units that the storage device can read or write in a single operation.

Modern hard drives typically use 4KB physical blocks, while SSDs may use different page sizes (often 4KB, 8KB, or 16KB). These physical blocks represent the atomic unit of I/O at the hardware level—the device cannot read or write anything smaller without retrieving the entire block.
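Because the block is the atomic unit of I/O, a file's on-disk footprint is always rounded up to whole blocks, and even a one-byte read costs a full block. A minimal sketch of that rounding, assuming a 4KB block size:

```python
import math

BLOCK_SIZE = 4096  # assumed 4KB physical block

def blocks_needed(size_bytes: int) -> int:
    """Number of whole blocks a payload of size_bytes occupies on disk."""
    return math.ceil(size_bytes / BLOCK_SIZE)

# A 1-byte file still consumes one full 4KB block:
print(blocks_needed(1))      # 1
print(blocks_needed(4096))   # 1
print(blocks_needed(4097))   # 2
```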

**The Abstraction Hierarchy**

The relationship between these levels forms a layered architecture:

1. **Physical disk blocks**: The hardware's native addressable units
2. **Block storage**: Logical blocks exposed to the operating system (may align with or abstract physical blocks)
3. **File systems**: Built on block storage to provide file and directory abstractions
4. **Object storage**: Often built on file systems or block storage, adding its own object-oriented layer

Each layer introduces overhead but provides valuable functionality. Block storage adds minimal overhead while enabling flexibility. File systems add metadata management and hierarchical organization. Object storage adds scalability, rich metadata, and simplified distribution at the cost of access pattern constraints (primarily read-entire-object operations).
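Two of these layers are observable from ordinary code on a POSIX system: the file system's logical block size via `os.statvfs`, and a file's on-disk allocation (reported in 512-byte units by `st_blocks`) via `os.stat`. A small probe, assuming a Linux/macOS environment:

```python
import os
import tempfile

# Logical block size the file system exposes (the file-system layer)
bsize = os.statvfs("/").f_bsize
print(f"file system block size: {bsize} bytes")

# Write a tiny file and compare its logical size to its on-disk allocation
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x")              # one logical byte
    f.flush()
    os.fsync(f.fileno())       # force allocation so st_blocks is populated
    path = f.name

st = os.stat(path)
print(f"logical size: {st.st_size} B, allocated: {st.st_blocks * 512} B")
os.remove(path)
```

On most file systems the one-byte file is allocated a full block, making the rounding described above directly visible.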

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FIIK3ZVvKj9HW5M1rCiRz%2Fstorage-types-diagram.svg?alt=media&#x26;token=87676000-eab7-4b55-aecf-ddab550603a3" alt=""><figcaption></figcaption></figure>

| **Feature** | **Object Storage**                           | **Block Storage**                     | **File Storage**               |
| ----------- | -------------------------------------------- | ------------------------------------- | ------------------------------ |
| Structure   | Flat pool of objects                         | Grid of blocks                        | Hierarchical folders           |
| Access      | API (PUT, GET)                               | OS/Driver (Read/Write blocks)         | OS/Network (Read/Write files)  |
| Metadata    | Rich, customizable                           | None                                  | Fixed, limited                 |
| Best For    | Unstructured data, backup, archive, web apps | Databases, high-performance computing | Shared files, user directories |

### **Block Storage**

Block storage operates at the most granular level, presenting storage as a collection of fixed-size blocks that can be individually read or written. Each block has a unique address, and the storage system doesn't impose any structure on how data is organized within those blocks.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FAhxOVuCEiWPdRclhniuy%2FGemini_Generated_Image_flfpqkflfpqkflfp%20(3).png?alt=media&#x26;token=039b47d8-8070-49c9-88ac-f72737fd013c" alt=""><figcaption></figcaption></figure>

**Characteristics:**

* Provides raw storage volumes that appear as local disks to the operating system
* Requires formatting with a file system (ext4, XFS, NTFS) before use
* Supports both read and write operations at the block level
* Offers low-latency access since it's typically directly attached or on high-speed networks
* Allows incremental updates—you can modify small portions without rewriting entire files

**AWS Example: Amazon EBS (Elastic Block Store)**

* Attaches to a single EC2 instance at a time (though some volume types support multi-attach)
* Persists independently of instance lifecycle
* Offers different performance tiers (gp3, io2, st1, sc1) based on IOPS and throughput needs
* Supports snapshots for backup and recovery
* Volume sizes from 1 GiB up to 16 TiB for most volume types (io2 Block Express scales to 64 TiB)

**Use Cases for Data Engineering:**

* Database storage (PostgreSQL, MySQL, MongoDB) where you need consistent low-latency access
* Transactional workloads requiring frequent random reads and writes
* Running analytics databases like ClickHouse or TimescaleDB on EC2
* Storing Kafka logs or other streaming data that requires fast sequential writes
* Running Spark or Hadoop worker nodes where intermediate shuffle data needs fast local storage

**Limitations:**

* Single availability zone—if the AZ fails, the volume is unavailable until recovery
* Typically attached to one instance, limiting concurrent access
* Costs scale linearly with provisioned capacity, regardless of actual usage
* Manual management of capacity—you must provision and resize volumes explicitly

***

### **File Storage**

File storage adds a hierarchical namespace on top of block storage, organizing data into directories and files. It provides shared access through network protocols, allowing multiple clients to read and write simultaneously.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2F6mqfUd2RmkCETeHx73oG%2Funnamed%20(3).jpg?alt=media&#x26;token=37da1c51-b65e-4ff4-b943-b42dc97ea058" alt=""><figcaption></figcaption></figure>

**Characteristics:**

* Presents a POSIX-compliant file system interface
* Supports concurrent access from multiple compute instances
* Handles file locking, permissions, and metadata automatically
* Scales elastically—capacity grows and shrinks based on actual usage
* Maintains file system semantics (directories, symbolic links, permissions)
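One concrete POSIX semantic file storage provides is advisory file locking, which lets multiple clients coordinate writes to a shared file. A sketch using `fcntl.lockf` byte-range locks (Linux/macOS; the `/tmp` path stands in for a path under an NFS mount):

```python
import fcntl

path = "/tmp/shared.log"  # on EFS this would be a path under the NFS mount point

with open(path, "a") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)      # exclusive POSIX byte-range lock
    f.write("worker-1: job done\n")    # no other locking writer can interleave
    f.flush()
    fcntl.lockf(f, fcntl.LOCK_UN)      # release the lock

print(open(path).read().splitlines()[-1])
```

On a shared file system, every worker following this lock/write/unlock discipline gets whole-line appends; without the lock, concurrent appends from different hosts can interleave.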

**AWS Example: Amazon EFS (Elastic File System)**

* Automatically scales from gigabytes to petabytes
* Accessible from multiple EC2 instances simultaneously across multiple AZs
* Supports the NFSv4.0 and NFSv4.1 protocols
* Offers different storage classes (Standard, Infrequent Access)
* Provides lifecycle management to move files between storage classes

**Use Cases for Data Engineering:**

* Shared configuration files and scripts across a cluster of processing nodes
* Home directories for user workspaces in notebook environments (JupyterHub, Zeppelin)
* Shared libraries and dependencies for containerized applications
* Content management systems where multiple applications need access to the same files
* Machine learning training data that needs to be accessed by multiple training jobs simultaneously
* Centralized logging where multiple services write to shared log directories

**Limitations:**

* Higher latency compared to block storage due to network overhead
* More expensive per GB than object storage
* Performance can be inconsistent with highly concurrent workloads
* Not ideal for extremely high-throughput sequential reads/writes
* POSIX semantics can create bottlenecks at scale (directory listing operations, for example)

***

### **Object Storage**

Object storage treats data as discrete, self-contained objects rather than files or blocks. Each object consists of data, metadata, and a unique identifier. The architecture is designed for massive scalability and distributed access.

Each piece of data you store is called an object. As the diagram below shows, an object is made up of three key components:

* Data (Payload): The actual file content you are storing (e.g., an image, a document, a log file).
* Metadata: Customizable labels and tags that describe the object (e.g., `author=jane`, `project=website`, `retention=1year`). This makes it easy to search and manage your data.
* Unique ID (Key): A unique identifier, similar to a URL, that you use to retrieve the object (e.g., `my-bucket/photos/vacation/beach.jpg`).

You interact with this storage pool over the internet using simple API commands like `PUT` (to upload), `GET` (to download), and `DELETE`.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FVMGpzXe0ULaym8CustMb%2FGemini_Generated_Image_flfpqkflfpqkflfp.png?alt=media&#x26;token=fc8d912e-a44e-493b-8ba1-80ee2cf4ec85" alt=""><figcaption></figcaption></figure>

**Characteristics:**

* Flat namespace—objects are identified by keys rather than hierarchical paths (though keys can simulate paths)
* Objects are immutable—modifications require uploading a new version
* Scales to virtually unlimited data volumes and object counts
* Highly durable (11 nines of durability for S3)
* Historically eventually consistent for overwrites and deletes; S3 has provided strong read-after-write consistency since December 2020
* Rich metadata support—arbitrary key-value pairs per object
* Accessible via HTTP/HTTPS APIs rather than file system protocols

**AWS Example: Amazon S3 (Simple Storage Service)**

* Organizes objects into buckets (top-level containers)
* Supports versioning, lifecycle policies, and event notifications
* Multiple storage classes (Standard, Intelligent-Tiering, Glacier, Deep Archive)
* Integrates with virtually every AWS service
* Offers features like S3 Select for querying data without downloading

**Use Cases for Data Engineering:**

* Data lakes—storing raw data in various formats (CSV, JSON, Parquet, Avro)
* Long-term archival and backup storage
* Storing results from batch processing jobs (Spark output, ETL results)
* Serving as source/destination for data pipelines (Airflow, Glue, Step Functions)
* Hosting static datasets for analytics and machine learning
* Storing intermediate results between pipeline stages
* Log aggregation and retention
* Serving as a staging area for data warehouse loading (Redshift, Snowflake)
* Immutable audit trails and compliance data

**Advanced S3 Features for Data Engineers:**

* **S3 Select**: Query CSV/JSON/Parquet files using SQL without downloading entire objects
* **Partitioning**: Organize data using key prefixes (e.g., `s3://bucket/year=2025/month=01/day=15/data.parquet`)
* **Storage Classes**: Automatically transition data between tiers based on access patterns
* **Batch Operations**: Perform operations on billions of objects (copying, tagging, restoring)
* **Event Notifications**: Trigger Lambda functions or SQS messages when objects are created/deleted
* **Transfer Acceleration**: Speed up uploads using CloudFront edge locations
* **Requester Pays**: Let data consumers pay for transfer costs
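The `year=/month=/day=` convention from the partitioning bullet is just string formatting over a date; engines like Athena and Spark parse these prefixes back into partition columns. A small helper (the bucket name is hypothetical):

```python
from datetime import date

def partitioned_key(bucket: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key for a given date."""
    return (f"s3://{bucket}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/{filename}")

print(partitioned_key("my-data-lake", date(2025, 1, 15), "data.parquet"))
# s3://my-data-lake/year=2025/month=01/day=15/data.parquet
```

Keeping the zero-padding consistent (`month=01`, not `month=1`) matters: prefix-based listing and partition pruning both compare keys as strings.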

**Limitations:**

* Higher latency than block/file storage—typically 100-200ms for first byte
* Must read/write entire objects—no partial updates (though multipart upload helps)
* LIST operations can be slow with millions of objects
* Not suitable for transactional workloads or databases
* Limited query capabilities without additional tools (Athena, Presto, Spark)
* Network bandwidth becomes a bottleneck for frequent access patterns
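The multipart-upload workaround mentioned above works by splitting a large payload into independent parts that upload in parallel and are stitched together server-side. The splitting step itself is fixed-size chunking (the 300-byte part size here is arbitrary; real S3 multipart uploads require parts of at least 5 MB except the last):

```python
def split_into_parts(data: bytes, part_size: int) -> list[bytes]:
    """Chunk a payload into fixed-size parts; the last part may be short."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

payload = b"a" * 1000
parts = split_into_parts(payload, part_size=300)
print([len(p) for p in parts])     # [300, 300, 300, 100]
assert b"".join(parts) == payload  # reassembly is lossless
```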

#### **Object Storage Versioning**

One of the most powerful features of object storage is **versioning**. When you enable versioning on a bucket, the system keeps a history of every object.

When you update an object (by uploading a new file with the same key), it doesn't overwrite the old data. Instead, it creates a new version and makes it the current one. The old version is preserved, and you can still access it by specifying its unique version ID.

This is crucial for data protection. It allows you to recover from accidental deletions or overwrites. The diagram below illustrates this concept, showing a single logical object (`image.jpg`) with a history of three distinct versions.

<figure><img src="https://2332658533-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG5fhKjYnbaQlTPTcaO85%2Fuploads%2FXrrv3adCDX1d9nJZWizd%2FGemini_Generated_Image_flfpqkflfpqkflfp%20(2).png?alt=media&#x26;token=9eb9f987-68f3-4089-b8c6-ad5176263afe" alt=""><figcaption></figcaption></figure>

In this diagram, a `GET` request for `image.jpg` returns the current version (Version 3), while a request for `image.jpg?versionId=v1` retrieves the original file (Version 1). The diagram also shows that even after a `DELETE` operation, the earlier versions remain stored and can be recovered if needed.

***

### **Hybrid and Specialized Storage Options**

**Amazon FSx Family:**

* **FSx for Lustre**: High-performance parallel file system for HPC and ML training workloads. Integrates with S3 for seamless data access. Offers throughput up to hundreds of GB/s.
* **FSx for Windows File Server**: Fully managed Windows file servers with SMB protocol support
* **FSx for NetApp ONTAP**: Enterprise NAS with advanced data management features
* **FSx for OpenZFS**: File storage built on OpenZFS with snapshots and cloning

**Use case**: When EFS doesn't provide enough performance, FSx for Lustre is ideal for ML training, genomics processing, or financial modeling requiring massive parallel throughput.

**AWS Storage Gateway:**

* Hybrid cloud storage connecting on-premises environments to AWS
* File Gateway: Presents S3 as NFS/SMB file shares
* Volume Gateway: Presents S3 as iSCSI block storage
* Tape Gateway: Virtual tape library for backup applications

**Use case**: Gradually migrating on-premises data workloads to cloud while maintaining local access patterns.

**Amazon S3 on Outposts:**

* Run S3 object storage on-premises with the same APIs
* Useful for data residency requirements or low-latency local access

***

#### **Performance Comparison**

The table below compares typical performance characteristics across the storage types discussed above:

| Storage Type        | Latency         | Throughput         | IOPS         | Concurrency        | Cost (relative) |
| ------------------- | --------------- | ------------------ | ------------ | ------------------ | --------------- |
| **Block (EBS io2)** | <1ms            | 1,000+ MB/s        | 64,000+      | Single instance    | High            |
| **Block (EBS gp3)** | Single-digit ms | 125-1,000 MB/s     | 3,000-16,000 | Single instance    | Medium          |
| **File (EFS)**      | Low ms          | Bursts to 10+ GB/s | Varies       | Multiple instances | Medium-High     |
| **Object (S3)**     | 100-200ms       | Parallel scaling   | N/A          | Unlimited          | Low             |
| **FSx Lustre**      | Sub-ms          | Hundreds of GB/s   | Millions     | Multiple instances | High            |

**Key Notes:**

* **Latency**: Time to access first byte of data
* **Throughput**: Maximum data transfer rate (sequential operations)
* **IOPS**: Input/Output Operations Per Second (random access operations)
* **Concurrency**: Number of clients/instances that can access simultaneously
* **Cost**: Relative pricing per GB stored and per operation

**For Data Engineering Context:**

* Use **Block (io2)** for high-performance databases requiring consistent low latency
* Use **Block (gp3)** for general-purpose workloads with balanced performance/cost
* Use **File (EFS)** when multiple workers need shared access to the same datasets
* Use **Object (S3)** for data lakes and long-term storage where cost optimization matters
* Use **FSx Lustre** for ML training or HPC workloads requiring extreme parallel throughput

***

### **Choosing the Right Storage**

**Use Block Storage (EBS) when:**

* Running databases that need low-latency random access
* Operating systems and boot volumes
* Real-time analytics requiring fast local storage
* Single-instance applications with high IOPS requirements

**Use File Storage (EFS/FSx) when:**

* Multiple instances need shared access to the same data
* You need POSIX file system semantics
* Running containerized applications with shared configuration
* ML training jobs accessing shared datasets

**Use Object Storage (S3) when:**

* Building data lakes with diverse data types
* Long-term retention and archival
* Integrating with AWS analytics services (Athena, EMR, Glue)
* Storing immutable data like logs, backups, or compliance records
* Cost optimization is critical and access patterns allow higher latency

**Data Engineering Pattern: The Three-Tier Approach**

Many data engineering architectures use all three:

1. **S3** for raw data ingestion, data lake storage, and long-term retention
2. **EBS** for database storage (Postgres, Airflow metadata) and temporary processing
3. **EFS** for shared resources (DAGs, scripts, libraries) across compute clusters

For example, a typical Spark pipeline might:

* Read source data from **S3** (cost-effective, scalable)
* Use **EBS** volumes for local shuffle data during processing (fast random I/O)
* Share UDF libraries and configuration via **EFS** (accessible to all workers)
* Write results back to **S3** in optimized formats like Parquet

This layered approach optimizes for both cost and performance across different stages of the data lifecycle.

***
