Snowflake


Lecture from Andy Pavlo about Snowflake internals

Snowflake product demos


The Elevator Pitch

Snowflake is a Software-as-a-Service (SaaS) data platform built entirely for the cloud. Unlike traditional data warehouses (which were on-premise) or early cloud adaptations (which were just ported servers), Snowflake was architected from scratch to leverage the infinite scale of the cloud.

  • Cloud Agnostic: It runs on top of AWS, Azure, or Google Cloud (GCP) but abstracts the underlying infrastructure away from you, while still inheriting the cloud's benefits: high availability, scalability, elasticity, and durability.

  • Single Source of Truth: It aims to be a single platform for Data Warehousing, Data Lakes, Data Engineering, and Data Science.

Because Snowflake is delivered as a SaaS:

  • There is no hardware to manage.

  • The user performs no manual upgrades or patches. Snowflake applies weekly online updates that are completely transparent to the user and cause no visible downtime.

  • The subscription model is flexible: you pay only for what, and how much, you use.

  • It is easy to access via the UI, the CLI, or programmatically.

  • Optimizations like partitioning of data are done automatically as part of the loading process.


The Architecture: The "Three-Layer Cake"

This is the most critical concept to understand. Snowflake separates its functions into three distinct layers that scale independently.


Layer 1: Data Storage (The Disk)

  • What it does: Stores your data.

  • How it works: When you load data, Snowflake converts it into its own optimized, compressed columnar format (called micro-partitions). Each micro-partition holds roughly 50-500 MB of uncompressed data.

    • The data in these micro-partitions is encrypted, compressed, and laid out column-by-column, which improves performance when running OLAP queries that scan only a subset of columns.

  • Where it lives: In the cloud provider's blob storage (e.g., AWS S3, Azure Blob), but you don't manage this directly; you only choose the provider, and Snowflake manages the rest.

  • Key Trait: It is infinitely scalable and quite cheap.

Layer 2: Query Processing (The Muscle)

  • What it does: Executes your queries (SELECT, INSERT, UPDATE).

  • How it works: It uses "Virtual Warehouses", which are named clusters of compute resources. You can spin these up (and suspend them) almost instantly.

    • You create a virtual warehouse object using SQL. Behind the scenes, Snowflake allocates one of its compute clusters to the user or spins up a new one. The warehouse performs the computation required to process a SQL query against the table data in the centralized storage layer.

  • Key Trait: This is Elastic. You can have a "Small" cluster for reporting and a "4X-Large" cluster for heavy data loading running simultaneously. They never compete for resources because they are physically separate clusters looking at the same data (Layer 1).

When we issue a SQL query, we set which virtual warehouse to use. This virtual warehouse will then make a remote call to the storage layer and retrieve the raw table data, store it on a local high-speed cache, and compute the results.

Any standard data processing query (SELECT, INSERT, UPDATE, DELETE) utilizes a Virtual Warehouse.

In Snowflake, you must explicitly tell the system which compute cluster (Virtual Warehouse) you want to use to run your query.

Here is a typical workflow:
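A minimal sketch of that workflow in Snowflake SQL (the warehouse, table, and column names here are hypothetical):

```sql
-- 1. Create a virtual warehouse; it auto-suspends when idle
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60      -- seconds of inactivity before suspending
  AUTO_RESUME    = TRUE;   -- wake up automatically when a query arrives

-- 2. Tell the session which warehouse to run queries on
USE WAREHOUSE reporting_wh;

-- 3. Run a standard query; the warehouse pulls micro-partitions from
--    the storage layer, caches them locally, and computes the result
SELECT customer_id, SUM(amount)
FROM   orders
GROUP BY customer_id;

-- 4. Optionally suspend explicitly to stop paying for compute
ALTER WAREHOUSE reporting_wh SUSPEND;
```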

Layer 3: Cloud Services (The Brain)

  • What it does: The "control plane" that manages everything else.

  • Duties: Authentication, security, metadata management (knowing where data is stored), query optimization, and transactions.

  • Key Trait: You don't configure this. It just works. It ensures ACID transactions and security.


The "Secret Sauce": Separation of Storage & Compute

In traditional databases (like SQL Server or old Postgres), the disk and the CPU were tied together in one box. If you needed more storage, you often had to buy more CPU, and vice versa.

In Snowflake:

  • Storage is decoupled from Compute.

  • Benefit: You can store Petabytes of historical data (cheap storage) without paying for a massive server to keep it running. You only pay for the Compute (Virtual Warehouse) when you are actually running a query.

  • Concurrency: Marketing can run their BI dashboards on Warehouse A, while Data Engineers run massive ETL jobs on Warehouse B. Both access the same data, but neither slows the other down.
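Concretely, workload isolation is just separate warehouse objects reading the same Layer 1 data (warehouse names below are hypothetical):

```sql
-- BI dashboards get a small, cheap cluster
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'SMALL';

-- ETL gets a large cluster; both see the same tables,
-- but their compute never competes
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'X4LARGE';
```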


Killer Features

Snowflake has several "quality of life" features that make it popular:

  • Time Travel: You can query data as it looked in the past.

    • Example: SELECT * FROM my_table AT(OFFSET => -60*10); returns the table as it looked 10 minutes ago (the offset is in seconds), letting you undo a bad delete instantly.

  • Zero-Copy Cloning: You can clone a massive production database into a "Dev" environment in seconds. It doesn't actually copy the data (saving storage costs); it just creates new metadata pointers.

  • Snowpipe: An event-driven ingestion tool that loads data the moment files land in your cloud storage (S3/Azure Blob).

  • Data Sharing: You can share live tables with other Snowflake users (even outside your company) without copying files via FTP or APIs.
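Sketches of these features in SQL (the database, table, pipe, stage, share, and account names are all hypothetical):

```sql
-- Zero-Copy Cloning: a dev copy of prod in seconds, no data duplicated
CREATE DATABASE dev_db CLONE prod_db;

-- Time Travel + cloning: restore a table to its state 10 minutes ago
CREATE OR REPLACE TABLE my_table CLONE my_table AT(OFFSET => -60*10);

-- Snowpipe: auto-ingest files as they land in an external stage
CREATE PIPE orders_pipe AUTO_INGEST = TRUE AS
  COPY INTO orders FROM @orders_stage FILE_FORMAT = (TYPE = 'CSV');

-- Data Sharing: expose a live table to another Snowflake account
CREATE SHARE orders_share;
GRANT USAGE  ON DATABASE prod_db               TO SHARE orders_share;
GRANT USAGE  ON SCHEMA   prod_db.public        TO SHARE orders_share;
GRANT SELECT ON TABLE    prod_db.public.orders TO SHARE orders_share;
ALTER SHARE orders_share ADD ACCOUNTS = partner_account;
```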


How You Interact With It

  • SQL: Snowflake uses standard ANSI SQL. If you know SQL, you know 95% of Snowflake.

  • Snowsight: The modern web UI for writing queries, viewing dashboards, and managing admin tasks.

  • SnowSQL: The command-line interface (CLI).

  • Drivers: Connects easily with Python, Go, Node.js, JDBC/ODBC, Spark, etc.


Summary Table

| Feature        | Traditional DW                   | Snowflake                        |
| -------------- | -------------------------------- | -------------------------------- |
| Infrastructure | Hardware/Servers                 | 100% SaaS (Serverless)           |
| Scaling        | Difficult (add nodes)            | Instant (resize/suspend clusters)|
| Concurrency    | Queries fight for resources      | Workloads are isolated           |
| Maintenance    | Indexing, Vacuuming, Partitioning| Automated (mostly)               |
| Pricing        | Fixed Capacity                   | Pay per second of usage          |

