# Partitioning & Clustering

***

### Why is the cost of partitioning known upfront?

The reason BigQuery can tell you the cost of partitioning upfront (during a dry run) is that it relies on Physical Separation and Static Metadata.

Unlike clustering, which sorts data *within* files, partitioning physically splits the table into distinct segments.

#### The Role of Metadata

When you create a partitioned table, BigQuery maintains a "metadata manifest" for that table. This manifest stores:

* The list of all partitions (e.g., `20240101`, `20240102`).
* The total size in bytes of each individual partition.

Because this metadata is updated every time data is loaded or changed, BigQuery doesn't need to look at the actual data to know how big a specific day is.

#### How the "Dry Run" Calculates Cost

When you type a query into the console and see the *"This query will process X bytes"* message, here is what happens behind the scenes:

1. Parsing: BigQuery looks at your `WHERE` clause for the partition column (e.g., `WHERE date_col = '2024-01-01'`).
2. Pruning: It identifies exactly which partitions match that filter.
3. Metadata Lookup: It queries its own internal metadata table for the size of those specific partitions.
4. Summation: It adds up the bytes of the selected columns within those partitions and presents the total to you.

#### When the "Upfront Cost" Fails

There are specific scenarios where the cost cannot be known upfront even with partitioning. This happens when the partition filter is dynamic:

* Subqueries: If you write `WHERE date_col = (SELECT MAX(date_col) FROM table)`, BigQuery has to run the subquery first to find the date. It cannot "prune" the partitions during the dry run phase, so it assumes a full table scan for the estimate.
* Non-Deterministic Functions: Using functions that change (like `CURRENT_DATE`) can sometimes prevent the query validator from giving a perfect estimate, though BigQuery has become much better at resolving these constant expressions recently.

***

| **Feature**  | **Known Upfront?** | **Why?**                                                           |
| ------------ | ------------------ | ------------------------------------------------------------------ |
| Partitioning | Yes                | Uses pre-calculated metadata for physical table segments.          |
| Clustering   | No                 | Sorting is internal to blocks; pruning happens *during* execution. |

***

### Partitioning vs. Clustering

In BigQuery, **Partitioning** and **Clustering** are the two primary weapons for reducing query costs and improving performance. Think of them as two layers of organization: one that slices the table into big chunks and one that sorts the rows inside those chunks.

The key difference is how they physically divide your data.

| **Feature**  | **Partitioning**                                                        | **Clustering**                                                          |
| ------------ | ----------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| **Concept**  | Divides table into "slices" based on a column (e.g., Date).             | Sorts data "inside" each partition based on up to 4 columns.            |
| **Analogy**  | The Library Floor: You go to the 2024 floor and ignore all other years. | The Bookshelf: Once on the 2024 floor, books are alphabetized by Title. |
| **Best For** | Time-series data or broad integer ranges.                               | High-cardinality columns (many unique values like `user_id`).           |
| **Limit**    | Exactly one column per table (Max 4,000 partitions).                    | Up to four columns per table.                                           |
| **Cost**     | Known before you run the query (via dry run).                           | Known only after the query finishes.                                    |

***

#### Why is Clustering cost not known upfront?

In a regular or partitioned table, BigQuery uses metadata to know exactly which files (blocks) it needs to touch before the query even starts. Because partitions are distinct physical segments, BigQuery can calculate the exact "bytes to be scanned" during a dry run.

Clustering is different:

* Block Pruning: BigQuery organizes data into small blocks. Unlike partitions, these blocks don't have rigid, pre-defined boundaries.
* Dynamic Search: As the query runs, BigQuery looks at the "min/max" values of the clustered columns in each block. If your filter (e.g., `WHERE user_id = '123'`) doesn't fall in that range, it skips the block.
* The Catch: BigQuery only discovers if it can skip a block *while it is actually executing the query*. Because it can't predict exactly how many blocks will be pruned without looking at the internal block metadata during execution, it cannot give you an accurate cost estimate beforehand.

***

#### What is Automatic Reclustering?

In BigQuery, data is "best-effort" sorted. As you stream new data or perform DML operations (inserts/updates), the perfect sorting of your clustered columns begins to "weaken." New data might be appended in a messy, unsorted way, creating overlapping ranges.

Automatic Reclustering is BigQuery’s background "Cleanup Fairy":

* Background Process: It periodically scans your clustered tables and re-sorts the data to restore the optimal sort order.
* Zero Cost to You: Unlike some other warehouses (like Snowflake), Google does not charge you for the compute used for automatic reclustering. You only pay for the storage of the newly organized data.
* Seamless: It happens in the background without impacting your query performance or availability.

***

#### Summary Checklist

* Use Partitioning when you want strict cost control and your data naturally fits into a time or integer range.
* Use Clustering on top of partitioning when you frequently filter by specific IDs or categories within those time ranges.
* Combine them to get "Double Pruning": BigQuery first throws away the wrong dates (Partitioning), then throws away the wrong user IDs within the correct date (Clustering).

***
