Partitioning


Table Partitioning

Partitioning allows you to split one logically large table into smaller physical pieces (partitions). While the table looks like a single entity to your queries, the database engine only scans the specific "slices" of data it needs.

Why use it in Data Engineering?

  • Partition Pruning: If a query filters on the partition key (e.g., WHERE event_date = '2024-01-01'), the engine ignores all other partitions, drastically reducing I/O.

  • Fast Deletion (Drop Partition): Deleting millions of rows with DELETE is slow and logs heavily. Dropping an entire partition is a metadata change that takes milliseconds.

  • Parallelism: Modern engines can scan different partitions simultaneously using multiple CPU cores.


Common Partitioning Methods

Range Partitioning

Data is assigned to partitions based on a range of values. This is most common for Time-Series data.

-- Partitioning by month
CREATE TABLE sales (
    order_id INT,
    order_date DATE
) PARTITION BY RANGE (order_date);

-- The database creates separate storage for each range
CREATE TABLE sales_y2024_m01 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

List Partitioning

Data is assigned based on a discrete list of values, such as Regions or Status Codes.

Hash Partitioning

The engine applies a hash function to the partition key (like user_id) to spread data evenly across a fixed number of partitions. This is used to avoid "Hot Spots" (where one partition gets 90% of the data).


Partitioning vs. Indexing

These are often confused, but they serve different roles in your optimization toolkit:

Feature

Partitioning

Indexing

Logic

Physical separation of data.

A separate map (B-Tree/Hash) to data.

Best For

Large "Bulk" scans (e.g., "All of May").

Precise lookups (e.g., "User #4421").

Maintenance

Great for "aging out" old data.

Can slow down INSERT operations.


Practical "Pro-Tip" for Partition Keys

When choosing a partition key, look for the column that appears most frequently in your WHERE clauses. If you partition by region but everyone queries by date, you lose all the benefits of Partition Pruning, as the engine will still have to check every single partition.


Last updated