Query Optimizations


In data engineering, you’ll inevitably encounter poorly performing queries. Knowing how to identify and fix these queries is invaluable. Don’t fight your database. Learn to work with its strengths and augment its weaknesses. This section shows various ways to improve your query performance.


Pushdown optimizations

Pushdown optimizations are all about moving the "work" as close to the data source as possible. Instead of pulling a massive amount of data into memory and then filtering it, the database filters the data while it's reading it from the disk.


Predicate Pushdown (Filtering Early)

A predicate is simply the condition in your WHERE clause (e.g., city = 'New York').

Predicate Pushdown moves these filters directly into the data scan process.

  • Without Optimization: The engine reads 10 million rows from the table into memory, then applies the filter to keep only 500 rows.

  • With Optimization: The engine tells the storage layer, "Only send me rows where city = 'New York'." Only 500 rows ever travel over the network or into memory.


Projection Pushdown (Selecting Less)

A projection is the list of columns you choose in your SELECT statement.

Projection Pushdown ensures the engine only reads the specific columns you asked for, rather than reading the entire row.

  • The Benefit: This is the primary reason why Columnar formats (like Parquet or Snowflake's internal storage) are so fast. If a table has 200 columns but you only select 2, the engine physically skips the other 198 columns on the disk.


Optimal Join Orderings

When you join three or more tables, the order in which the database joins them can drastically change the performance. The Query Optimizer tries to find the path that keeps the "intermediate" data as small as possible.

The Strategy: Usually, the optimizer tries to join the smallest, most filtered tables first.

  • If you join a tiny category table to a filtered sales table first, the resulting set is small.

  • If you join two massive fact tables first, you create a giant intermediate dataset that might crash the engine or slow it down significantly.


Other Key Optimizations

Limit Pushdown

If you have a query like SELECT * FROM table LIMIT 10, the optimizer tells the storage layer to stop reading as soon as it finds the first 10 rows, rather than reading the whole file.

Constant Folding

The optimizer simplifies expressions before running the query.

  • Your SQL: WHERE salary > 1000 * 12

  • Optimizer's Version: WHERE salary > 12000

  • This prevents the database from performing the multiplication for every single row.

Partition Pruning

This is perhaps the most important optimization for Big Data. If your data is partitioned by date and you write WHERE date = '2023-01-01', the engine completely ignores all folders or files that belong to other dates. It never even opens them.


Summary Table: Where is the work saved?

Optimization

What is saved?

Primary Benefit

Predicate Pushdown

Rows

Reduced network/memory usage.

Projection Pushdown

Columns

Reduced I/O (especially in Columnar formats).

Join Ordering

Intermediate RAM

Prevents "memory overflow" on large joins.

Partition Pruning

Files/Folders

Massive speed increase by skipping data.


Last updated