Query Optimizations
In data engineering, you’ll inevitably encounter poorly performing queries. Knowing how to identify and fix these queries is invaluable. Don’t fight your database. Learn to work with its strengths and augment its weaknesses. This section shows various ways to improve your query performance.
Pushdown optimizations
Pushdown optimizations are all about moving the "work" as close to the data source as possible. Instead of pulling a massive amount of data into memory and then filtering it, the database filters the data while it's reading it from the disk.
Predicate Pushdown (Filtering Early)
A predicate is simply the condition in your WHERE clause (e.g., city = 'New York').
Predicate Pushdown moves these filters directly into the data scan process.
Without Optimization: The engine reads 10 million rows from the table into memory, then applies the filter to keep only 500 rows.
With Optimization: The engine tells the storage layer, "Only send me rows where
city = 'New York'." Only 500 rows ever travel over the network or into memory.
Projection Pushdown (Selecting Less)
A projection is the list of columns you choose in your SELECT statement.
Projection Pushdown ensures the engine only reads the specific columns you asked for, rather than reading the entire row.
The Benefit: This is the primary reason why Columnar formats (like Parquet or Snowflake's internal storage) are so fast. If a table has 200 columns but you only select 2, the engine physically skips the other 198 columns on the disk.
Optimal Join Orderings
When you join three or more tables, the order in which the database joins them can drastically change the performance. The Query Optimizer tries to find the path that keeps the "intermediate" data as small as possible.
The Strategy: Usually, the optimizer tries to join the smallest, most filtered tables first.
If you join a tiny
categorytable to a filteredsalestable first, the resulting set is small.If you join two massive
facttables first, you create a giant intermediate dataset that might crash the engine or slow it down significantly.
Other Key Optimizations
Limit Pushdown
If you have a query like SELECT * FROM table LIMIT 10, the optimizer tells the storage layer to stop reading as soon as it finds the first 10 rows, rather than reading the whole file.
Constant Folding
The optimizer simplifies expressions before running the query.
Your SQL:
WHERE salary > 1000 * 12Optimizer's Version:
WHERE salary > 12000This prevents the database from performing the multiplication for every single row.
Partition Pruning
This is perhaps the most important optimization for Big Data. If your data is partitioned by date and you write WHERE date = '2023-01-01', the engine completely ignores all folders or files that belong to other dates. It never even opens them.
Summary Table: Where is the work saved?
Optimization
What is saved?
Primary Benefit
Predicate Pushdown
Rows
Reduced network/memory usage.
Projection Pushdown
Columns
Reduced I/O (especially in Columnar formats).
Join Ordering
Intermediate RAM
Prevents "memory overflow" on large joins.
Partition Pruning
Files/Folders
Massive speed increase by skipping data.
Last updated