Filtering: WHERE clause


The WHERE Clause

The WHERE clause is used to filter records before any grouping or aggregation occurs. It acts as a gatekeeper, determining which rows are pulled from the disk into memory.

Comparison Operators

Operator

Description

Example

=

Exact match

WHERE status = 'active'

!= or <>

Not equal to

WHERE category != 'test'

>, <, >=, <=

Range comparisons

WHERE price >= 100

BETWEEN

Inclusive range

WHERE age BETWEEN 18 AND 25

IN

Match any in a list

WHERE region IN ('US', 'CA', 'UK')

IS NULL

Check for empty values

WHERE email IS NULL


Advanced Pattern Matching

Data is often messy. You won't always have an exact match.

  • LIKE: Case-sensitive pattern matching.

    • % = Multi-character wildcard.

    • _ = Single-character wildcard.

  • ILIKE: (Postgres Specific) Case-insensitive version of LIKE.

  • SIMILAR TO: Uses SQL regular expressions.

  • ~: Uses POSIX Regular Expressions (the most powerful).


Handling NULLs: The Tri-state Logic

In SQL, NULL does not mean "zero" or "empty string"—it means unknown. This leads to Three-Valued Logic: TRUE, FALSE, and UNKNOWN.

The Trap: NULL = NULL is not TRUE. It is UNKNOWN.

To filter safely, you must use IS NULL or IS NOT NULL. Alternatively, use COALESCE to provide a default value during the filter.


Logical Operators & Precedence

You can combine filters using AND, OR, and NOT.

Crucial Note on Parentheses: AND has higher precedence than OR. Always use parentheses to avoid "Leaky Filters."


Performance: SARGable Queries

In Data Engineering, we care about SARGable (Search ARGumentable) queries. This means writing filters in a way that allows the Query Optimizer to use an index.

  • Non-SARGable: WHERE YEAR(created_at) = 2024 (The database has to run a function on every single row, forcing a Table Scan).

  • SARGable: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01' (The database can use a B-Tree Index on the created_at column directly).


Filtering with DISTINCT

While not technically a WHERE clause, DISTINCT is a filter that removes duplicate rows from your result set.

  • Postgres Superpower: DISTINCT ON (column). It returns the first row for each group of the specified column.


Summary

  • WHERE filters rows; HAVING filters groups.

  • COALESCE is vital for filtering columns that contain NULL.

  • Avoid functions on columns in the WHERE clause if you want to keep your queries fast (SARGable).

  • ILIKE and ~ (Regex) make Postgres much more flexible for string filtering than standard SQL.


Last updated