Anti-patterns

In Data Engineering, a "danger" in SQL isn't just about code that crashes; it’s about code that produces incorrect results or destroys performance as your data grows.

Here are the most critical SQL anti-patterns for your notes.

The `NOT IN` with NULLs Trap

As you mentioned, NOT IN (SELECT ...) is dangerous because of how SQL handles Three-Valued Logic (True, False, Unknown).

The Problem: If the subquery returns even a single NULL value, the entire NOT IN condition evaluates to UNKNOWN. As a result, the query will return zero rows, even if there are clearly matches.
The Fix: Use NOT EXISTS or a LEFT JOIN ... WHERE ... IS NULL. These are "null-safe" and often faster.

-- DANGEROUS: If any city_id in 'inactive_cities' is NULL, you get 0 results.
SELECT * FROM users 
WHERE city_id NOT IN (SELECT city_id FROM inactive_cities);

-- SAFE: Handles NULLs correctly and is more performant.
SELECT * FROM users u
WHERE NOT EXISTS (
    SELECT 1 FROM inactive_cities ic 
    WHERE ic.city_id = u.city_id
);

Functions on Filtered Columns (SARGability)

SARGable stands for Search ARgument ABle. If you wrap a column in a function inside a WHERE clause, the database cannot use an index on that column. It has to perform a Full Table Scan.

The Problem: Using YEAR(), TRUNC(), or UPPER() on the column side of the equation.
The Fix: Move the math/function to the value side so the column remains "naked."

-- ANTI-PATTERN: Database must calculate the year for every single row.
SELECT * FROM sales WHERE YEAR(created_at) = 2024;

-- OPTIMIZED: The 'created_at' column is "naked," allowing index usage.
SELECT * FROM sales 
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';

SELECT * (The "Lazy" Query)

In a Data Engineering pipeline, SELECT * is a major liability.

The Problem:
- 1. I/O Waste: You pull every column (including heavy text or JSON fields) even if you only need two. This breaks Projection Pushdown.
  2. Schema Fragility: If the source table adds a new column, your downstream views or Python scripts might break due to unexpected data.
The Fix: Explicitly name every column you need. It documents the code and saves memory.

Ordering by Column Position

Using integers in ORDER BY or GROUP BY (e.g., ORDER BY 1, 2) is common in quick analysis but dangerous in production code.

The Problem: If someone adds a new column to the SELECT statement later, your "1" or "2" now refers to the wrong data, and your report or logic silently sorts by the wrong field.
The Fix: Use the actual column names or aliases.

Correlated Subqueries in SELECT

A correlated subquery is a query inside the SELECT clause that refers to the outer query.

The Problem: The database often has to run that inner query once for every single row in the outer table. If you have 1 million rows, you just ran 1 million extra queries.
The Fix: Use a LEFT JOIN or a CTE to pre-calculate the values.

-- ANTI-PATTERN: Runs once per row.
SELECT 
    user_name,
    (SELECT MAX(created_at) FROM orders WHERE orders.user_id = users.id) as last_order
FROM users;

-- OPTIMIZED: Join the data once.
SELECT u.user_name, MAX(o.created_at)
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.user_name;

Summary

Anti-Pattern

Primary Risk

The Fix

NOT IN

Incorrect/Zero results

NOT EXISTS

Functions on Columns

Performance (No Index)

Keep column "naked"

SELECT *

Memory bloat / Broken code

Name specific columns

Correlated Subquery

Slow execution (N+1)

Use JOIN

PreviousStored Procedures NextWorking with JSON data

Last updated 18 days ago

hashtagThe NOT IN with NULLs Trap

hashtagFunctions on Filtered Columns (SARGability)

hashtagSELECT * (The "Lazy" Query)

hashtagOrdering by Column Position

hashtagCorrelated Subqueries in SELECT

hashtagSummary