Indexing


In the world of Data Engineering, an Index is a separate data structure (usually a B-Tree) that allows the database to find specific rows without scanning every single page of a table.

Think of it like the index at the back of a textbook: instead of reading the whole book to find "Indexing," you look up the word and jump straight to the page number.


How to Create Indexes

The syntax for creating a standard index is straightforward. You specify the name of the index, the table, and the column(s) you want to speed up.

-- Basic Index
CREATE INDEX idx_user_id ON orders(user_id);

-- Unique Index (also enforces data integrity)
CREATE UNIQUE INDEX idx_order_number ON orders(order_number);

-- Composite Index (speeds up queries filtering on multiple columns)
CREATE INDEX idx_cust_date ON orders(customer_id, order_date);

Types of Indexes

Depending on your database "flavor" (Postgres, MySQL, SQL Server), you have several specialized structures:

  • B-Tree (Balanced Tree): The default for almost all databases. Excellent for equality (=) and range queries (>, <).

  • Bitmap Index: Often used in Data Warehousing for columns with low "cardinality" (few unique values, like gender or boolean_flag).

  • Hash Index: Extremely fast for exact matches (=) but useless for range scans or sorting.

  • GIST/GIN: Used for complex data types like Full-Text Search, JSONB, or Geographic coordinates.


When to Add an Index

Adding indexes isn't free—they take up disk space and slow down write operations. Only add them if:

  • The column is frequently used in WHERE clauses.

  • The column is used for JOIN conditions.

  • The column is used in ORDER BY or GROUP BY statements.

  • The table is large (millions of rows) and you frequently perform selective lookups.


The "Downside" of Indexing

For every index you create, the database has more work to do during DML operations (INSERT, UPDATE, DELETE).

  1. Write Overhead: Every time you add a row, the database must also update the index "map."

  2. Storage: Large indexes can sometimes take up as much space as the table itself.

  3. Maintenance: Over time, indexes can become fragmented and may need to be "Reindexed" to stay fast.


Indexing Anti-patterns

  • Indexing every column: This is the most common mistake. It kills write performance and bloats your database.

  • Indexing low-cardinality columns: Using a B-Tree index on a Gender column (where 50% of rows are the same) is usually slower than just scanning the table.

  • Redundant Indexes: If you have a composite index on (city, state), you do not need a separate index on just city. The database can use the first part of the composite index for that.


Summary Table: Index vs. No Index

Query Type

Without Index

With Index

SELECT *

No difference

No difference

WHERE id = 500

Full Table Scan (Slow)

Index Seek (Instant)

INSERT INTO...

Fast

Slower (Index must update)

ORDER BY date

Manual Sort in RAM

Pre-sorted via Index



Last updated