Indexing strategy — SQL Advanced Module 3 | LeadAfrik Public Economics Hub

Indexes are the single most-leveraged tool in performance work. They also have real costs — write amplification, storage, planner overhead. An advanced practitioner picks indexes deliberately, not defensively.

B-tree: the default

B-trees serve equality, range, ORDER BY, and prefix matches. 95% of indexes you'll create are B-tree. Postgres / MySQL / SQL Server all default to B-tree when you write CREATE INDEX without a type.

Composite indexes: leading-column rules

sql

-- Composite index on (customer_id, order_date, status)
CREATE INDEX idx_orders_cd_od_s ON orders (customer_id, order_date, status);

Serves filters on customer_id alone
Serves filters on customer_id + order_date (any range)
Serves filters on customer_id + order_date + status
Does NOT efficiently serve filters on order_date alone, status alone, or order_date + status without customer_id

The leading-column rule

A composite index is usable from the leftmost column outward. Skipping a leading column means the index can't be used for that filter. Always put the column with the most selective equality predicates first.

Partial indexes — small and fast

sql

-- Only ~0.5% of orders are pending; full index on status is wasteful.
CREATE INDEX idx_pending_orders ON orders (created_at) WHERE status = 'pending';

A partial index covers only the rows matching the WHERE clause. Two conditions must both be true: (a) the query always includes the predicate, and (b) the predicate is selective. The index is much smaller, fits more of itself in cache, and is faster to read.

Common partial-index patterns: WHERE deleted_at IS NULL for soft-delete tables, WHERE status IN ('pending', 'processing') for queue-like tables, WHERE active = true for any 'most rows inactive' pattern.

Covering indexes (INCLUDE)

sql

-- Index for both filter and SELECT columns → index-only scan
CREATE INDEX idx_orders_cover ON orders (customer_id) INCLUDE (order_date, amount);

A covering index includes both the filter columns (in the key) and the SELECT columns (via INCLUDE). The engine can answer the entire query from the index alone — no heap fetches. This is the fastest possible read pattern.

When INCLUDE pays off

Use INCLUDE for hot read paths where the query is well-known: dashboards, top-N lookups, foreign-key checks. Don't over-include — each extra column bloats the index and slows writes.

GIN, GiST, BRIN

GIN — for arrays, JSONB, full-text search. 'Inverted' index: maps element → row. Bigger than B-tree but matches efficiently
GiST — for geometric data, ranges, custom types. Generalised tree structure
BRIN — block range index. Tiny; good for very large tables where physical order correlates with the indexed column (time-series fact tables). Sub-1% the size of B-tree

The write-amplification cost

Every index is a tax on writes. Every INSERT, UPDATE that touches an indexed column, and DELETE updates the indexes. On a write-heavy table with 10 indexes, an INSERT does 11 writes. The tradeoff is: read perf gained vs write perf lost.

Indexes you don't use are pure cost

Check pg_stat_user_indexes (Postgres) or sys.dm_db_index_usage_stats (SQL Server) for indexes with zero scans over the last 30 days. They're slowing writes and consuming RAM/disk for no benefit. Drop them.

Exercise

Your orders table has 50M rows and the following queries are common. Design the minimal set of indexes that serves them well. (a) SELECT * FROM orders WHERE customer_id = ? AND order_date > ? (b) SELECT * FROM orders WHERE status = 'pending' (c) SELECT customer_id, SUM(amount) FROM orders WHERE order_date BETWEEN ? AND ? GROUP BY customer_id (d) SELECT * FROM orders WHERE order_id = ? Assume order_id is already the primary key.