How a SQL engine executes a query — SQL Advanced Module 1

Every SQL engine is doing the same three things behind your SELECT: parsing the text into a tree, planning the cheapest way to execute it, and running the plan. Advanced SQL starts when you stop treating that as a black box.

The parse → plan → execute lifecycle

Parse — the SQL text is tokenised and built into a syntax tree. Errors at this stage are syntax errors.
Rewrite — view definitions are expanded, simple algebraic simplifications are applied.
Plan — the optimiser enumerates candidate plans (different join orders, scan types, aggregation strategies) and assigns each a cost estimate based on table statistics. The cheapest plan wins.
Execute — the chosen plan runs. The engine streams rows through the operators (scan, join, sort, aggregate) and produces the result.

Cost-based optimisation

Modern engines (Postgres, MySQL, Snowflake, BigQuery, Trino) are cost-based — they compare candidate plans using estimated cost. Cost is a unit-less number combining I/O cost, CPU cost, and memory cost. The plan with the lowest cost is chosen.

Cost estimates depend on table statistics: row counts, column histograms, NDV (number of distinct values), null fractions, correlation between physical row order and column values. The planner reads these from pg_statistic (Postgres), information_schema.tables (MySQL), or equivalent catalogue tables.

Stale statistics = bad plans

If statistics say a table has 1,000 rows but it actually has 10M, the planner will think a sequential scan is cheap and skip the index. Symptom: a query that 'suddenly got slow' after a big data load. Fix: ANALYZE the table (Postgres) or ANALYZE TABLE (MySQL). Auto-vacuum/auto-analyze handles this in most cases, but heavy bulk loads can outrun it.

What the optimiser actually decides

Scan type per table: sequential scan, index scan, index-only scan, bitmap heap scan
Join order: A JOIN B JOIN C vs A JOIN (B JOIN C) etc. Theoretical N! combinations; the planner uses heuristics + dynamic programming
Join algorithm per pair: nested loop, hash join, merge join
Aggregation strategy: hash aggregate vs sort-then-aggregate (GroupAggregate)
Whether to use parallel workers (most modern engines)
Whether intermediate results spill to disk (sort, hash, aggregate spills)

The 'plan you got vs the plan you wanted' problem

When a query is slow, the question is rarely 'should I add more memory?' — it's 'why did the planner make a bad choice?'. The five usual suspects:

Stale statistics (most common) — ANALYZE
Skewed data — one value covers 90% of rows; histogram-based planning gets it wrong. Fix: extended statistics, query hints in some engines, or rewrite
Missing index — sequential scan when you wanted index lookup
Suboptimal join order — usually surfaces with a 10+ join query; planner heuristics give up. Fix: rewrite to influence order, or set join_collapse_limit (Postgres)
Parameter sniffing (SQL Server) / generic plans (Postgres) — a prepared statement gets a plan optimised for one parameter that's wrong for another. Fix: force a re-plan or use literal values

Exercise

A query that was fast yesterday is now timing out today. Nothing about the query changed. Walk through the diagnostic sequence — what do you check first, second, third? Why in that order?