Materialised views and partitioning — SQL Advanced Module 7

Materialised views and partitioning are the two main tools for managing tables that have outgrown their query budget. MVs trade staleness for read speed. Partitioning trades schema complexity for I/O scalability and lifecycle ease.

Materialised views: when they pay off

A materialised view (MV) is a query whose result is stored as data on disk. Subsequent reads serve from the stored data; the underlying query doesn't re-execute. When the upstream tables change, the MV is stale until REFRESH'd.

Read pattern is hot (queried many times) and write pattern on underlying tables is moderate (refresh cost is amortised)
Underlying query is expensive (multi-join with aggregation, window functions, full table scans)
Consumers tolerate the staleness window — typically minutes to hours
You DON'T have a unique-key dependency that makes the MV inappropriate as a source for further joins (most engines allow indexes on MVs, which fixes this)

Refresh strategies

Manual REFRESH — you call REFRESH MATERIALIZED VIEW name; whenever needed. Right for batch / ETL contexts
Scheduled — cron-style refresh every N minutes. Right for dashboards where 'a few minutes stale' is fine
REFRESH CONCURRENTLY (Postgres) — refresh without blocking reads. Requires a unique index on the MV. Slower but doesn't lock readers out
Incremental refresh — only re-compute the rows that changed. Postgres lacks this natively; Materialize, ClickHouse, Snowflake (dynamic tables) all have varying levels of support

sql

-- Build a materialised daily revenue summary
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT DATE(order_date) AS day,
  country,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM orders
JOIN customers USING (customer_id)
GROUP BY DATE(order_date), country;

-- Unique index enables REFRESH CONCURRENTLY
CREATE UNIQUE INDEX idx_daily_revenue_pk ON daily_revenue (day, country);

-- Non-blocking refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

Partitioning: range / list / hash

Partitioning splits one logical table into multiple physical tables based on a partition key. Queries that filter on the key get 'partition pruning' — the engine touches only the relevant partitions.

Range partitioning — by a date or numeric range. Most common pattern: events_2025_01, events_2025_02, etc.
List partitioning — by a discrete value. e.g., orders_kenya, orders_uganda, orders_tanzania
Hash partitioning — by hash of a column. Right for even distribution across N partitions when there's no natural range or list

sql

-- Postgres range partitioning by date
CREATE TABLE events (
  id BIGSERIAL,
  event_date DATE NOT NULL,
  user_id BIGINT,
  payload JSONB
) PARTITION BY RANGE (event_date);

CREATE TABLE events_2025_01 PARTITION OF events FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- … one partition per month

-- Query with partition pruning — only touches Feb partition
SELECT COUNT(*) FROM events WHERE event_date BETWEEN '2025-02-05' AND '2025-02-12';

Why partition at all

Query speed — pruning means scanning O(weeks) rows instead of O(years)
Maintenance speed — DROP TABLE events_2024_01 is O(1) and instant; DELETE FROM events WHERE event_date < … is O(N) and locks rows
Index size — indexes are per-partition. Smaller indexes fit better in memory; each per-partition index is independently tuneable
Backup/restore granularity — partitions can be archived to colder storage, restored independently

Partitioning isn't free

The schema complexity is real. Cross-partition queries are slower (no pruning), unique constraints across all partitions require the partition key in the constraint, and the operational overhead of partition creation/drop usually needs automation (pg_partman, Postgres native auto-partitioning, or a cron job). Don't partition tables under ~100M rows; the maintenance cost outweighs the gain.

Exercise

Your events table is 400M rows and growing 10M/month. Queries are 90% 'last 7 days' filters, 10% historical analytics. Reads peak at 200 QPS during business hours. (1) Decide between (a) just adding more indexes, (b) building a materialised view of the last 7 days, (c) partitioning by event_date. (2) Justify. (3) What's the operational discipline that has to come with your choice?