Skip to content
Module 07 of 850 min readIntermediate

Cleaning and reshaping

Pivot/unpivot, deduplication with ROW_NUMBER, slowly-changing dimensions (SCD2).

88%

Listen along

Read “Cleaning and reshaping” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Learning objectives

By the end of this module, you should be able to:

  • 01Pivot long → wide with CASE-based aggregation
  • 02Unpivot wide → long with UNION ALL or the LATERAL trick
  • 03De-duplicate using ROW_NUMBER + QUALIFY (or a self-filter)
  • 04Handle slowly-changing dimensions (SCD type 2) in a query

Real data is messy. Half of an analyst's SQL is shaping data into the form a downstream chart, model, or report can use. This module covers the four reshaping problems you'll meet weekly.

Pivot — long to wide

Long format: one row per (entity, attribute, value). Wide format: one row per entity, columns per attribute. Most BI tools want wide; most warehouses store long. The bridge is CASE-based aggregation.

sql
-- Long input: orders(order_id, status, count)
-- Want wide: order_id, pending, shipped, delivered, cancelled
SELECT order_id,
MAX(CASE WHEN status = 'pending' THEN count END) AS pending,
MAX(CASE WHEN status = 'shipped' THEN count END) AS shipped,
MAX(CASE WHEN status = 'delivered' THEN count END) AS delivered,
MAX(CASE WHEN status = 'cancelled' THEN count END) AS cancelled
FROM orders
GROUP BY order_id;

PIVOT clause where supported

Snowflake, DuckDB, BigQuery, and SQL Server have a native PIVOT. Postgres doesn't (use the CASE form or crosstab() from tablefunc). The CASE form works everywhere and is more flexible — it lets you compute different aggregates per pivoted column.

Unpivot — wide to long

Less common but happens — e.g., when ingesting a CSV with monthly columns. UNION ALL is the universal form:

sql
-- Input: revenue(customer_id, jan, feb, mar, apr)
-- Want long: customer_id, month, amount
SELECT customer_id, 'jan' AS month, jan AS amount FROM revenue UNION ALL
SELECT customer_id, 'feb', feb FROM revenue UNION ALL
SELECT customer_id, 'mar', mar FROM revenue UNION ALL
SELECT customer_id, 'apr', apr FROM revenue;

Postgres has UNNEST + arrays for a cleaner version; SQL Server has UNPIVOT; Snowflake / DuckDB have UNPIVOT too. UNION ALL is the portable fallback.

De-duplication

A dataset has multiple rows per logical entity and you want one. ROW_NUMBER + a tiebreaker is the standard pattern:

sql
-- Keep the most recent row per customer
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC, id DESC -- updated_at first, id as tiebreaker
) AS rn
FROM customers_raw
)
SELECT * FROM ranked WHERE rn = 1;

Snowflake / DuckDB / BigQuery / Trino let you skip the CTE with QUALIFY: SELECT * FROM customers_raw QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1.

Slowly-changing dimensions (SCD type 2)

SCD2 dimensions store history: each row has a valid_from and valid_to. The challenge is joining a fact event to the version of the dimension that was current at the event time.

sql
-- customer_dim(customer_id, name, plan, valid_from, valid_to)
-- orders(order_id, customer_id, order_date, amount)
-- We want each order joined to the customer's plan AT THE TIME of the order.
SELECT o.order_id, o.amount, d.name, d.plan
FROM orders o
JOIN customer_dim d
ON d.customer_id = o.customer_id
AND o.order_date >= d.valid_from
AND o.order_date < COALESCE(d.valid_to, '9999-12-31');

The < not ≤ matters

valid_to is exclusive (the new row's valid_from = old row's valid_to). If you use <= on valid_to, you'll match an event to two rows on the changeover date. Use < (or check your warehouse's SCD convention; some use inclusive).

Exercise

Marketing has handed you a CSV with one row per company, one column per quarter (q1, q2, q3, q4) showing pipeline value. You need to (a) unpivot it to long form, (b) compute QoQ growth per company, (c) de-dup if the CSV has multiple rows per company (keep the latest by an as_of_date column). Write the full query.

Key takeaways

  • Pivot: MAX/SUM(CASE WHEN attribute = 'x' THEN value END) per group
  • Unpivot: UNION ALL one branch per column (or native UNPIVOT where supported)
  • Dedup: ROW_NUMBER() OVER (PARTITION BY entity ORDER BY freshness) = 1, with a tiebreaker
  • SCD2 join: ON event_date >= valid_from AND event_date < valid_to — exclusive upper bound
Loading progress…
LeadAfrikPublic Economics Hub