Cleaning and reshaping — SQL Intermediate Module 7

Real data is messy. Half of an analyst's SQL is shaping data into the form a downstream chart, model, or report can use. This module covers the four reshaping problems you'll meet weekly.

Pivot — long to wide

Long format: one row per (entity, attribute, value). Wide format: one row per entity, columns per attribute. Most BI tools want wide; most warehouses store long. The bridge is CASE-based aggregation.

sql

-- Long input: orders(order_id, status, count)
-- Want wide: order_id, pending, shipped, delivered, cancelled
SELECT order_id,
  MAX(CASE WHEN status = 'pending'   THEN count END) AS pending,
  MAX(CASE WHEN status = 'shipped'   THEN count END) AS shipped,
  MAX(CASE WHEN status = 'delivered' THEN count END) AS delivered,
  MAX(CASE WHEN status = 'cancelled' THEN count END) AS cancelled
FROM orders
GROUP BY order_id;

PIVOT clause where supported

Snowflake, DuckDB, BigQuery, and SQL Server have a native PIVOT. Postgres doesn't (use the CASE form or crosstab() from tablefunc). The CASE form works everywhere and is more flexible — it lets you compute different aggregates per pivoted column.

Unpivot — wide to long

Less common but happens — e.g., when ingesting a CSV with monthly columns. UNION ALL is the universal form:

sql

-- Input: revenue(customer_id, jan, feb, mar, apr)
-- Want long: customer_id, month, amount
SELECT customer_id, 'jan' AS month, jan AS amount FROM revenue UNION ALL
SELECT customer_id, 'feb',           feb               FROM revenue UNION ALL
SELECT customer_id, 'mar',           mar               FROM revenue UNION ALL
SELECT customer_id, 'apr',           apr               FROM revenue;

Postgres has UNNEST + arrays for a cleaner version; SQL Server has UNPIVOT; Snowflake / DuckDB have UNPIVOT too. UNION ALL is the portable fallback.

De-duplication

A dataset has multiple rows per logical entity and you want one. ROW_NUMBER + a tiebreaker is the standard pattern:

sql

-- Keep the most recent row per customer
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC, id DESC  -- updated_at first, id as tiebreaker
    ) AS rn
  FROM customers_raw
)
SELECT * FROM ranked WHERE rn = 1;

Snowflake / DuckDB / BigQuery / Trino let you skip the CTE with QUALIFY: SELECT * FROM customers_raw QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1.

Slowly-changing dimensions (SCD type 2)

SCD2 dimensions store history: each row has a valid_from and valid_to. The challenge is joining a fact event to the version of the dimension that was current at the event time.

sql

-- customer_dim(customer_id, name, plan, valid_from, valid_to)
-- orders(order_id, customer_id, order_date, amount)
-- We want each order joined to the customer's plan AT THE TIME of the order.
SELECT o.order_id, o.amount, d.name, d.plan
FROM orders o
JOIN customer_dim d
  ON d.customer_id = o.customer_id
  AND o.order_date >= d.valid_from
  AND o.order_date <  COALESCE(d.valid_to, '9999-12-31');

The < not ≤ matters

valid_to is exclusive (the new row's valid_from = old row's valid_to). If you use <= on valid_to, you'll match an event to two rows on the changeover date. Use < (or check your warehouse's SCD convention; some use inclusive).

Exercise

Marketing has handed you a CSV with one row per company, one column per quarter (q1, q2, q3, q4) showing pipeline value. You need to (a) unpivot it to long form, (b) compute QoQ growth per company, (c) de-dup if the CSV has multiple rows per company (keep the latest by an as_of_date column). Write the full query.