Real data is messy. Half of an analyst's SQL is shaping data into the form a downstream chart, model, or report can use. This module covers the four reshaping problems you'll meet weekly.
Pivot — long to wide
Long format: one row per (entity, attribute, value). Wide format: one row per entity, columns per attribute. Most BI tools want wide; most warehouses store long. The bridge is CASE-based aggregation.
-- Long input: orders(order_id, status, count)-- Want wide: order_id, pending, shipped, delivered, cancelledSELECT order_id,MAX(CASE WHEN status = 'pending' THEN count END) AS pending,MAX(CASE WHEN status = 'shipped' THEN count END) AS shipped,MAX(CASE WHEN status = 'delivered' THEN count END) AS delivered,MAX(CASE WHEN status = 'cancelled' THEN count END) AS cancelledFROM ordersGROUP BY order_id;
PIVOT clause where supported
Snowflake, DuckDB, BigQuery, and SQL Server have a native PIVOT. Postgres doesn't (use the CASE form or crosstab() from tablefunc). The CASE form works everywhere and is more flexible — it lets you compute different aggregates per pivoted column.
Unpivot — wide to long
Less common but happens — e.g., when ingesting a CSV with monthly columns. UNION ALL is the universal form:
-- Input: revenue(customer_id, jan, feb, mar, apr)-- Want long: customer_id, month, amountSELECT customer_id, 'jan' AS month, jan AS amount FROM revenue UNION ALLSELECT customer_id, 'feb', feb FROM revenue UNION ALLSELECT customer_id, 'mar', mar FROM revenue UNION ALLSELECT customer_id, 'apr', apr FROM revenue;
Postgres has UNNEST + arrays for a cleaner version; SQL Server has UNPIVOT; Snowflake / DuckDB have UNPIVOT too. UNION ALL is the portable fallback.
De-duplication
A dataset has multiple rows per logical entity and you want one. ROW_NUMBER + a tiebreaker is the standard pattern:
-- Keep the most recent row per customerWITH ranked AS (SELECT *,ROW_NUMBER() OVER (PARTITION BY customer_idORDER BY updated_at DESC, id DESC -- updated_at first, id as tiebreaker) AS rnFROM customers_raw)SELECT * FROM ranked WHERE rn = 1;
Snowflake / DuckDB / BigQuery / Trino let you skip the CTE with QUALIFY: SELECT * FROM customers_raw QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1.
Slowly-changing dimensions (SCD type 2)
SCD2 dimensions store history: each row has a valid_from and valid_to. The challenge is joining a fact event to the version of the dimension that was current at the event time.
-- customer_dim(customer_id, name, plan, valid_from, valid_to)-- orders(order_id, customer_id, order_date, amount)-- We want each order joined to the customer's plan AT THE TIME of the order.SELECT o.order_id, o.amount, d.name, d.planFROM orders oJOIN customer_dim dON d.customer_id = o.customer_idAND o.order_date >= d.valid_fromAND o.order_date < COALESCE(d.valid_to, '9999-12-31');
The < not ≤ matters
valid_to is exclusive (the new row's valid_from = old row's valid_to). If you use <= on valid_to, you'll match an event to two rows on the changeover date. Use < (or check your warehouse's SCD convention; some use inclusive).
Exercise
Marketing has handed you a CSV with one row per company, one column per quarter (q1, q2, q3, q4) showing pipeline value. You need to (a) unpivot it to long form, (b) compute QoQ growth per company, (c) de-dup if the CSV has multiple rows per company (keep the latest by an as_of_date column). Write the full query.