Mini-projects on a real dataset — SQL Module 11 | LeadAfrik Public Economics Hub

Three guided analyses on a public dataset. Each one is a working query you can copy and adapt.

Dataset: Olist — Brazilian e-commerce orders (free, on Kaggle)

100k orders, 90k customers, 30k products, 8 related tables. Real data with real messiness — perfect for these exercises. Download from kaggle.com/olistbr/brazilian-ecommerce, load into a DuckDB or Postgres instance.

Project 1: Cohort retention

What percent of customers from each signup month are still placing orders 1, 2, 3, 6 months later?

sql

WITH customer_cohort AS (
  SELECT customer_id, date_trunc('month', MIN(created_at)) AS cohort_month
  FROM orders
  GROUP BY customer_id
),
order_months AS (
  SELECT o.customer_id, c.cohort_month,
         date_trunc('month', o.created_at) AS order_month
  FROM orders o
  JOIN customer_cohort c ON c.customer_id = o.customer_id
)
SELECT 
  cohort_month,
  COUNT(DISTINCT CASE WHEN order_month = cohort_month THEN customer_id END) AS m0,
  COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '1 month' THEN customer_id END) AS m1,
  COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '3 months' THEN customer_id END) AS m3,
  COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '6 months' THEN customer_id END) AS m6
FROM order_months
GROUP BY cohort_month
ORDER BY cohort_month;

Read the rows: m0 is the cohort size, m1 is who came back the next month, m3 three months later. Divide each by m0 to get retention percentages.

Project 2: Revenue trends with a moving average

sql

WITH daily AS (
  SELECT date_trunc('day', created_at) AS day, SUM(total_amount) AS revenue
  FROM orders
  GROUP BY 1
)
SELECT 
  day,
  revenue,
  AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d,
  AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30d
FROM daily
ORDER BY day;

Project 3: The outlier hunt

Find orders whose total is more than 3 standard deviations above the customer's own mean order. These are anomalies worth investigating.

sql

WITH stats AS (
  SELECT customer_id,
         AVG(total_amount) AS mu,
         STDDEV(total_amount) AS sd
  FROM orders
  GROUP BY customer_id
  HAVING COUNT(*) >= 5
)
SELECT o.id, o.customer_id, o.total_amount, s.mu, s.sd,
       (o.total_amount - s.mu) / NULLIF(s.sd, 0) AS z_score
FROM orders o
JOIN stats s ON s.customer_id = o.customer_id
WHERE (o.total_amount - s.mu) / NULLIF(s.sd, 0) > 3
ORDER BY z_score DESC;

Why this last one is real work

Outlier detection is core analyst work. The same template — compute per-group stats in a CTE, then flag rows that exceed a threshold — applies to fraud detection, quality control, and weird-data hunts. You now know the shape.

Exercise

You inherit an analyst's SQL queries on a Kenyan bank's customer-product database. You need to answer: 'Among customers who opened a current account in 2024, what percentage subsequently took out a loan within 90 days, and what was the average loan amount among those who did?' The relevant tables are customers, accounts, and loans, each with customer_id, opened_date, and amount/balance fields. Write the query, then identify two pitfalls a junior is likely to fall into.