Three guided analyses on a public dataset. Each one is a working query you can copy and adapt.
Dataset: Olist — Brazilian e-commerce orders (free, on Kaggle)
100k orders, 90k customers, 30k products, 8 related tables. Real data with real messiness — perfect for these exercises. Download from kaggle.com/olistbr/brazilian-ecommerce, load into a DuckDB or Postgres instance.
Project 1: Cohort retention
What percent of customers from each signup month are still placing orders 1, 2, 3, 6 months later?
WITH customer_cohort AS (SELECT customer_id, date_trunc('month', MIN(created_at)) AS cohort_monthFROM ordersGROUP BY customer_id),order_months AS (SELECT o.customer_id, c.cohort_month,date_trunc('month', o.created_at) AS order_monthFROM orders oJOIN customer_cohort c ON c.customer_id = o.customer_id)SELECTcohort_month,COUNT(DISTINCT CASE WHEN order_month = cohort_month THEN customer_id END) AS m0,COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '1 month' THEN customer_id END) AS m1,COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '3 months' THEN customer_id END) AS m3,COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '6 months' THEN customer_id END) AS m6FROM order_monthsGROUP BY cohort_monthORDER BY cohort_month;
Read the rows: m0 is the cohort size, m1 is who came back the next month, m3 three months later. Divide each by m0 to get retention percentages.
Project 2: Revenue trends with a moving average
WITH daily AS (SELECT date_trunc('day', created_at) AS day, SUM(total_amount) AS revenueFROM ordersGROUP BY 1)SELECTday,revenue,AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d,AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30dFROM dailyORDER BY day;
Project 3: The outlier hunt
Find orders whose total is more than 3 standard deviations above the customer's own mean order. These are anomalies worth investigating.
WITH stats AS (SELECT customer_id,AVG(total_amount) AS mu,STDDEV(total_amount) AS sdFROM ordersGROUP BY customer_idHAVING COUNT(*) >= 5)SELECT o.id, o.customer_id, o.total_amount, s.mu, s.sd,(o.total_amount - s.mu) / NULLIF(s.sd, 0) AS z_scoreFROM orders oJOIN stats s ON s.customer_id = o.customer_idWHERE (o.total_amount - s.mu) / NULLIF(s.sd, 0) > 3ORDER BY z_score DESC;
Why this last one is real work
Outlier detection is core analyst work. The same template — compute per-group stats in a CTE, then flag rows that exceed a threshold — applies to fraud detection, quality control, and weird-data hunts. You now know the shape.
Exercise
You inherit an analyst's SQL queries on a Kenyan bank's customer-product database. You need to answer: 'Among customers who opened a current account in 2024, what percentage subsequently took out a loan within 90 days, and what was the average loan amount among those who did?' The relevant tables are customers, accounts, and loans, each with customer_id, opened_date, and amount/balance fields. Write the query, then identify two pitfalls a junior is likely to fall into.