Skip to content
Module 11 of 1290 min readBeginner

Mini-projects on a real dataset

Three guided analyses on a public dataset — cohort retention, revenue trends, and an outlier hunt.

92%

Listen along

Read “Mini-projects on a real dataset” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Three guided analyses on a public dataset. Each one is a working query you can copy and adapt.

Dataset: Olist — Brazilian e-commerce orders (free, on Kaggle)

100k orders, 90k customers, 30k products, 8 related tables. Real data with real messiness — perfect for these exercises. Download from kaggle.com/olistbr/brazilian-ecommerce, load into a DuckDB or Postgres instance.

Project 1: Cohort retention

What percent of customers from each signup month are still placing orders 1, 2, 3, 6 months later?

sql
WITH customer_cohort AS (
SELECT customer_id, date_trunc('month', MIN(created_at)) AS cohort_month
FROM orders
GROUP BY customer_id
),
order_months AS (
SELECT o.customer_id, c.cohort_month,
date_trunc('month', o.created_at) AS order_month
FROM orders o
JOIN customer_cohort c ON c.customer_id = o.customer_id
)
SELECT
cohort_month,
COUNT(DISTINCT CASE WHEN order_month = cohort_month THEN customer_id END) AS m0,
COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '1 month' THEN customer_id END) AS m1,
COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '3 months' THEN customer_id END) AS m3,
COUNT(DISTINCT CASE WHEN order_month = cohort_month + INTERVAL '6 months' THEN customer_id END) AS m6
FROM order_months
GROUP BY cohort_month
ORDER BY cohort_month;

Read the rows: m0 is the cohort size, m1 is who came back the next month, m3 three months later. Divide each by m0 to get retention percentages.

Project 2: Revenue trends with a moving average

sql
WITH daily AS (
SELECT date_trunc('day', created_at) AS day, SUM(total_amount) AS revenue
FROM orders
GROUP BY 1
)
SELECT
day,
revenue,
AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d,
AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30d
FROM daily
ORDER BY day;

Project 3: The outlier hunt

Find orders whose total is more than 3 standard deviations above the customer's own mean order. These are anomalies worth investigating.

sql
WITH stats AS (
SELECT customer_id,
AVG(total_amount) AS mu,
STDDEV(total_amount) AS sd
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 5
)
SELECT o.id, o.customer_id, o.total_amount, s.mu, s.sd,
(o.total_amount - s.mu) / NULLIF(s.sd, 0) AS z_score
FROM orders o
JOIN stats s ON s.customer_id = o.customer_id
WHERE (o.total_amount - s.mu) / NULLIF(s.sd, 0) > 3
ORDER BY z_score DESC;

Why this last one is real work

Outlier detection is core analyst work. The same template — compute per-group stats in a CTE, then flag rows that exceed a threshold — applies to fraud detection, quality control, and weird-data hunts. You now know the shape.

Exercise

You inherit an analyst's SQL queries on a Kenyan bank's customer-product database. You need to answer: 'Among customers who opened a current account in 2024, what percentage subsequently took out a loan within 90 days, and what was the average loan amount among those who did?' The relevant tables are customers, accounts, and loans, each with customer_id, opened_date, and amount/balance fields. Write the query, then identify two pitfalls a junior is likely to fall into.

Loading progress…
LeadAfrikPublic Economics Hub