Cohort analysis and retention curves — SQL Intermediate Module 4

A cohort is a group of entities that share an entry event (signup, first purchase, loan disbursal). Cohort analysis tracks how those entities behave through time relative to that entry — the most common form is retention.

The cohort table

Step 1: assign each entity to a cohort. Usually the cohort is the month or week of their first event.

sql

WITH cohorts AS (
  SELECT user_id,
    DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
  FROM activity
  GROUP BY user_id
)
SELECT cohort_month, COUNT(*) AS cohort_size
FROM cohorts
GROUP BY cohort_month
ORDER BY cohort_month;

The retention curve

Step 2: for each cohort and each elapsed period, count how many users are still active.

sql

WITH cohorts AS (
  SELECT user_id,
    DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
  FROM activity
  GROUP BY user_id
),
activity_with_cohort AS (
  SELECT a.user_id,
    c.cohort_month,
    DATE_TRUNC('month', a.activity_date) AS active_month
  FROM activity a
  JOIN cohorts c USING (user_id)
)
SELECT cohort_month,
  active_month,
  COUNT(DISTINCT user_id) AS active_users
FROM activity_with_cohort
GROUP BY cohort_month, active_month
ORDER BY cohort_month, active_month;

Step 3: pivot to wide form so the cohort triangle becomes readable. CASE-based pivots work in every dialect; some engines (Postgres with tablefunc, Snowflake, DuckDB) have a PIVOT clause.

sql

WITH base AS (
  /* the GROUP BY query from Step 2, plus a 'months_since_join' calc */
  SELECT cohort_month,
    active_month,
    EXTRACT(YEAR FROM AGE(active_month, cohort_month)) * 12 +
    EXTRACT(MONTH FROM AGE(active_month, cohort_month)) AS months_since,
    COUNT(DISTINCT user_id) AS active_users
  FROM activity_with_cohort
  GROUP BY cohort_month, active_month
)
SELECT cohort_month,
  MAX(CASE WHEN months_since = 0 THEN active_users END) AS m0,
  MAX(CASE WHEN months_since = 1 THEN active_users END) AS m1,
  MAX(CASE WHEN months_since = 2 THEN active_users END) AS m2,
  MAX(CASE WHEN months_since = 3 THEN active_users END) AS m3
FROM base
GROUP BY cohort_month
ORDER BY cohort_month;

Retention rate = activeₙ / active₀

The cohort-size column (m0) is the denominator. Divide every other column by m0 to convert raw counts to retention rates. Some teams display both — counts in the chart hover, rates in the headline number.

When the cohort is the wrong unit

Cohort-by-signup-date is the default, but not always the right one. Consider:

Cohort-by-channel — when you're asking 'is the new acquisition channel retaining as well as organic?'
Cohort-by-feature-adoption — when you're asking 'do users who hit feature X in their first week retain better?' (Activation cohorts.)
Cohort-by-experiment-arm — A/B test cohorts. The cohort grouping is the experiment variant.
No cohort at all — if you're asking 'what's our retention this month?', a rolling-window query is the right shape, not a cohort triangle.

Exercise

Your CFO wants to know whether the loans your microfinance arm disbursed in Q1 2025 are being repaid on schedule, broken down by loan officer. (1) What's the cohort? (2) What's the retention-equivalent metric? (3) Write the query.