Skip to content
Module 04 of 860 min readIntermediate

Cohort analysis and retention curves

Build cohort tables, compute retention, recognise when cohort isn't the right unit.

50%

Listen along

Read “Cohort analysis and retention curves” aloud

Plays in your browser using on-device text-to-speech — nothing leaves the page.

Learning objectives

By the end of this module, you should be able to:

  • 01Build a cohort table — entities grouped by entry date
  • 02Compute retention curves: Day 1 / Day 7 / Day 30 retention by cohort
  • 03Recognise when the cohort is the wrong unit of analysis

A cohort is a group of entities that share an entry event (signup, first purchase, loan disbursal). Cohort analysis tracks how those entities behave through time relative to that entry — the most common form is retention.

The cohort table

Step 1: assign each entity to a cohort. Usually the cohort is the month or week of their first event.

sql
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM activity
GROUP BY user_id
)
SELECT cohort_month, COUNT(*) AS cohort_size
FROM cohorts
GROUP BY cohort_month
ORDER BY cohort_month;

The retention curve

Step 2: for each cohort and each elapsed period, count how many users are still active.

sql
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM activity
GROUP BY user_id
),
activity_with_cohort AS (
SELECT a.user_id,
c.cohort_month,
DATE_TRUNC('month', a.activity_date) AS active_month
FROM activity a
JOIN cohorts c USING (user_id)
)
SELECT cohort_month,
active_month,
COUNT(DISTINCT user_id) AS active_users
FROM activity_with_cohort
GROUP BY cohort_month, active_month
ORDER BY cohort_month, active_month;

Step 3: pivot to wide form so the cohort triangle becomes readable. CASE-based pivots work in every dialect; some engines (Postgres with tablefunc, Snowflake, DuckDB) have a PIVOT clause.

sql
WITH base AS (
/* the GROUP BY query from Step 2, plus a 'months_since_join' calc */
SELECT cohort_month,
active_month,
EXTRACT(YEAR FROM AGE(active_month, cohort_month)) * 12 +
EXTRACT(MONTH FROM AGE(active_month, cohort_month)) AS months_since,
COUNT(DISTINCT user_id) AS active_users
FROM activity_with_cohort
GROUP BY cohort_month, active_month
)
SELECT cohort_month,
MAX(CASE WHEN months_since = 0 THEN active_users END) AS m0,
MAX(CASE WHEN months_since = 1 THEN active_users END) AS m1,
MAX(CASE WHEN months_since = 2 THEN active_users END) AS m2,
MAX(CASE WHEN months_since = 3 THEN active_users END) AS m3
FROM base
GROUP BY cohort_month
ORDER BY cohort_month;

Retention rate = activeₙ / active₀

The cohort-size column (m0) is the denominator. Divide every other column by m0 to convert raw counts to retention rates. Some teams display both — counts in the chart hover, rates in the headline number.

When the cohort is the wrong unit

Cohort-by-signup-date is the default, but not always the right one. Consider:

  • Cohort-by-channel — when you're asking 'is the new acquisition channel retaining as well as organic?'
  • Cohort-by-feature-adoption — when you're asking 'do users who hit feature X in their first week retain better?' (Activation cohorts.)
  • Cohort-by-experiment-arm — A/B test cohorts. The cohort grouping is the experiment variant.
  • No cohort at all — if you're asking 'what's our retention this month?', a rolling-window query is the right shape, not a cohort triangle.

Exercise

Your CFO wants to know whether the loans your microfinance arm disbursed in Q1 2025 are being repaid on schedule, broken down by loan officer. (1) What's the cohort? (2) What's the retention-equivalent metric? (3) Write the query.

Key takeaways

  • Step 1: assign each entity a cohort. Step 2: join to events and group by cohort + period. Step 3: pivot to wide.
  • Retention rate = activeₙ / active₀ — divide every column by the cohort-size column
  • The cohort doesn't have to be signup date — channel, feature use, or experiment arm are all valid cohort definitions
Loading progress…
LeadAfrikPublic Economics Hub