# Monthly Cohort Retention

> Compute month over month retention rates for user signup cohorts.

Canonical URL: <https://datadriven.io/problems/monthly_cohort_retention>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Build a cohort retention analysis. Users are grouped by the month they signed up. For each signup cohort, calculate what fraction of users were active in each subsequent month. A user is active in a month if they have at least one session during that month. Return the cohort month, the months since signup, and the retention rate.

## Worked solution and explanation

### Why this problem exists in real interviews

Cohort retention is the canonical growth metric: it answers whether new signups stick. Interviewers love this prompt because it forces the candidate to do three things at once: bucket users by their signup month, derive months since signup from two arbitrary date strings, and divide active counts by cohort size without dropping the denominator. The day count divided by 30 trick is a deliberate simplification to avoid SQLite gymnastics.

---

### Break down the requirements

#### Step 1: Build cohorts and activity months

substr(signup_date, 1, 7) gives YYYY-MM. The cohorts CTE attaches each user_id to a cohort_month. activity uses DISTINCT user_id and substr(session_start, 1, 7) so a user with 50 sessions in March counts once for March.

#### Step 2: Pre compute cohort sizes

cohort_sizes is a separate CTE because the denominator must be the full cohort, not the count of users active in any given later month. Computing it once and joining lets every (cohort_month, months_since_signup) row divide by the same number.

#### Step 3: Months since signup via julianday

Both substr values are concatenated with the literal day component to make valid dates, then julianday gives a Julian day number. Subtraction yields days, divided by 30, CAST to INTEGER to truncate. This intentionally approximates a calendar month and matches the prompt; do NOT try to use Postgres style EXTRACT or DATE_TRUNC.

---

### The solution

**Three CTEs join into a retention grid**

```sql
WITH cohorts AS (SELECT user_id, substr(signup_date, 1, 7) AS cohort_month FROM users), activity AS (SELECT DISTINCT user_id, substr(session_start, 1, 7) AS active_month FROM user_sessions), cohort_sizes AS (SELECT cohort_month, COUNT(*) AS cohort_size FROM cohorts GROUP BY cohort_month) SELECT c.cohort_month, CAST((julianday(a.active_month || '-01') - julianday(c.cohort_month || '-01')) / 30 AS INTEGER) AS months_since_signup, CAST(COUNT(DISTINCT a.user_id) AS DOUBLE) / cs.cohort_size AS retention_rate FROM cohorts c JOIN activity a ON c.user_id = a.user_id JOIN cohort_sizes cs ON c.cohort_month = cs.cohort_month GROUP BY c.cohort_month, months_since_signup ORDER BY c.cohort_month, months_since_signup
```

> **Cost Analysis**
>
> users at 10M and user_sessions much larger drive the cost. The DISTINCT in activity is the most expensive single operation; sessions hash by (user_id, month). The join on user_id then collapses to a per cohort, per active_month aggregate of size at most 24 months times the number of cohort months. cohort_sizes is a tiny GROUP BY.

> **Interviewers Watch For**
>
> Whether you cast the numerator to DOUBLE before dividing (otherwise integer division gives 0), whether the cohort size is the full cohort rather than a per period count, and whether you correctly use DISTINCT user_id in activity to avoid inflating retention with users who logged in multiple times in a month.

> **Common Pitfall**
>
> Computing months_since_signup as (active year minus cohort year) times 12 plus (active month minus cohort month) without first padding both to YYYY-MM-01 looks tempting but breaks at year boundaries when month strings are compared lexically. The julianday divide by 30 trick is intentionally fuzzy but consistent. Also, dividing COUNT(DISTINCT a.user_id) by cs.cohort_size without CAST returns 0 for almost every cell.

---

## Common follow-up questions

- Why use julianday divided by 30 instead of computing exact calendar months? _(Tests whether the candidate knows that SQLite has no native month diff function and that the prompt explicitly accepts the approximation. In Postgres you would use AGE or EXTRACT(EPOCH FROM AGE).)_
- What happens to retention numbers if a user signs up, never returns, then signs up again later? _(Tests cohort assignment thinking. The current query takes the user's signup_date as fixed, so the second signup is invisible. Some products want 'last reactivation' cohorts, which requires a different cohort definition.)_
- How would you cap the output at the first 12 months since signup? _(Tests whether the candidate can add a final WHERE on the derived months_since_signup and discuss why filtering after the GROUP BY is acceptable here.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/monthly_cohort_retention)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.