# New vs Returning User Share

> Fresh faces versus familiar ones.

Canonical URL: <https://datadriven.io/problems/new_vs_returning_user_share>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

A user is 'new' in the month of their very first event in event_data; in every subsequent month they are 'returning'. For each month, compute the ratio of new users and returning users to total active users. Return the month, new user ratio, and returning user ratio.

## Worked solution and explanation

### Why this problem exists in real interviews

Cohort dashboards live on the new-vs-returning split, and the interviewer is checking whether you reach for a materialized anchor CTE instead of a correlated subquery that re-scans 300M rows per user. They also want to hear `COUNT(DISTINCT user_id)`, not `COUNT(*)`, when counting active users, because a user with 30 events in a month still counts once.

---

### Break down the requirements

#### Step 1: Anchor each user to their first month

`SELECT user_id, strftime('%Y-%m', MIN(event_timestamp)) FROM event_data GROUP BY user_id`. One row per user. This is the global truth used by every downstream month.

#### Step 2: Build the per-month active set

`SELECT DISTINCT user_id, strftime('%Y-%m', event_timestamp) AS month FROM event_data`. DISTINCT collapses a user firing 400 events in March into a single (user, month) row. Without it, the ratios become event-share, not user-share.

#### Step 3: Join active to anchor

Inner join on `user_id`. Every active user has exactly one anchor row, so the join is one-to-one per (user, month). No fan-out.

#### Step 4: Conditional aggregate the labels

`SUM(CASE WHEN ufm.first_month = ma.month THEN 1 ELSE 0 END)` for new, `< ma.month` for returning. Divide by `COUNT(*)` over the joined set, which equals distinct active users for that month. `CAST(... AS REAL)` to force float division.

---

### The solution

**MATERIALIZED-ONCE ANCHOR JOIN**

```sql
WITH user_first_month AS (
  SELECT
    user_id,
    strftime('%Y-%m', MIN(event_timestamp)) AS first_month
  FROM event_data
  GROUP BY user_id
),
monthly_active AS (
  SELECT DISTINCT
    user_id,
    strftime('%Y-%m', event_timestamp) AS month
  FROM event_data
)
SELECT
  ma.month,
  CAST(SUM(CASE WHEN ufm.first_month = ma.month THEN 1 ELSE 0 END) AS REAL) / COUNT(*) AS new_user_ratio,
  CAST(SUM(CASE WHEN ufm.first_month < ma.month THEN 1 ELSE 0 END) AS REAL) / COUNT(*) AS returning_user_ratio
FROM monthly_active ma
JOIN user_first_month ufm ON ma.user_id = ufm.user_id
GROUP BY ma.month
ORDER BY ma.month
```

> **Cost Analysis**
>
> Two full scans of the 300M-row `event_data`, one for `user_first_month` and one for `monthly_active`. Neither can be partition-pruned: computing a per-user MIN means you have to read every partition to know none of them holds an earlier event. Plan on a hash aggregate on `user_id` (cardinality is users, not events, so it fits memory) followed by a hash join keyed on `user_id`. The final group by `month` is cheap, 12 to 60 buckets typical.

> **Interviewers Watch For**
>
> The question a senior asks before writing SQL: 'first month by signup_date or by first event?' For users who register and lurk for three months, those diverge and the cohort sizes shift. The prompt says first event, so `MIN(event_timestamp)` is the right anchor, not a `users.signup_date` join. Calling that out unprompted is the signal.

> **Common Pitfall**
>
> Using `COUNT(*)` over raw `event_data` instead of `COUNT(DISTINCT user_id)`. A power user with 5,000 events in October dominates the denominator and the ratios become event-weighted, not user-weighted. The CTE wrapper with `SELECT DISTINCT user_id, month` is what makes the downstream `COUNT(*)` semantically equal to active users.

**Correlated, O(N x U)**

`WHERE NOT EXISTS (SELECT 1 FROM event_data e2 WHERE e2.user_id = e.user_id AND e2.event_timestamp < date_trunc('month', e.event_timestamp))`. Correct, but the planner re-probes `event_data` per outer row. On 300M rows this is hours, not minutes.

**Materialized CTE, O(N)**

Compute `MIN(event_timestamp)` per user once, store U rows, hash-join. One pass to build, one pass to probe. Scales linearly with events, not events times users.

> **The takeaway**
>
> Whenever the question is 'for each row in slice X, compare against this row's per-entity global state', materialize the per-entity state in a CTE. Correlated subqueries are the wrong shape past a few million rows.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would the query change if you wanted 'new' defined by `users.signup_date` instead of first event? _(Tests whether the candidate sees that anchor source is a product decision, not a SQL detail, and that the JOIN target swaps from a derived CTE to a dimension table.)_
- How would you incrementalize this so each new month doesn't re-scan all 300M rows? _(Probes whether they'd persist `user_first_month` as a slowly changing table (insert-only, anchor never moves backward) and only re-aggregate the latest month.)_
- What breaks if a user has events in March, then no events for a year, then events in next March? _(Tests the difference between 'new' (first ever) and 'resurrected' (returning after a gap), a common product-analytics nuance the prompt collapses.)_
- How would you add a third bucket for 'resurrected' users with a 90-day inactivity threshold? _(Forces a LAG over per-user months and a date diff, exercising window functions on top of the same anchor-join shape.)_
- The new_user_ratio in the first month of the dataset is always 1.0. Is that a bug? _(Catches whether the candidate recognizes the left-censoring boundary: anyone whose true first event predates the dataset gets falsely tagged new in month one.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/new_vs_returning_user_share)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.