# The Ones Who Return

> One purchase is a trial. Two is a habit. Find how many members formed one.

Canonical URL: <https://datadriven.io/problems/the-ones-who-return>

Domain: SQL · Difficulty: medium · Seniority: mid

## Problem

Our member growth team suspects that a large share of the people who buy from us try the service exactly once and never come back, while a smaller core keeps returning again and again, and before the quarterly retention review they want one anchor number to ground that story. Using the complete purchase history, work out what share of the entire member base counts as repeat buyers, meaning any member who shows up with more than a single purchase to their name, and express that share as a percentage of all members rounded to two decimal places.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a retention-rate question dressed up as a one-line metric, and interviewers love it because it forces you to reason at two grains at once. The raw transactions table has one row per purchase, but the business question is about members, so you first have to collapse purchases down to a per-member count, then ask a question about that derived population. The core skill being tested is recognizing that you cannot answer 'what fraction of members did X' by filtering rows directly; you have to aggregate to the member level first, then aggregate again across members. The candidate must also decide the denominator: the percentage is over every member who appears in the table, not over every row.

---

### Break down the requirements

#### Step 1: Collapse purchases to members

Each member can have many rows in transactions. Roll the table up by user_id and count rows per member so you have one number, txn_count, describing how active each member is. This per-member summary is the population you actually reason about.

#### Step 2: Flag the repeat buyers

A repeat buyer is any member whose txn_count is 2 or more. Use a conditional expression that yields 1 for those members and 0 otherwise, so summing it counts exactly the members who cleared the threshold.

#### Step 3: Divide by the full member base

Divide the repeat-buyer count by the total number of members in the summary, multiply by 100.0 to force floating-point math instead of integer truncation, and round to two decimals. The denominator is the count of members in the rolled-up set, not the count of transaction rows.

---

### The solution

**Repeat-member percentage**

```sql
WITH member_activity AS (
  SELECT user_id, COUNT(*) AS txn_count
  FROM transactions
  GROUP BY user_id
)
SELECT ROUND(100.0 * SUM(CASE WHEN txn_count >= 2 THEN 1 ELSE 0 END) / COUNT(*), 2) AS repeat_member_pct
FROM member_activity;
```

> **Cost Analysis**
>
> On a production transactions table of roughly 240M rows spanning 18 months and about 30 GB, the per-member rollup is the expensive step: it is a single hash aggregation on user_id, typically a few GB of spill if memory is tight, with no join to amplify the work. The outer aggregation runs over the distinct-member set (tens of millions of rows), which is cheap by comparison. If the table is partitioned by transaction_date you scan all partitions here since the metric is all-time, but the plan stays one pass with no sort, so it remains a single linear scan plus one grouping.

> **Interviewers Watch For**
>
> A strong candidate says out loud that the percentage is over members, not rows, and writes the two-level aggregation deliberately rather than reaching for a join. They also pre-empt integer division by multiplying by 100.0 (or casting) before dividing, and they confirm the threshold is inclusive of 2. Bonus points for noting the denominator excludes members with zero transactions because such members never appear in this table at all.

> **Common Pitfall**
>
> The classic mistake is computing the percentage at the row grain, for example dividing the count of rows from repeat buyers by the total row count, which massively overstates the result because heavy buyers contribute many rows. The second most common bug is integer division: writing 100 * SUM(...) / COUNT(*) without a float, which truncates to 0 in SQLite and most engines when the ratio is below 1.

---

## Common follow-up questions

- How would you extend this to report the repeat rate per month of signup rather than a single all-time number? _(Tests joining to a member dimension for the signup date and grouping the outer aggregation by a cohort key without double-counting members.)_
- If the table contained accidental duplicate purchase rows, how would your number change and how would you guard against it? _(Tests awareness that duplicate rows inflate per-member counts and push one-time buyers over the threshold, and whether the candidate would deduplicate on a natural key first.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-ones-who-return)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.