# Returning Buyers

> They came back and bought again.

Canonical URL: <https://datadriven.io/problems/returning_buyers>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The retention team needs repeat-buyer signals. Find users who made a second transaction within 1 to 7 days of a previous one, excluding same-day purchases. Return each qualifying user ID once.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets self-join across the `transactions` table. You need to work with the `user_id` and `transaction_date` columns to satisfy the requirements.

> **Trick to Solving**
>
> When the prompt asks for pairs or combinations within the same table, a self-join is the pattern.
> 
> 1. Alias the table twice (e.g., `a` and `b`)
> 2. Join on the shared attribute (region, user, etc.)
> 3. Add `a.id < b.id` to avoid duplicate and self-pairs

---

### Break down the requirements

#### Step 1: Self-join the table to pair rows

Join `transactions` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 2: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

#### Step 3: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

---

### The solution

**Self-join the table to pair rows to find returning buyers**

```sql
SELECT DISTINCT t1.user_id
FROM transactions t1
INNER
JOIN transactions t2 ON t1.user_id = t2.user_id AND t2.transaction_date > t1.transaction_date AND julianday(t2.transaction_date) - julianday(t1.transaction_date) BETWEEN 1 AND 7
ORDER BY t1.user_id
```

> **Cost Analysis**
>
> With ~100M rows, the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you prevent duplicate pairs and ensure the join condition is correct; how you handle date arithmetic and whether you account for edge cases like month boundaries.

> **Common Pitfall**
>
> Using string comparison instead of proper date arithmetic for date ranges can miss edge cases at midnight boundaries.

---

## Common follow-up questions

- What would happen to your result if `transactions.quantity` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `quantity` and uses DISTINCT or deduplication where needed.)_
- If `transactions` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `user_id`? _(Tests ability to identify performance hotspots related to `transactions.user_id` at scale.)_
- How would you modify this query if the business logic required grouping by both `transaction_id` and `user_id` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/returning_buyers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.