# Repeat Purchase Window

> The retention squad is looking for repeat purchasers.

Canonical URL: <https://datadriven.io/problems/repeat_purchase_window>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

A churn model needs repeat-purchase signals. Find users who made another purchase within 1 to 7 days of a previous transaction, excluding same-day transactions. Return each qualifying user ID once.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `transactions` table, this challenge probes your ability to apply LAG for prior-row comparison in a churn analysis setting. Correctly referencing the `user_id` and `transaction_date` columns is essential to a working solution.

> **Trick to Solving**
>
> Whenever the prompt asks you to compare a row to its predecessor or successor, that is a `LAG`/`LEAD` signal.
> 
> 1. Identify the comparison direction (previous vs. next)
> 2. Partition by the grouping key (e.g., user, service)
> 3. Order by the time or sequence column
> 4. Compute the difference in the outer query

---

### Break down the requirements

#### Step 1: Apply the range filter

The WHERE clause restricts rows to the target range. Applying this filter early reduces the volume flowing into downstream operations.

#### Step 2: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

#### Step 3: Use a subquery to find the reference value

The scalar subquery computes a single value (like the maximum) that the outer query filters against. This avoids a self-join.

---

### The solution

**Lag-compare for repeat purchase window**

```sql
SELECT DISTINCT user_id
FROM ( SELECT user_id, transaction_date, LAG(transaction_date) OVER (PARTITION BY user_id ORDER BY transaction_date) AS prev_date FROM transactions )
WHERE julianday(transaction_date) - julianday(prev_date) BETWEEN 1 AND 7
```

> **Cost Analysis**
>
> With ~90M rows, the window function runs on the reduced set after filtering and grouping. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you reach for window functions or attempt a self-join for row-to-row comparison; how you handle date arithmetic and whether you account for edge cases like month boundaries; whether you use a subquery or self-join, and can explain the tradeoffs.

> **Common Pitfall**
>
> Using string comparison instead of proper date arithmetic for date ranges can miss edge cases at midnight boundaries.

---

## Common follow-up questions

- What would happen to your result if `transactions.total_amount` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `total_amount` and uses DISTINCT or deduplication where needed.)_
- If `transactions` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `transaction_id`? _(Tests ability to identify performance hotspots related to `transactions.transaction_id` at scale.)_
- Your query uses LAG to compare adjacent rows. What happens if the partition has only one row, and how would you handle that? _(Tests edge-case handling when LAG returns NULL for partitions with insufficient rows.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/repeat_purchase_window)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.