# Second Purchase

> The first buy is curiosity. The second is commitment.

Canonical URL: <https://datadriven.io/problems/second_purchase>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

The onboarding team wants to study second-purchase behavior. For each user with more than one transaction, order their transactions by transaction_date and isolate the second one. Return the user_id, total_amount, and transaction_date of that second purchase.

## Worked solution and explanation

### Why this problem exists in real interviews

This purchase behavior problem uses the `transactions` table to evaluate ROW_NUMBER for sequential numbering. Watch how columns like `user_id`, `total_amount`, and `transaction_date` interact in the grouping and filtering logic.

> **Trick to Solving**
>
> Complex queries with multiple transformation stages are best solved by writing one CTE per stage.
> 
> 1. Identify the intermediate results needed
> 2. Write each as a named CTE
> 3. Chain them: each CTE reads from the previous one
> 4. The final SELECT assembles the output

---

### Break down the requirements

#### Step 1: Isolate the intermediate result in a CTE

The `numbered` CTE computes the intermediate aggregation that the outer query builds on. This separation keeps each layer focused on a single task.

#### Step 2: Select the target columns

The SELECT clause picks exactly the columns the prompt asks for. Returning extra columns or missing a required alias would fail the grading check.

---

### The solution

**Row-number for second purchase**

```sql
WITH numbered AS (
    SELECT user_id, total_amount, transaction_date, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS rn
    FROM transactions
)
SELECT user_id, total_amount, transaction_date
FROM numbered
WHERE rn = 2
```

> **Cost Analysis**
>
> With ~150M rows, the window function runs on the reduced set after filtering and grouping; CTEs materialize intermediate results, which can be beneficial or costly depending on the engine. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you decompose the problem into named, testable stages rather than nesting everything; whether you use a subquery or self-join, and can explain the tradeoffs.

> **Common Pitfall**
>
> Returning extra columns that the prompt did not ask for, or using the wrong column alias, causes a grading mismatch even when the logic is correct.

---

## Common follow-up questions

- What would happen to your result if `transactions.quantity` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `quantity` and uses DISTINCT or deduplication where needed.)_
- With 8,000,000 distinct values in `transactions.user_id`, how would a composite index on the GROUP BY columns change the execution plan? _(Probes understanding of how cardinality in `user_id` affects grouping and sort operations.)_
- Does your database engine materialize the CTE or inline it? How would that affect repeated scans of `transactions`? _(Tests understanding of CTE materialization semantics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/second_purchase)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.