# Purchase Log

> Names on receipts, not just IDs.

Canonical URL: <https://datadriven.io/problems/purchase_log>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The customer success team is auditing recent purchase activity and needs a human-readable transaction log. For every transaction, show the buyer's username alongside the amount spent and the date of the purchase. Only include transactions that exceed fifty dollars.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is self-join, applied to the `users` and `transactions` tables in a purchase behavior context. Getting columns like `user_id`, `total_amount`, and `username` right is where most candidates slip.

---

### Break down the requirements

#### Step 1: Join users to transactions

`JOIN transactions ON users.user_id = transactions.user_id` pairs each transaction with the buyer's name.

#### Step 2: Filter to transactions over $50

`WHERE total_amount > 50` restricts to qualifying purchases.

#### Step 3: Return username, amount, and date

Select `username`, `total_amount`, and `transaction_date`.

---

### The solution

**Join users to transactions to find purchase log**

```sql
SELECT u.username, t.total_amount, t.transaction_date
FROM users u
JOIN transactions t ON u.user_id = t.user_id
WHERE t.total_amount > 50
```

> **Cost Analysis**
>
> The query scans `users` (1,000,000 rows), `transactions` (5,000,000 rows). A covering index on the filter and group columns would reduce I/O. At this scale, the full scan is acceptable but becomes costly if the table grows 10x.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- The `email` column in `users` has a 5% null rate. How does your query handle rows where `email` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `users.email` and understands how aggregates skip NULL values.)_
- If `users` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `user_id`? _(Tests ability to identify performance hotspots related to `users.user_id` at scale.)_
- How would you modify this query if the business logic required grouping by both `user_id` and `username` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/purchase_log)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.