# Big Spenders

> The whale list.

Canonical URL: <https://datadriven.io/problems/big_spenders>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Marketing is launching a loyalty tier program and needs to identify high-value customers. For each user, calculate their total spending and the number of transactions they have made. Only include users whose lifetime spending exceeds five hundred dollars. List them from highest spender to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

Loyalty-tier questions probe whether you reflexively know that `WHERE` filters rows and `HAVING` filters groups. `WHERE total_amount > 500` keeps individual transactions over $500, not users whose lifetime exceeds $500. The interviewer is also checking whether you'd ask 'fixed threshold or top percentile?' before assuming the literal.

---

### Break down the requirements

#### Step 1: Group by user

`GROUP BY user_id`. Every projected column is either the key or an aggregate over `transactions`.

#### Step 2: Two aggregates

`SUM(total_amount) AS lifetime_spend` and `COUNT(*) AS tx_count`. Alias both; the threshold and the sort both reference the sum.

#### Step 3: Threshold after the rollup

`HAVING SUM(total_amount) > 500`. The aggregate doesn't exist before grouping, so this cannot live in `WHERE`.

#### Step 4: Sort the survivors

`ORDER BY lifetime_spend DESC`. The alias is fine in `ORDER BY` in every major dialect.

---

### The solution

**GROUP THEN FILTER**

```sql
SELECT user_id,
       SUM(total_amount) AS lifetime_spend,
       COUNT(*) AS tx_count
FROM transactions
GROUP BY user_id
HAVING SUM(total_amount) > 500
ORDER BY lifetime_spend DESC
```

> **Cost Analysis**
>
> 5M rows, high-cardinality `user_id`. Engine does a full scan plus hash aggregate. Partition pruning on `transaction_date` doesn't help; the prompt says lifetime. If `user_id` cardinality is in the millions, watch hash-agg memory and consider a sort-based agg with a covering index on `(user_id, total_amount)`.

> **Interviewers Watch For**
>
> Ask: is `> 500` a hard dollar threshold or shorthand for a top-tier cohort? If marketing wanted the top 5%, the query is a `HAVING` against a percentile subquery, not a literal. Clarify before writing.

> **Common Pitfall**
>
> Writing `WHERE total_amount > 500` filters individual transactions over $500, not users whose lifetime sum exceeds $500. Different cohort, wrong answer. The threshold is on the aggregate, so it belongs in `HAVING`.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Change `>$500 lifetime` to `>$500 in the last 90 days`. _(Tests adding `WHERE transaction_date >= CURRENT_DATE - INTERVAL '90 days'` and confirms you remember it filters before the group.)_
- Include each user's most recent transaction date. _(Adds `MAX(transaction_date)` to the SELECT, checking aggregate-with-grouping fluency.)_
- Rewrite to define big spenders as the top 5% by total spend. _(Probes window functions: `NTILE(20)` or `PERCENT_RANK()` in a subquery, then filter outside.)_
- What if `total_amount` has negative values for refunds? _(Checks whether you'd net them out, exclude them, or compute gross spend with `SUM(CASE WHEN total_amount > 0 THEN total_amount END)`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/big_spenders)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.