# Above the Curve

> Spenders who break from the pack.

Canonical URL: <https://datadriven.io/problems/above_the_curve>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The growth team is building a high-value customer segment for a targeted campaign. Compute each user's total spend across all transactions, then surface only users whose total exceeds the average spend. For each qualifying user, show their ID, total spend, and how far above the average they land. Present them sorted from biggest spender to smallest.

## Worked solution and explanation

### Why this problem exists in real interviews

Growth and lifecycle teams constantly ask for above-average cohorts off `transactions`. The interviewer is checking whether you can compute a derived threshold (the average of per-user totals) and filter against it, without collapsing the grain. The trap is computing `AVG(total_amount)` over rows instead of over user sums.

---

### Break down the requirements

#### Step 1: Roll up to user grain

Aggregate `transactions` with `GROUP BY user_id` and `SUM(total_amount)`. This is the unit the prompt cares about, not raw rows.

#### Step 2: Define the threshold

The cutoff is `AVG(total)` over the per-user totals, not `AVG(total_amount)` over transactions. Put the rollup in a CTE so the average runs against the right grain.

#### Step 3: Filter and project the delta

Keep users where `total > (SELECT AVG(total) FROM totals)`, then project `total - that_average` as `above_avg`. The scalar subquery runs once.

#### Step 4: Sort biggest first

`ORDER BY total DESC` so the campaign team sees the whales at the top.

---

### The solution

**ABOVE AVERAGE USERS**

```sql
WITH totals AS (
  SELECT user_id, SUM(total_amount) AS total
  FROM transactions
  GROUP BY user_id
)
SELECT
  user_id,
  total,
  total - (SELECT AVG(total) FROM totals) AS above_avg
FROM totals
WHERE total > (SELECT AVG(total) FROM totals)
ORDER BY total DESC
```

> **Cost Analysis**
>
> 50M rows hash-aggregate into roughly a few million user buckets in one pass. The scalar `AVG(total)` is a second cheap scan of the CTE, materialized once by most planners. No window function needed, no self-join. Partition pruning on `transaction_date` only helps if you add a date filter.

> **Interviewers Watch For**
>
> Before writing, say out loud: the average is over per-user totals, not over transaction rows. Then ask whether refunds or negative `total_amount` count, and whether inactive users (zero spend) should pull the mean down.

> **Common Pitfall**
>
> Trying to do it with `HAVING SUM(total_amount) > AVG(SUM(total_amount))` does not work; `HAVING` evaluates per group and cannot reach the overall average. You need a CTE plus a scalar subquery, or a window `AVG() OVER ()` on the rollup.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you rewrite this using a window function instead of a scalar subquery? _(Probes whether you can swap `(SELECT AVG(total) FROM totals)` for `AVG(total) OVER ()` and reason about the tradeoff.)_
- What changes if we want users above the 90th percentile instead of the mean? _(Tests `PERCENTILE_CONT` or `NTILE(10)` and whether you understand mean versus quantile thresholds.)_
- How do you make this incremental over a rolling 30 day window? _(Pushes you on `WHERE transaction_date >= CURRENT_DATE - INTERVAL '30 day'` and partition pruning on the 50M row table.)_
- What if `total_amount` can be negative due to refunds? _(Surfaces whether you net them, exclude them, or treat refunds as a separate signal.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/above_the_curve)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.