# Whale Watch

> The accounts driving the top line.

Canonical URL: <https://datadriven.io/problems/whale_watch>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The revenue operations team is building an account health dashboard and needs to identify high-value users. For each user, calculate their total spending, the number of transactions, and the average transaction size. Only surface users whose total spending exceeds five hundred dollars, sorted from biggest spender to smallest. Round the average to two decimal places.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests GROUP BY with multiple aggregates and a HAVING filter. Interviewers check whether you can compute several metrics per group and filter on one of them.

---

### Break down the requirements

#### Step 1: Aggregate per user

`GROUP BY user_id` with `SUM(total_amount)`, `COUNT(*)`, and `ROUND(AVG(total_amount), 2)` computes total spend, transaction count, and average transaction size.

#### Step 2: Filter to high-value users

`HAVING SUM(total_amount) > 500` surfaces only users whose lifetime spend exceeds $500.

#### Step 3: Order by spend descending

`ORDER BY total_spend DESC` lists the biggest spenders first.

---

### The solution

**Multi-aggregate with HAVING threshold**

```sql
SELECT
    user_id,
    SUM(total_amount) AS total_spend,
    COUNT(*) AS txn_count,
    ROUND(AVG(total_amount), 2) AS avg_txn_size
FROM transactions
GROUP BY user_id
HAVING SUM(total_amount) > 500
ORDER BY total_spend DESC
```

> **Cost Analysis**
>
> Full scan of 5M rows. GROUP BY reduces to 800K users. HAVING filters to the subset exceeding $500. The sort on the filtered result is fast.

> **Interviewers Watch For**
>
> Using HAVING instead of WHERE for the spend threshold. Since the filter is on an aggregate (SUM), it must be in HAVING, not WHERE.

> **Common Pitfall**
>
> Putting `SUM(total_amount) > 500` in the WHERE clause. WHERE filters individual rows before aggregation, so it cannot reference aggregates.

---

## Common follow-up questions

- How would you add a percentile rank for each user's total spend? _(Use PERCENT_RANK() OVER (ORDER BY SUM(total_amount)) in a subquery.)_
- What if total_amount included refunds (negative values)? _(SUM handles negatives; decide if net spend or gross spend is the right metric.)_
- How would you segment these users into spend tiers? _(Add a CASE expression for tier labels based on total_spend ranges.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/whale_watch)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.