# The Heaviest Carts

> Inside every age group, a few customers carry the basket. Find them.

Canonical URL: <https://datadriven.io/problems/the-heaviest-carts>

Domain: SQL · Difficulty: medium · Seniority: mid

## Problem

The growth team is segmenting customers by age cohort and wants to know who the biggest spenders are inside each bucket. For every age bucket, return the customer id alongside their total spend for the three customers who spent the most, from the biggest spender down.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a per-group top-N dressed up as customer segmentation. The skill being probed: can you rank inside a partition without flattening the groups together? Anyone can sum spend per customer; the part that separates people is computing the ranking SEPARATELY inside each age_bucket and then deciding which ranking function to use. Pick the wrong one and ties at the boundary silently change your answer.

---

### Break down the requirements

#### Step 1: Attach the cohort and roll up to one row per customer

age_bucket lives on users, spend lives on transactions, so you JOIN on user_id and GROUP BY age_bucket plus user_id with SUM(total_amount). One row per customer is the grain everything downstream depends on. Group by user_id alone and you lose the bucket; group by bucket alone and you collapse customers.

#### Step 2: Rank inside each bucket, not across all customers

PARTITION BY age_bucket restarts the ranking for every cohort, ORDER BY total_spent DESC puts the biggest spender at position 1. Forget the PARTITION BY and you get a single global leaderboard, so only the richest cohort ever shows up in the top 3.

#### Step 3: Keep rank <= 3 and order the output

Filter the windowed result to spend_rank <= 3, then ORDER BY age_bucket, total_spent DESC, user_id so the preview is stable and readable. The window value has to be computed before you can filter on it, which is why it lives in a CTE. Return only age_bucket, user_id, and total_spent; pulling username into the SELECT changes the result shape the grader compares against.

---

### The solution

**TOP SPENDERS PER AGE BUCKET**

```sql
WITH user_spend AS (
  SELECT u.age_bucket,
         t.user_id,
         SUM(t.total_amount) AS total_spent
  FROM transactions t
  JOIN users u ON u.user_id = t.user_id
  GROUP BY u.age_bucket, t.user_id
),
ranked AS (
  SELECT age_bucket,
         user_id,
         total_spent,
         DENSE_RANK() OVER (PARTITION BY age_bucket ORDER BY total_spent DESC) AS spend_rank
  FROM user_spend
)
SELECT age_bucket, user_id, total_spent
FROM ranked
WHERE spend_rank <= 3
ORDER BY age_bucket, total_spent DESC, user_id;
```

> **Common Pitfall**
>
> Reaching for ROW_NUMBER. If two customers in the same bucket are tied on total_spent, ROW_NUMBER breaks the tie arbitrarily and drops one of them below the cutoff, so your 'top 3' silently depends on physical row order and is non-deterministic. DENSE_RANK gives tied customers the same rank and keeps them all, which is what 'the three highest spenders' actually means.

**ROW_NUMBER**

Always 1,2,3,4... even on ties. Two customers tied for 3rd: one is labelled 3, the other 4 and is cut. The kept one depends on tiebreak chance.

**DENSE_RANK**

Ties share a rank: 1,2,3,3,4. Both customers tied for 3rd are kept. The result set is stable no matter how the engine orders equal rows.

> **Interviewers Watch For**
>
> Whether you say out loud which ranking function you chose and why, and whether you ask what should happen on ties before writing. Naming the PARTITION BY age_bucket as the thing that scopes the leaderboard per cohort is the tell that you have done this shape before.

> **Cost Analysis**
>
> On a 40M row transactions table the hash aggregate collapses it to roughly a few million (user, bucket) rows in one pass. The window sort is then partitioned by age_bucket, a handful of buckets, so each partition sorts independently and cheaply. No self-join, no correlated subquery. An index on transactions(user_id) keeps the join probe fast.

---

## Common follow-up questions

- Show only the single top spender per bucket, and if there is a tie return all of them. _(Tests whether they keep DENSE_RANK with rank = 1 rather than switching to ROW_NUMBER, which would arbitrarily drop tied leaders.)_
- Customers with no transactions should appear with zero spend. How does the query change? _(Forces a LEFT JOIN from users to transactions and COALESCE(SUM, 0), surfacing the inner-join-drops-zero-spenders trap.)_
- Now rank by spend within the last 90 days instead of all time. _(Probes pushing a transaction_date predicate into the CTE and whether they understand it narrows the population before ranking.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-heaviest-carts)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.