# Top Buyers of Premium Products

> Which users bought the most top-rated products

Canonical URL: <https://datadriven.io/problems/top_buyers_of_premium_products>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Rank all products by rating, keeping only those with a known rating, and take the top 10 tier. Then find every user who purchased one of those products and count how many unique purchases each user made from that tier. Show the top 10 users by purchase count.

## Worked solution and explanation

### Why this problem exists in real interviews

Two stacked rankings on a 500M row `order_items` table joined to an 800k row `products` dim. Interviewer is watching whether you compute the premium tier first in a CTE, hash-build on that tiny subset, and define `purchase_count` precisely. Ambiguity around `DISTINCT` versus raw row count is the real trap.

---

### Break down the requirements

#### Step 1: Rank products by rating

CTE `top_rated` ranks `products` by `rating DESC` with `DENSE_RANK()`. Filter `rating IS NOT NULL` inside the CTE so unrated products never enter the ranking and cannot tie at the top.

#### Step 2: Cut to the top 10 tier

Apply `WHERE tr.rnk <= 10` after the join. `DENSE_RANK` means tier 10 may contain many products if ratings tie. That is the intended semantic.

#### Step 3: Join to order_items and count per user

`INNER JOIN order_items oi ON oi.product_id = tr.product_id`, then `GROUP BY oi.user_id` with `COUNT(DISTINCT oi.item_id)` so each line item counts once per user.

#### Step 4: Top 10 users

`ORDER BY purchase_count DESC LIMIT 10`. Add a tiebreaker on `user_id` if the interviewer cares about determinism.

---

### The solution

**TOP BUYERS OF PREMIUM TIER**

```sql
WITH top_rated AS (
  SELECT
    product_id,
    rating,
    DENSE_RANK() OVER (ORDER BY rating DESC) AS rnk
  FROM products
  WHERE rating IS NOT NULL
)
SELECT
  oi.user_id,
  COUNT(DISTINCT oi.item_id) AS purchase_count
FROM order_items oi
INNER JOIN top_rated tr
  ON oi.product_id = tr.product_id
WHERE tr.rnk <= 10
GROUP BY oi.user_id
ORDER BY purchase_count DESC
LIMIT 10
```

> **Cost Analysis**
>
> `products` is 800k rows, `order_items` is 500M. The CTE filters to the tier 10 subset (usually a few hundred products), which the planner hash-builds. The 500M row `order_items` becomes the probe side. Push `tr.rnk <= 10` into the CTE if your engine does not predicate-pushdown across the window function.

> **Interviewers Watch For**
>
> Before writing, ask two things out loud: (1) does premium mean top 10 distinct rating values via `DENSE_RANK`, or top 10 products via `ROW_NUMBER`, and (2) does a purchase mean one line item or one distinct product. Pick one and state it. Silent assumptions lose this question.

> **Common Pitfall**
>
> Writing `COUNT(*)` or `COUNT(oi.item_id)` instead of `COUNT(DISTINCT oi.item_id)`. On a denormalized join or a re-run with duplicate line items, raw counts overstate. Also dropping `rating IS NOT NULL`: nulls sort unpredictably across engines and can land in the top tier.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Swap `DENSE_RANK` for `ROW_NUMBER`. How does the result change? _(Probes whether you understand tie behavior and how the tier size can balloon under `DENSE_RANK`.)_
- Count distinct products purchased per user instead of distinct line items. _(Tests whether you can swap `COUNT(DISTINCT oi.item_id)` for `COUNT(DISTINCT oi.product_id)` without breaking the join.)_
- Add a filter so only purchases in the last 12 months count. _(Forces you to bring an `order_date` predicate onto `order_items` and reason about index coverage.)_
- Why is `INNER JOIN` correct here instead of `LEFT JOIN`? _(Checks whether you can articulate that buyers with zero premium purchases are not in the result by definition.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_buyers_of_premium_products)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.