# Category Buyers

> Which categories have the broadest reach?

Canonical URL: <https://datadriven.io/problems/category_buyers>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The merchandising team is deciding how to concentrate their next promotional push. For each product category, they need to know how many unique customers made a purchase and how much revenue those purchases generated in total. Only include categories that have attracted at least three unique buyers; niche categories with a smaller audience aren't in scope. Rank from the most revenue to the least.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests JOIN + GROUP BY + HAVING with multiple aggregates. Interviewers probe whether you can count distinct values, sum amounts, apply thresholds, and sort results in a single well-structured query.

---

### Break down the requirements

#### Step 1: Join products to transactions

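Inner-join `products` to `transactions` on `product_id` so that each transaction row carries its product's `category`. Products that were never purchased drop out of the result, which is fine here because they have no buyers.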
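
To make the join concrete, here is a minimal sketch on made-up rows. The table and column names follow the solution query; `transaction_id` and all sample values are assumptions for illustration only.

```sql
-- Hypothetical sample rows; transaction_id is an assumed column.
WITH products (product_id, category) AS (
    VALUES (1, 'Books'), (2, 'Books'), (3, 'Toys'), (4, 'Garden')
),
transactions (transaction_id, user_id, product_id, total_amount) AS (
    VALUES (101, 1, 1, 20.00),
           (102, 2, 1, 15.00),
           (103, 1, 3, 50.00)
)
SELECT t.transaction_id, t.user_id, p.category, t.total_amount
FROM products p
JOIN transactions t ON p.product_id = t.product_id
ORDER BY t.transaction_id;
-- Returns one row per transaction (101 and 102 in Books, 103 in Toys).
-- Garden has no transactions, so the inner join leaves it out.
```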

#### Step 2: Aggregate per category

Group by `p.category`, then use `COUNT(DISTINCT t.user_id)` for unique buyers and `SUM(t.total_amount)` for total revenue.
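
As a rough illustration of the two aggregates before any threshold is applied, the sketch below groups a few invented rows. The sample values are assumptions; only the column names come from the solution query.

```sql
-- Hypothetical sample rows for illustration.
WITH products (product_id, category) AS (
    VALUES (1, 'Books'), (2, 'Books'), (3, 'Toys')
),
transactions (user_id, product_id, total_amount) AS (
    VALUES (1, 1, 20.00),
           (2, 1, 15.00),
           (1, 2, 10.00),   -- user 1 buys a second Books product
           (3, 3, 50.00)
)
SELECT
    p.category,
    COUNT(DISTINCT t.user_id) AS unique_buyers,
    SUM(t.total_amount) AS total_revenue
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.category
ORDER BY p.category;
-- Books: 2 unique buyers, 45.00 revenue (three transactions, two users)
-- Toys:  1 unique buyer,  50.00 revenue
```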

#### Step 3: Filter with HAVING

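`HAVING COUNT(DISTINCT t.user_id) >= 3` removes niche categories. The filter runs after grouping, and in PostgreSQL the aggregate expression has to be repeated here because the `unique_buyers` alias cannot be referenced in `HAVING`.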

#### Step 4: Sort by revenue

`ORDER BY total_revenue DESC` ranks from highest revenue to lowest.

---

### The solution

**Distinct count with revenue aggregation**

```sql
SELECT
    p.category,
    COUNT(DISTINCT t.user_id) AS unique_buyers,
    SUM(t.total_amount) AS total_revenue
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.category
HAVING COUNT(DISTINCT t.user_id) >= 3
ORDER BY total_revenue DESC;
```
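
To try the query end to end, one option is a throwaway schema with a few seed rows. The sketch below is an assumption, not the problem's actual schema: it declares only the columns the query touches plus an assumed `transaction_id` key, and every value is invented. Wrapping it in a transaction that is rolled back leaves nothing behind.

```sql
BEGIN;

-- Assumed minimal schema (hypothetical; the real tables may have more columns).
CREATE TEMP TABLE products (
    product_id integer PRIMARY KEY,
    category   text NOT NULL
);
CREATE TEMP TABLE transactions (
    transaction_id integer PRIMARY KEY,
    user_id        integer NOT NULL,
    product_id     integer NOT NULL,
    total_amount   numeric(10, 2) NOT NULL
);

INSERT INTO products VALUES
    (1, 'Books'), (2, 'Books'), (3, 'Toys'), (4, 'Games'), (5, 'Garden');

INSERT INTO transactions VALUES
    (101, 1, 1, 20.00), (102, 2, 1, 15.00),   -- Books
    (103, 3, 2, 30.00), (104, 1, 2, 10.00),   -- Books, user 1 again
    (105, 1, 3, 50.00), (106, 1, 3, 50.00),   -- Toys, same user twice
    (107, 2, 3, 25.00),                       -- Toys
    (108, 4, 4, 60.00), (109, 5, 4, 70.00),   -- Games
    (110, 6, 4, 40.00);                       -- Games

SELECT
    p.category,
    COUNT(DISTINCT t.user_id) AS unique_buyers,
    SUM(t.total_amount) AS total_revenue
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.category
HAVING COUNT(DISTINCT t.user_id) >= 3
ORDER BY total_revenue DESC;
-- Games | 3 | 170.00
-- Books | 3 |  75.00
-- Toys (125.00 revenue) is excluded: only 2 distinct buyers.
-- Garden never appears: it has no transactions.

ROLLBACK;
```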

> **Cost Analysis**
>
> The plan is typically a hash join of 8K products to 50M transactions. `COUNT(DISTINCT user_id)` has to track the distinct users seen in each group (with a per-group set or a sort, depending on the engine), but with few categories (~20) the overhead is manageable. The bottleneck is scanning the 50M transaction rows.
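
One way to check these cost claims on your own setup is to inspect the plan with `EXPLAIN`. The sketch below builds scaled-down synthetic tables (8K products, 100K transactions rather than 50M); the row counts, value distributions, and `transaction_id` column are all assumptions, and the plan and timings you see will depend on the PostgreSQL version, settings, and data.

```sql
BEGIN;

-- Synthetic stand-ins, generated rather than real data.
CREATE TEMP TABLE products AS
SELECT g AS product_id,
       'cat_' || (g % 20)::text AS category      -- 20 categories
FROM generate_series(1, 8000) AS g;

CREATE TEMP TABLE transactions AS
SELECT g AS transaction_id,
       1 + (g % 5000) AS user_id,                -- 5,000 users
       1 + (g % 8000) AS product_id,
       round((random() * 100)::numeric, 2) AS total_amount
FROM generate_series(1, 100000) AS g;

-- Refresh planner statistics for the new tables.
ANALYZE products;
ANALYZE transactions;

-- Runs the query and reports the chosen plan with timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    p.category,
    COUNT(DISTINCT t.user_id) AS unique_buyers,
    SUM(t.total_amount) AS total_revenue
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.category
HAVING COUNT(DISTINCT t.user_id) >= 3
ORDER BY total_revenue DESC;

ROLLBACK;
```

In the output, look at which join node is chosen and how much of the total time goes to the scan of `transactions`.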

> **Interviewers Watch For**
>
> Using `COUNT(user_id)` instead of `COUNT(DISTINCT user_id)` is a frequent mistake: it counts purchase rows, so a customer who buys twice is counted twice and the buyer total is inflated. Interviewers specifically look for the `DISTINCT` keyword.
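
A small side-by-side comparison shows the difference. The rows are invented for illustration; one user buys from the same category twice.

```sql
-- Hypothetical sample rows for illustration.
WITH products (product_id, category) AS (
    VALUES (1, 'Toys')
),
transactions (user_id, product_id, total_amount) AS (
    VALUES (1, 1, 50.00),
           (1, 1, 50.00),   -- same user buys again
           (2, 1, 25.00)
)
SELECT
    p.category,
    COUNT(t.user_id)          AS purchase_rows,   -- 3
    COUNT(DISTINCT t.user_id) AS unique_buyers    -- 2
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.category;
-- With HAVING COUNT(t.user_id) >= 3, Toys would pass the threshold
-- even though only two distinct customers bought from it.
```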

> **Common Pitfall**
>
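> Putting the buyer-count condition in `WHERE` instead of `HAVING` does not work. `WHERE` filters individual rows before grouping, so aggregate functions are not allowed there and PostgreSQL rejects the query with an error. Conditions on aggregates belong in `HAVING`.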
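
If filtering by the alias reads more naturally to you, one alternative (a sketch, not the reference solution) is to aggregate in a subquery and apply an ordinary `WHERE` in the outer query, where `unique_buyers` is already a plain column. The subquery name `per_category` and the sample rows are assumptions.

```sql
-- Hypothetical sample rows for illustration.
WITH products (product_id, category) AS (
    VALUES (1, 'Books'), (2, 'Toys')
),
transactions (user_id, product_id, total_amount) AS (
    VALUES (1, 1, 20.00), (2, 1, 15.00), (3, 1, 30.00),   -- Books: 3 buyers
           (1, 2, 50.00), (2, 2, 25.00)                   -- Toys: 2 buyers
)
SELECT category, unique_buyers, total_revenue
FROM (
    SELECT
        p.category,
        COUNT(DISTINCT t.user_id) AS unique_buyers,
        SUM(t.total_amount) AS total_revenue
    FROM products p
    JOIN transactions t ON p.product_id = t.product_id
    GROUP BY p.category
) AS per_category
WHERE unique_buyers >= 3          -- legal here: filters already-aggregated rows
ORDER BY total_revenue DESC;
-- Books | 3 | 65.00
```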

---

## Common follow-up questions

- How would you include categories with fewer than 3 buyers but show them at the bottom? _(Tests removing HAVING and using a CASE in ORDER BY or a flag column.)_
- What if you also needed the average transaction size per category? _(Tests adding `AVG(t.total_amount)` to the SELECT and understanding of multiple aggregates.)_
- How would this query perform with 10M products instead of 8K? _(The hash join's memory use grows significantly; invites discussion of partitioned joins or index strategies.)_
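
The first two follow-ups can be sketched together. This is one possible answer under invented sample rows, not a definitive one: the `HAVING` clause is dropped, a hypothetical `meets_threshold` flag marks qualifying categories, a `CASE` in `ORDER BY` pushes the rest to the bottom, and `avg_transaction_amount` (also an assumed name) adds the average.

```sql
-- Hypothetical sample rows for illustration.
WITH products (product_id, category) AS (
    VALUES (1, 'Books'), (2, 'Toys')
),
transactions (user_id, product_id, total_amount) AS (
    VALUES (1, 1, 20.00), (2, 1, 15.00), (3, 1, 30.00),   -- Books: 3 buyers
           (1, 2, 50.00), (2, 2, 25.00)                   -- Toys: 2 buyers
)
SELECT
    p.category,
    COUNT(DISTINCT t.user_id) AS unique_buyers,
    SUM(t.total_amount) AS total_revenue,
    AVG(t.total_amount) AS avg_transaction_amount,
    COUNT(DISTINCT t.user_id) >= 3 AS meets_threshold
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.category
ORDER BY
    -- Qualifying categories first, then the rest.
    CASE WHEN COUNT(DISTINCT t.user_id) >= 3 THEN 0 ELSE 1 END,
    total_revenue DESC;
-- Books (3 buyers, 65.00) sorts first; Toys (2 buyers, 75.00) follows
-- despite its higher revenue, because it misses the threshold.
```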

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/category_buyers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.