# Multi-Category Buyers

> One-category shoppers are boring.

Canonical URL: <https://datadriven.io/problems/multi_category_buyers>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

The recommendations team is building a cross-category shopper segment. Ignoring products with no category on file, find users whose purchases span more than one product category. Return each qualifying username and how many categories they've bought from.

## Worked solution and explanation

### Why this problem exists in real interviews

This challenge forces you to reconcile users.username with transactions.quantity through HAVING filter and grouping. It surfaces in senior-level rounds because candidates who skip the grain check produce plausible but incorrect numbers.

> **Trick to Solving**
>
> Read the prompt carefully for implicit constraints. The phrase structure hints at the grain of the output: what each row represents.
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: innermost subquery first, then layer on filters and aggregates

---

### Break down the requirements

#### Step 1: Join tables with INNER JOIN

Connect `users` and `transactions` and `products` on `product_id`, `user_id` to keep only matching rows.

#### Step 2: Aggregate with COUNT DISTINCT

Group by the output grain and apply `COUNT DISTINCT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 3: Filter groups with HAVING

The `HAVING` clause filters after aggregation, unlike `WHERE` which filters before. This is necessary when the condition depends on an aggregate result.

#### Step 4: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**JOIN with HAVING for multi-category filter**

```sql
SELECT t.user_id, COUNT(DISTINCT p.category) AS category_count
FROM transactions t
JOIN products p ON t.product_id = p.product_id
GROUP BY t.user_id
HAVING COUNT(DISTINCT p.category) > 1
ORDER BY category_count DESC
```

> **Cost Analysis**
>
> The join touches `users` (15M rows) and `transactions` (120M rows) and `products` (40K rows). `transactions` is partitioned by `transaction_date`, which the optimizer can exploit with a partition filter. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Interviewers expect you to articulate why you chose a specific join type and what happens to unmatched rows. Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Forgetting that a JOIN can multiply rows when the relationship is one-to-many. Always check whether the join key is unique on at least one side.

---

## Common follow-up questions

- What happens to your result if products.rating contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on rating.)_
- What is the difference between filtering in WHERE versus HAVING for this query against users? _(Tests whether the candidate understands pre-aggregation vs post-aggregation filtering.)_
- With millions of distinct values in users.user_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like user_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/multi_category_buyers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.