# Top Category by User Segment

> Each segment has a favorite category.

Canonical URL: <https://datadriven.io/problems/top_category_by_user_segment>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

For every account status, find the product category with the highest number of purchases. If multiple categories are tied within a segment, include all of them. Show the account status, category, and purchase count, sequenced by account status and category.

## Worked solution and explanation

### What this is really asking

RANK() per `account_status` over a pre-aggregated (status, category) count. The 200M-row transactions table forces you to aggregate first, then rank the groups, not rank rows. Ties in a segment all stay (RANK, not ROW_NUMBER).

---

### Break down the requirements

#### Step 1: Join and count per pair

Join transactions to users (for account_status) and products (for category), then COUNT(*) grouped by (account_status, category). That collapses 200M rows to maybe a few hundred groups.

#### Step 2: Rank inside each segment

RANK() OVER (PARTITION BY account_status ORDER BY COUNT(*) DESC). Using RANK keeps every tied top category; ROW_NUMBER would silently drop ties to one winner.

#### Step 3: Filter and order

Keep rk = 1 and ORDER BY account_status, category. The outer sort is by category name, not by count, because all surviving rows share the top count within their segment.

---

### The solution

**TOP CATEGORY PER ACCOUNT STATUS**

```sql
WITH ranked AS (
  SELECT
    u.account_status,
    p.category,
    COUNT(*) AS purchase_count,
    RANK() OVER (
      PARTITION BY u.account_status
      ORDER BY COUNT(*) DESC
    ) AS rk
  FROM transactions t
  JOIN users u    ON t.user_id    = u.user_id
  JOIN products p ON t.product_id = p.product_id
  GROUP BY u.account_status, p.category
)
SELECT account_status, category, purchase_count
FROM ranked
WHERE rk = 1
ORDER BY account_status, category;
```

> **Cost Analysis**
>
> Two hash joins fan transactions out by user_id and product_id, then a single GROUP BY collapses to (status, category) cardinality. The window runs over that small set, so ranking cost is negligible. The shuffle on user_id dominates.

> **Interviewers Watch For**
>
> RANK vs DENSE_RANK vs ROW_NUMBER. The prompt says 'include all tied categories', so RANK or DENSE_RANK work; ROW_NUMBER drops ties. Pick on purpose.

> **Common Pitfall**
>
> Putting RANK() in the same SELECT as COUNT(*) without a CTE. The window function reads the aggregate from the same level, which works in SQLite/Postgres but trips people up. Wrapping it in `ranked` keeps the filter on rk readable.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would the answer change if you used DENSE_RANK instead of RANK? _(Same result here because we only keep rk = 1, but the values differ for non-winners; explain when each matters.)_
- What if an account_status has zero transactions? _(It disappears from the output. Discuss LEFT JOIN from a status dimension if you need every segment present, even empty ones.)_
- How would you weight by total_amount instead of count? _(Swap COUNT(*) for SUM(t.total_amount) in both the aggregate and the ORDER BY of the window. Ties become much rarer.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_category_by_user_segment)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.