# Top Product Category by Transactions

> Organic purchases, no marketing nudge. Which category wins?

Canonical URL: <https://datadriven.io/problems/top_product_category_by_transactions>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Which product category had the highest transaction volume in 2026? Return the category name and transaction count. If there's a tie, include all tied categories.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the canonical fact-to-dimension lookup. `transactions` carries `product_id`; the attribute you actually want (`category`) lives one hop away on `products`. The interviewer is checking that you reach for a JOIN + GROUP BY on the dim attribute (not on the FK), and that you handle ties without hand-waving. 100M rows on one side, 30K on the other: shapes the plan you describe out loud.

---

### Break down the requirements

#### Step 1: Filter the fact first

Apply `WHERE strftime('%Y', t.transaction_date) = '2026'` on `transactions` before joining. Partition is `transaction_date`, so this is the only clause that matters for cost.

#### Step 2: Join to the dim

`JOIN products p ON t.product_id = p.product_id`. INNER is fine if you trust referential integrity. The join exists only to pull `p.category` onto each row.

#### Step 3: Aggregate on the attribute

`GROUP BY p.category`, `COUNT(*) AS transaction_count`. Grouping on the dim attribute (not `product_id`) is the whole point: same category, many products, one bucket.

#### Step 4: Surface ties, not one arbitrary row

`LIMIT 1` silently drops ties. Use `HAVING COUNT(*) = (SELECT MAX(cnt) FROM (... grouped subquery ...))` so every tied category comes back. `ORDER BY p.category` for stable output.

---

### The solution

**GROUP-AGGREGATE WITH TIE-SAFE FILTER**

```sql
SELECT p.category,
       COUNT(*) AS transaction_count
FROM transactions t
JOIN products p ON t.product_id = p.product_id
WHERE strftime('%Y', t.transaction_date) = '2026'
GROUP BY p.category
HAVING COUNT(*) = (
  SELECT MAX(cnt) FROM (
    SELECT COUNT(*) AS cnt
    FROM transactions t2
    JOIN products p2 ON t2.product_id = p2.product_id
    WHERE strftime('%Y', t2.transaction_date) = '2026'
    GROUP BY p2.category
  )
)
ORDER BY p.category;
```

> **Cost Analysis**
>
> 30K-row `products` is tiny, planner will hash-build on it and probe from `transactions`. The dominant cost is scanning the right year-partition of the 100M-row fact, so the year filter MUST prune partitions (avoid wrapping `transaction_date` in a function that defeats pruning on engines that care; SQLite tolerates `strftime`, Postgres/Snowflake do not). An index on `transactions.product_id` is helpful for the probe; nothing else moves the needle.

> **Interviewers Watch For**
>
> Whether you ask `Could two categories tie?` before writing the query. If you reach straight for `ORDER BY ... LIMIT 1`, the senior interviewer notes that you silently picked one. The prompt explicitly says include all tied categories, so a `HAVING max` or `DENSE_RANK = 1` shape is the correct read.

> **Common Pitfall**
>
> Writing `GROUP BY t.product_id` and then projecting `p.category` collapses each product into its own bucket instead of rolling categories up. You get the top-selling product's category, not the top category. The fix is grouping on `p.category` directly.

**LIMIT 1 (ties hidden)**

`ORDER BY transaction_count DESC LIMIT 1` returns one category. If Electronics and Apparel both have 9,812,334 transactions, you ship one and call it done. Wrong answer per the prompt.

**HAVING = MAX (ties surfaced)**

`HAVING COUNT(*) = (SELECT MAX(cnt) FROM grouped)` returns every category at the peak. `DENSE_RANK() OVER (ORDER BY cnt DESC)` filtered to rank 1 is the window-function equivalent.

> **Key Insight**
>
> Group on the attribute you want to rank (`p.category`), not the join key (`product_id`). The join is plumbing; the GROUP BY is the answer.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- What changes if some `transactions.product_id` values point to deleted products? _(Probes INNER vs LEFT JOIN and whether you'd bucket orphans into a `COALESCE(category, 'unknown')` group or drop them silently.)_
- Now return the top category per month for the year. How does the query change? _(Forces a per-group ranking shape: `DENSE_RANK() OVER (PARTITION BY month ORDER BY cnt DESC)` filtered to rank 1, instead of one global MAX.)_
- If `transactions` grew to 10B rows, what would you change? _(Tests whether you reach for partition pruning, pre-aggregated daily rollups, or materialized views, and whether you recognize the dim join cost is still negligible.)_
- Could you answer this without joining, given the schema? _(Checks if you'd denormalize `category` onto `transactions` (or use a mapping table cached client-side) when this query runs hourly on a dashboard.)_
- Why `COUNT(*)` and not `COUNT(t.transaction_id)` or `SUM(quantity)`? _(Probes the difference between row-count, non-null-count, and unit-count, and which one matches `transaction volume` in the prompt.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_product_category_by_transactions)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.