# Median Transaction by Category

> The middle transaction in each category.

Canonical URL: <https://datadriven.io/problems/median_transaction_by_category>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

Compute the median transaction amount for each product category. Show each category alongside its median.

## Worked solution and explanation

### Why this problem exists in real interviews

BI dashboards quote 'typical order size' as a median because mean is corrupted by enterprise outliers. Interviewers love this variant because it adds a join (transactions to products) on top of the rank pair median pattern, testing whether the candidate keeps the join inside the windowed CTE rather than trying to median per category and join after.

---

### Break down the requirements

#### Step 1: Inner join inside the windowed CTE

Join transactions to products on product_id BEFORE windowing. The category is what we partition by, and it lives on products. Doing the join first lets ROW_NUMBER and COUNT see the full enriched row.

#### Step 2: Apply the rank pair median windowing

ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY t.total_amount) for the position, COUNT(*) OVER (PARTITION BY p.category) for the partition size. The same (cnt+1)/2 and (cnt+2)/2 pair handles odd and even counts in pure SQLite.

#### Step 3: Average the picked rows per category

WHERE rn IN ((cnt+1)/2, (cnt+2)/2) keeps the middle one or two rows per category. AVG(total_amount) GROUP BY category. ORDER BY category alphabetically as the prompt requires.

---

### The solution

**Join inside CTE then rank pair median**

```sql
WITH ranked AS (
  SELECT p.category, t.total_amount,
         ROW_NUMBER() OVER (PARTITION BY p.category ORDER BY t.total_amount) AS rn,
         COUNT(*) OVER (PARTITION BY p.category) AS cnt
  FROM transactions t INNER JOIN products p ON t.product_id = p.product_id
)
SELECT category, AVG(total_amount) AS median_amount
FROM ranked
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY category
ORDER BY category
```

> **Cost Analysis**
>
> transactions at 180M rows joined to products at 50K is the cost driver. The hash join on product_id is one pass over transactions plus a build of the products hash. Then the window pass sorts the joined output by (category, total_amount). With a small number of categories, partition local sorts are cheap; the dominant cost is the join and the global sort prerequisite.

> **Interviewers Watch For**
>
> Whether you joined inside the windowed CTE (so the partition by category sees all rows correctly), whether you used PERCENTILE_CONT (which is not SQLite portable), and whether you handle even counts with the rank pair AVG rather than picking just one row.

> **Common Pitfall**
>
> Computing the median per product first and then averaging by category gives a wrong answer (median of medians is not median). The join must happen at the row level, then the window medians by category over the joined rows.

---

## Common follow-up questions

- Why is median of medians not equal to the true median by category? _(Tests statistical literacy. Aggregating then aggregating again loses the underlying distribution; outliers in one product can be hidden by the per product median.)_
- How would you weight the median by quantity instead of treating each transaction row equally? _(Tests advanced statistics. Replace ROW_NUMBER with a running SUM of quantity, and pick the rows where the cumulative weight crosses half of cnt_total.)_
- What changes if some transactions have NULL total_amount? _(Tests data hygiene. The current ORDER BY puts NULL first or last depending on the engine; in SQLite NULL sorts first ascending. The candidate should add a WHERE total_amount IS NOT NULL inside the CTE or accept that the rank pair will pick a NULL median for sparse categories.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/median_transaction_by_category)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.