# Above Category Average

> The category average is one thing. These beat it.

Canonical URL: <https://datadriven.io/problems/above_category_average>

Domain: SQL · Difficulty: easy · Seniority: L5

## Problem

The merchandising team wants to identify outperforming products ahead of the quarterly review. Find every product whose average transaction amount runs higher than the overall average transaction amount within its own category. Return the product name, its average transaction amount, and the category's average transaction amount.

## Worked solution and explanation

### Why this problem exists in real interviews

"Compare each row to its group average" is one of the most common questions in analytics interviews because it forces the candidate to produce two aggregates at different grains in the same query. The category average is the pooled mean across every transaction in the category, not an average of per-product averages; that distinction is the heart of the question.

> **Trick to Solving**
>
> Two aggregates at two grains. Per-product mean groups by `product_id`. Per-category mean groups by `category` only, over the same underlying transactions (pooled). Join the two on `category` and keep rows where product mean is strictly greater.

---

### Break down the requirements

#### Step 1: Compute each product's mean transaction amount

First CTE: join `products` to `transactions` on `product_id`, `GROUP BY product_id, product_name, category`, and compute `AVG(total_amount)`. Products with no transactions do not appear.

#### Step 2: Compute each category's pooled mean

Second CTE: join the same two tables, `GROUP BY category`, and compute `AVG(total_amount)`. This is `SUM(total_amount) / COUNT(*)` across every transaction in the category, which weights a product by how many times it sold. It is not an average of per-product averages.

#### Step 3: Join the two CTEs on category

Each product maps to exactly one category, so each product row picks up its one category average. The result stays at the product grain with the pooled category average attached.

#### Step 4: Keep strictly-above products

`WHERE product_avg > cat_avg`. Strictly greater, so ties do not qualify. Return `product_name`, `product_avg`, and `cat_avg` (renamed `category_avg` in the output).

---

### The solution

**Two-grain aggregation with strict filter**

```sql
WITH product_avg AS (
    SELECT p.product_id,
           p.product_name,
           p.category,
           AVG(t.total_amount) AS product_avg
    FROM products p
    JOIN transactions t ON t.product_id = p.product_id
    GROUP BY p.product_id, p.product_name, p.category
),
category_avg AS (
    SELECT p.category,
           AVG(t.total_amount) AS cat_avg
    FROM products p
    JOIN transactions t ON t.product_id = p.product_id
    GROUP BY p.category
)
SELECT pa.product_name,
       pa.product_avg,
       ca.cat_avg AS category_avg
FROM product_avg pa
JOIN category_avg ca ON ca.category = pa.category
WHERE pa.product_avg > ca.cat_avg
```

> **Cost Analysis**
>
> `transactions` has 200M rows; `products` has 50K. Both CTEs scan `transactions` once each, so the query reads 400M rows total. A window alternative (`AVG(total_amount) OVER (PARTITION BY category)` computed alongside the per-product aggregate in a single pass) is cheaper at scale because the table is read once.

> **Interviewers Watch For**
>
> The highest-signal detail is recognizing the pooled category average. Candidates who compute the category mean as an average of per-product averages get a different answer, usually smaller, because large-volume products no longer dominate. Strong candidates also offer the single-pass window-function variant as the performance-conscious alternative.

> **Common Pitfall**
>
> Computing the category average as `AVG(product_avg)` (an average of averages) is the classic trap. That weights every product equally, ignoring transaction volume. The prompt is explicit: the category mean is pooled across every transaction in the category, which is a plain `AVG(total_amount)` grouped by category only.

---

## Common follow-up questions

- Rewrite this in a single pass using a window function. _(Group by product with per-product `AVG(total_amount)`, then use aggregate-over-window: `SUM(SUM(total_amount)) OVER (PARTITION BY category) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY category)` computes the pooled category average from the per-product sums and counts in the same pass. Filter where product avg is greater. Reads `transactions` once.)_
- How would the answer change if "category average" meant an average of per-product averages? _(You would aggregate to per-product first, then take the mean of those. This weights every product equally and typically shifts the category mean toward low-volume products.)_
- How would you include products with no transactions, showing NULL for their average? _(`LEFT JOIN transactions` in the per-product CTE. Those products get NULL `product_avg`, which fails the strict `>` comparison and drops out anyway; add `COALESCE` if you want them retained as zeros.)_
- How would you rank products by how far above their category average they sit? _(Add `(product_avg - cat_avg) AS lift` and `ORDER BY lift DESC`. You can also return a percentage as `(product_avg / cat_avg - 1)`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/above_category_average)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.