# Popular Categories

> Merchandising only cares about categories big enough to negotiate shelf space.

Canonical URL: <https://datadriven.io/problems/popular_categories>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The merchandising team is negotiating shelf-space contracts and only wants to discuss product categories that have meaningful scale. For each category, show the number of products, the average rating, and the average price. Only include categories with more than eight products. Present them from highest average rating to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

This product analytics problem uses the `products` table to evaluate HAVING for post-aggregation filtering. Watch how columns like `category`, `price`, and `rating` interact in the grouping and filtering logic.

---

### Break down the requirements

#### Step 1: Group by category

`GROUP BY category` collapses all products into one row per category.

#### Step 2: Compute product count, avg rating, avg price

`COUNT(*)`, `AVG(rating)`, and `AVG(price)` produce the three requested metrics.

#### Step 3: Filter with HAVING

`HAVING COUNT(*) > 8` excludes small categories.

#### Step 4: Order by avg rating descending

`ORDER BY AVG(rating) DESC` surfaces the highest-rated categories first.

---

### The solution

**Having filter for popular categories**

```sql
SELECT
    category,
    COUNT(*) AS product_count,
    AVG(rating) AS avg_rating,
    AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING COUNT(*) > 8
ORDER BY avg_rating DESC
```

> **Cost Analysis**
>
> With `products` (6,000 rows), this query scans a small dataset. No indexing is needed for this volume. At production scale, an index on the primary filter column would improve performance.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- If `rating` in `products` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `products.rating`.)_
- The `in_stock` column in `products` is dominated by a single hot value. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `products.in_stock` and its impact on distributed query performance.)_
- If the HAVING threshold in your query changed from a fixed number to a percentile, how would you restructure the query? _(Tests ability to replace static HAVING filters with dynamic subquery-based thresholds.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/popular_categories)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.