# Product Ratings vs Sales

> Do higher ratings actually mean more revenue?

Canonical URL: <https://datadriven.io/problems/product_ratings_vs_sales>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Strategy wants to know whether higher-rated product categories actually pull more revenue. For each category (excluding products with no rating), compute the average rating and total transaction revenue.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets self-join across the `products` and `transactions` tables. You need to work with columns like `product_id`, `rating`, and `category` to satisfy the requirements.

---

### Break down the requirements

#### Step 1: Join products to transactions

`JOIN transactions ON products.product_id = transactions.product_id` connects each product to its sales.

#### Step 2: Exclude products with NULL rating

`WHERE products.rating IS NOT NULL` removes unrated products.

#### Step 3: Group by category

`GROUP BY category` with `AVG(rating)` and `SUM(total_amount)` produces the two metrics per category.

---

### The solution

**Join products to transactions to find product ratings vs sales**

```sql
SELECT
    p.category,
    AVG(p.rating) AS avg_rating,
    SUM(t.total_amount) AS total_revenue
FROM products p
JOIN transactions t ON p.product_id = t.product_id
WHERE p.rating IS NOT NULL
GROUP BY p.category
```

> **Cost Analysis**
>
> At `products` (25,000 rows), `transactions` (100,000,000 rows), a full table scan is expensive. Partition pruning (if the table is partitioned on the filter column) is the first optimization. A covering index on the `GROUP BY` + filter columns eliminates random I/O. Consider a materialized view for repeated dashboard queries.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What result would you get if every value in `products.rating` were NULL? Would your query return an empty set or something unexpected? _(Tests extreme NULL scenarios and whether the candidate guards against edge cases in `rating`.)_
- With 100,000,000 distinct values in `transactions.transaction_id`, how would a composite index on the GROUP BY columns change the execution plan? _(Probes understanding of how cardinality in `transaction_id` affects grouping and sort operations.)_
- If `products` contained late-arriving rows that were inserted after your query ran, how would you design an incremental update instead of re-aggregating? _(Tests understanding of incremental aggregation patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/product_ratings_vs_sales)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.