# Normalization Tradeoffs in Practice

> Clean data or fast queries? You can't always have both.

Canonical URL: <https://datadriven.io/problems/normalization_tradeoffs_in_practice>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

The finance team needs total revenue per product category, where revenue for a transaction is defined as the quantity sold times the listed price of the product. Order results by category.

## Worked solution and explanation

### Why this problem exists in real interviews

Connecting transactions to products requires getting the join cardinality right before applying grouping and summation. Interviewers use this in senior-level rounds to test whether you verify row counts after each join step.

---

### Break down the requirements

#### Step 1: Join transactions to products

`JOIN products ON transactions.product_id = products.product_id` brings in `price` and `category`.

#### Step 2: Compute revenue per row

`quantity * price` gives the revenue for each transaction as specified (not `total_amount`).

#### Step 3: Group by category and sum

`GROUP BY category` with `SUM(quantity * price)` produces total revenue per category.

#### Step 4: Order by category

`ORDER BY category` for alphabetical output.

---

### The solution

**Join and compute revenue as quantity times price**

```sql
SELECT p.category, SUM(t.quantity * p.price) AS total_revenue
FROM transactions t
JOIN products p ON t.product_id = p.product_id
GROUP BY p.category
ORDER BY p.category
```

> **Cost Analysis**
>
> At `transactions` (100,000,000 rows), `products` (20,000 rows), a full table scan is expensive. Partition pruning (if the table is partitioned on the filter column) is the first optimization. A covering index on the `GROUP BY` + filter columns eliminates random I/O. Consider a materialized view for repeated dashboard queries.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What happens to your result if products.rating contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on rating.)_
- If the join between transactions and products produces a fan-out, how does that affect your aggregate? _(Tests awareness of join cardinality and its impact on SUM, COUNT, and AVG results.)_
- With millions of distinct values in transactions.transaction_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like transaction_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/normalization_tradeoffs_in_practice)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.