# The Clean Aisle Numbers

> Clear the noise. What did each category actually earn?

Canonical URL: <https://datadriven.io/problems/deduplicated_sales_volume_by_category>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

The transactions table contains duplicate records. First deduplicate by keeping only the first occurrence of each transaction (the one with the lowest transaction_id when multiple rows share the same user_id, product_id, total_amount, and transaction_date). Then combine with the products table and compute the total sales amount per product category, only counting transactions where total_amount is positive.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests multi-step deduplication with a specific dedup key, followed by a join and filtered aggregation. The complexity lies in the dedup definition: keep the lowest transaction_id among rows sharing (user_id, product_id, total_amount, transaction_date).

---

### Break down the requirements

#### Step 1: Deduplicate transactions

Use `ROW_NUMBER() OVER (PARTITION BY user_id, product_id, total_amount, transaction_date ORDER BY transaction_id ASC)` and keep row 1.

#### Step 2: Join to products

Join deduplicated transactions to `products` on `product_id` for the category.

#### Step 3: Filter and aggregate

`WHERE total_amount > 0` removes non-positive amounts. `GROUP BY category` with `SUM(total_amount)`.

---

### The solution

**Multi-key deduplication with join and filtered sum**

```sql
WITH deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id, product_id, total_amount, transaction_date
            ORDER BY transaction_id ASC
        ) AS rn
    FROM transactions
)
SELECT p.category, SUM(d.total_amount) AS total_sales
FROM deduped d
JOIN products p ON d.product_id = p.product_id
WHERE d.rn = 1
  AND d.total_amount > 0
GROUP BY p.category
```

> **Cost Analysis**
>
> Window function over 100M rows with a 4-column partition key. The sort is expensive. After dedup and filtering, the join to 40K products is cheap. Output is a handful of categories.

> **Interviewers Watch For**
>
> Whether the candidate partitions by all four dedup columns (user_id, product_id, total_amount, transaction_date) and orders by transaction_id to keep the first occurrence.

> **Common Pitfall**
>
> Deduplicating on fewer columns (e.g., just transaction_id) would not catch the actual duplicates. The dedup key must match the problem's definition exactly.

---

## Common follow-up questions

- How would you measure how many duplicates were removed? _(Compare COUNT(*) before and after dedup, or count rows where rn > 1.)_
- What if total_amount could be zero? _(The > 0 filter excludes zero amounts. Discuss whether zero-value transactions should be included.)_
- How would you handle this dedup in a production pipeline? _(Tests idempotent dedup strategies: staging tables, MERGE/UPSERT, or hash-based dedup.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/deduplicated_sales_volume_by_category)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.