# Top Revenue Products H1

> First half of the year. Which products led the revenue race?

Canonical URL: <https://datadriven.io/problems/top_revenue_products_h1>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The revenue team needs the top 5 products by total revenue for January through June. Show each product's ID and total revenue. If products are tied at the cutoff, include all of them.

## Worked solution and explanation

### Why this problem exists in real interviews

H1 of which year? The prompt says January through June and the expected query uses `strftime('%m', ...)`, which folds every year in the table into the same H1 bucket. On a 100M-row `transactions` table partitioned by `transaction_date`, that wording choice matters twice: once for correctness against business intent, once because the strftime predicate prevents partition pruning.

---

### Break down the requirements

#### Step 1: Filter to months 1 through 6

Extract the month from `transaction_date` and keep 1..6. The expected query casts the strftime result to INTEGER so `BETWEEN 1 AND 6` is numeric, not lexicographic.

#### Step 2: Sum revenue per product

Group by `product_id` and `SUM(total_amount)`. The per-row amount is already the line total, so no `quantity * price` multiplication is needed.

#### Step 3: Rank with tie inclusion

`DENSE_RANK` on the descending sum, keep `rnk <= 5`. A 3-way tie at rank 5 returns 7 rows, which is what "include all of them" calls for.

---

### The solution

**TOP REVENUE PRODUCTS H1**

```sql
SELECT product_id, total_revenue
FROM (
  SELECT product_id,
         SUM(total_amount) AS total_revenue,
         DENSE_RANK() OVER (ORDER BY SUM(total_amount) DESC) AS rnk
  FROM transactions
  WHERE CAST(strftime('%m', transaction_date) AS INTEGER) BETWEEN 1 AND 6
  GROUP BY product_id
) ranked
WHERE rnk <= 5
ORDER BY total_revenue DESC
```

> **Cost Analysis**
>
> `strftime('%m', transaction_date)` is a non-sargable function call on the partition key, so partition pruning is off and the planner scans every partition. If this were production code against 100M rows, swap to a contiguous range like `transaction_date >= '2024-01-01' AND transaction_date < '2024-07-01'` per year of interest.

> **Interviewers Watch For**
>
> Ask which H1. If they say "all years pooled," the strftime form is correct. If they say "current fiscal year," the strftime form is a bug that silently aggregates 2023, 2024, and 2025 January through June together. Naming the ambiguity is more useful than picking either answer in silence.

> **Common Pitfall**
>
> Using `LIMIT 5` instead of `DENSE_RANK <= 5`. `LIMIT` is non-deterministic on ties, so a tie at rank 5 returns whichever five rows the executor happened to emit first. The prompt explicitly says include all tied products at the cutoff.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you rewrite the predicate to keep partition pruning on while still expressing H1 cleanly? _(Probes whether you understand that wrapping the partition column in a function defeats the pruning optimization.)_
- If `total_amount` can be negative for refunds, does the ranking still reflect best sellers? _(Tests reasoning about gross vs net revenue and whether the metric matches business intent.)_
- How would you produce a year-over-year comparison for the same top 5 products? _(Checks whether you can pivot the result using conditional aggregation or a self-join on year.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_revenue_products_h1)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.