# The Lion's Share

> Every category claims its slice. Find out who really owns the table.

Canonical URL: <https://datadriven.io/problems/the-lions-share-revenue-concentration>

Domain: SQL · Difficulty: medium · Seniority: mid

## Problem

The merchandising finance team suspects our revenue is far more concentrated than the catalog's breadth suggests: a handful of product categories may be quietly carrying the whole business while the rest contribute noise. To make the next planning cycle's investment case, they want a clear picture of how purchase revenue is distributed across product categories. For each category, show how much total purchase revenue it brought in, how many purchases it took to get there, and what portion of the company's overall revenue that category represents, presented from the biggest earner down to the smallest so the concentration is obvious at a glance.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the canonical 'revenue concentration' case study, and interviewers love it because it forces you to combine a per-group aggregation with a whole-table total in one pass. The naive instinct is to compute category revenue, then run a second query for the grand total, then divide. A strong candidate recognizes that the grand total can be computed alongside the grouped sums using an aggregate over a window, avoiding the extra round trip. Here you join transactions.total_amount to its products.category, sum revenue per category, count the purchases behind it, and express each category as a share of SUM over the whole result set.

---

### Break down the requirements

#### Step 1: Attach each purchase to a category

transactions has no category column, so each row must be matched to its product via product_id against products. An inner join is correct here: a purchase with no matching product carries no category to report on, and there is no requirement to surface empty categories.

#### Step 2: Collapse to one row per category

Grouping by products.category lets you fold every purchase in a category into a single output row. SUM(total_amount) gives the category revenue and COUNT(*) gives how many purchases produced it.

#### Step 3: Express each category against the whole

The share-of-total needs the grand total of all category revenue. SUM(SUM(total_amount)) OVER () computes the total of the per-group sums across the full result, so dividing the category sum by it (times 100) yields the percentage without a second query. ROUND keeps it readable, and ordering by total_revenue descending makes the concentration jump out.

---

### The solution

**Revenue share by category**

```sql
SELECT p.category AS category, SUM(t.total_amount) AS total_revenue, COUNT(*) AS transaction_count, ROUND(SUM(t.total_amount) * 100.0 / SUM(SUM(t.total_amount)) OVER (), 2) AS revenue_share_pct FROM transactions t JOIN products p ON t.product_id = p.product_id GROUP BY p.category ORDER BY total_revenue DESC, category ASC;
```

> **Cost Analysis**
>
> At realistic scale transactions is the heavy side, roughly 80M rows and 12 GB, while products is a small dimension of a few hundred thousand rows. The planner hash-joins the small product dimension into a probe over transactions, so the join is effectively a single streaming pass. The window function adds no extra scan: the SUM(SUM(...)) OVER () is evaluated after grouping over the handful of category rows, which is negligible. Partitioning or clustering transactions by transaction_date does not help this particular query since it touches all history, but a covering index on transactions(product_id, total_amount) lets the join probe stay off the base heap.

> **Interviewers Watch For**
>
> The signal is whether you compute the grand total in the same pass with a windowed aggregate rather than issuing a separate scalar subquery or a second query. Strong candidates also explicitly justify inner versus left join (categories with zero purchases are intentionally absent here) and multiply by 100.0 rather than 100 to avoid integer truncation.

> **Common Pitfall**
>
> The most common mistake is integer division: SUM(total_amount) * 100 / total written with integer literals truncates the percentage to whole numbers or zero. Using 100.0 forces floating-point math. A close second is forgetting that you can nest aggregates inside a window, leading people to compute the total with a correlated subquery that rescans the table per category.

---

## Common follow-up questions

- How would you change the query to also include categories that exist in the catalog but have never been purchased, showing them with zero revenue? _(Tests understanding of switching to a left join from products to transactions and handling NULL sums with COALESCE.)_
- If the team only cares about the categories that together make up the top 80% of revenue, how would you extend this? _(Tests layering a running cumulative share with an ordered window over the per-category result, i.e. a Pareto / running-total pattern.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-lions-share-revenue-concentration)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.