# Top Cost Categories

> Three categories eating the budget.

Canonical URL: <https://datadriven.io/problems/top_cost_categories>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Categories are tracked in the budget allocation system, but actual billing amounts live in the cost tracking system. Attribute each cloud charge to the correct budget category by matching the allocation for that service and region. Show the top 3 categories by total amount.

## Worked solution and explanation

### Why this problem exists in real interviews

They're probing whether you force a conversation about grain before you join two 12M+ row partitioned tables. `cost_allocs` has `(svc_name, region, period, team_name)` granularity, so a naive join on `(svc_name, region)` fans out. They also want to see if you ask about `bill_date` scoping before scanning every partition.

---

### Break down the requirements

#### Step 1: Pin down the category source

Ask: `is category on cloud_costs or only on cost_allocs?`. The schema says `cost_allocs.category`, so you must join to get it. If they had wanted a derived category from `svc_name`, you'd write a `CASE` instead.

#### Step 2: Define the join key

`ON cc.svc_name = ca.svc_name AND cc.region = ca.region`. Verify with the interviewer that `(svc_name, region)` uniquely identifies a category in `cost_allocs`. If multiple `team_name` rows share that key, you'll double-count `cc.amount`.

#### Step 3: Ask about partition pruning

Both tables are partitioned. Ask: `is there a bill_date or period window?`. If yes, add `WHERE cc.bill_date BETWEEN ... AND ...` and the matching `ca.period` predicate so the planner prunes partitions instead of full-scanning 27M rows.

#### Step 4: Aggregate and rank

`SUM(cc.amount)` grouped by `ca.category`, `ORDER BY total_amount DESC LIMIT 3`. `LIMIT 3` is fine here because the prompt says `top 3 categories`, no `tiers` language, no tie-handling requirement implied.

---

### The solution

**TOP 3 CATEGORIES BY ATTRIBUTED COST**

```sql
SELECT ca.category,
       SUM(cc.amount) AS total_amount
FROM cloud_costs cc
INNER JOIN cost_allocs ca
  ON cc.svc_name = ca.svc_name
 AND cc.region   = ca.region
GROUP BY ca.category
ORDER BY total_amount DESC
LIMIT 3;
```

> **Cost Analysis**
>
> 12M `cloud_costs` rows hash-joined to 15M `cost_allocs` on `(svc_name, region)`. Without a time predicate, both tables full-scan every partition. With `WHERE cc.bill_date >= '2026-04-01' AND ca.period >= '2026-04-01'`, the planner prunes partitions on both sides and the join shrinks an order of magnitude.

> **Interviewers Watch For**
>
> Ask aloud: `is category a column on cloud_costs or only on cost_allocs?` and `is (svc_name, region) unique in cost_allocs?`. If the answer to the second is no, you need a dedup CTE on `cost_allocs` before joining or `cc.amount` gets multiplied per `team_name` row.

> **Common Pitfall**
>
> Summing `ca.amount` instead of `cc.amount`. The prompt says `actual billing amounts live in the cost tracking system`, which is `cloud_costs`. `cost_allocs.amount` is the budget allocation, a different metric. Pick the wrong column and the answer is plausible but wrong.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- What if `(svc_name, region)` isn't unique in `cost_allocs` because each team has its own row? _(Forces a pre-aggregation CTE or `DISTINCT ON` to collapse `cost_allocs` to one row per join key before joining.)_
- Scope this to Q1 2026 only. Where do the predicates go? _(Tests partition pruning awareness on both `cc.bill_date` and `ca.period`, not just one side.)_
- What if a `cloud_costs` row has a `svc_name` with no matching allocation? _(Probes whether they switch to `LEFT JOIN` and bucket unmatched spend into an `'unallocated'` category via `COALESCE`.)_
- Return the top 3 categories per region instead of overall. _(Pushes them to `ROW_NUMBER() OVER (PARTITION BY region ORDER BY SUM(amount) DESC)` in an outer layer.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_cost_categories)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.