# Over-Budget Services

> Over budget. Flagged.

Canonical URL: <https://datadriven.io/problems/over_budget_services>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Combine budget allocations with actual cloud charges for each service during the same billing period, and pull in any related batch job records. Prorate each service's cost by multiplying allocated amount by actual cloud cost and dividing by hours in a year. Show service name and total prorated cost for services with positive prorated cost, ranked highest first.

## Worked solution and explanation

### Why this problem exists in real interviews

The core test is combining rows from cloud_costs and cost_allocs using HAVING filter and grouping, where a fan-out on the join key will silently corrupt aggregates. This appears in mid-level screens to separate careful thinkers from syntax-first candidates.

---

### Break down the requirements

#### Step 1: Join cloud_costs to cost_allocs on service and period

`JOIN cost_allocs ON cloud_costs.svc_name = cost_allocs.svc_name AND cloud_costs.bill_date = cost_allocs.period` matches actual costs to budgets.

#### Step 2: Compute prorated cost

`cost_allocs.amount * cloud_costs.amount / 8760` applies the proration formula (hours in a year).

#### Step 3: Group by service and sum

`GROUP BY svc_name` with `SUM(...)` produces total prorated cost per service.

#### Step 4: Filter and order

`HAVING SUM(...) > 0 ORDER BY total_prorated DESC` shows positive-cost services ranked highest first.

---

### The solution

**Multi-table join with proration formula**

```sql
SELECT
    cc.svc_name,
    SUM(ca.amount * cc.amount / 8760.0) AS total_prorated
FROM cloud_costs cc
JOIN cost_allocs ca
  ON cc.svc_name = ca.svc_name
  AND cc.bill_date = ca.period
GROUP BY cc.svc_name
HAVING SUM(ca.amount * cc.amount / 8760.0) > 0
ORDER BY total_prorated DESC
```

> **Cost Analysis**
>
> With `cloud_costs` (15,000,000 rows), `cost_allocs` (20,000,000 rows), `batch_jobs` (500,000 rows), the full scan reads significant data. A composite index on the filter columns pushes the predicate into the index layer. Pre-aggregation in a materialized view is worth considering at this scale.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What happens to your result if batch_jobs.ended contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on ended.)_
- What is the difference between filtering in WHERE versus HAVING for this query against cloud_costs? _(Tests whether the candidate understands pre-aggregation vs post-aggregation filtering.)_
- With millions of distinct values in cloud_costs.cost_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like cost_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/over_budget_services)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.