# Cost Efficiency Variance

> Cost efficiency varies. By how much?

Canonical URL: <https://datadriven.io/problems/cost_efficiency_variance>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

For each billing entry, compute the cost-per-service ratio (amount divided by the number of services in that region). Then find the monthly average of these ratios. For each year-month, show the average ratio, the monthly average, and the average absolute difference between individual ratios and that month's average.

## Worked solution and explanation

### Why this problem exists in real interviews

Cloud bills are messy and finance asks for variance metrics that sound simple until you discover the grain is per line item, not per service. The shape here mimics a FinOps dashboard query: normalize cost by region service count, then quantify month-to-month dispersion. Get the denominator wrong and every downstream chargeback is off.

---

### Break down the requirements

#### Step 1: Build the per-region denominator

Count distinct `svc_name` per `region`. This is a small lookup table, region cardinality is tiny compared to 18M rows. Use COUNT(DISTINCT) because `svc_name` repeats across many line items per region.

#### Step 2: Compute the per-row ratio

Join each billing row to its region's service count. Cast `amount` to REAL so integer division does not silently truncate the ratio to zero. Bucket the row into a year-month string for grouping.

#### Step 3: Find the monthly mean, then the deviation

Aggregate `cost_ratio` per `ym` to get the mean. Rejoin each row to its month's mean, take ABS of the difference, then average those absolute differences per month. That last average is mean absolute deviation.

---

### The solution

**MONTHLY MEAN ABSOLUTE DEVIATION OF COST RATIOS**

```sql
WITH svc_per_region AS (
  SELECT region, COUNT(DISTINCT svc_name) AS svc_count
  FROM cloud_costs
  GROUP BY region
),
with_ratio AS (
  SELECT cc.cost_id, cc.svc_name, cc.region, cc.amount, cc.bill_date,
         strftime('%Y-%m', cc.bill_date) AS ym,
         CAST(cc.amount AS REAL) / sr.svc_count AS cost_ratio
  FROM cloud_costs cc
  INNER JOIN svc_per_region sr ON cc.region = sr.region
),
monthly_avg AS (
  SELECT ym, AVG(cost_ratio) AS avg_ratio
  FROM with_ratio
  GROUP BY ym
),
diffs AS (
  SELECT wr.ym,
         wr.cost_ratio AS actual_ratio,
         ma.avg_ratio AS monthly_average,
         ABS(wr.cost_ratio - ma.avg_ratio) AS abs_diff
  FROM with_ratio wr
  INNER JOIN monthly_avg ma ON wr.ym = ma.ym
)
SELECT ym,
       AVG(actual_ratio) AS actual_ratio,
       AVG(monthly_average) AS monthly_average,
       AVG(abs_diff) AS avg_abs_difference
FROM diffs
GROUP BY ym
ORDER BY ym
```

> **Cost Analysis**
>
> Two full passes over 18M rows: one for `svc_per_region`, one for `with_ratio`. The monthly_avg aggregation is cheap because it operates on the already-projected ratio CTE. Partitioning on `bill_date` does not help here since both passes scan globally; region-aware clustering would.

> **Interviewers Watch For**
>
> Whether you spot that `AVG(actual_ratio)` and `AVG(monthly_average)` produce the same number per `ym` group. That redundancy is fine if the spec asks for both columns, but flag it out loud so the reviewer knows you understand the math, not just the syntax.

> **Common Pitfall**
>
> Forgetting the CAST. `amount / svc_count` with both sides INTEGER returns 0 for any row where amount is less than svc_count, which on a per-line-item bill is most of them. The MAD would then collapse to noise around zero.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you switch this to use standard deviation instead of mean absolute deviation? _(Probes whether you know STDDEV_SAMP is a single aggregate and can replace the self-join on `monthly_avg`.)_
- If `svc_count` per region changes over time, how should the ratio handle that? _(Tests slowly-changing dimension thinking: should `svc_per_region` be partitioned by `ym` too?)_
- Could you compute monthly MAD with a window function and skip the self-join? _(Checks fluency with AVG() OVER (PARTITION BY ym) feeding ABS(cost_ratio - avg) in one pass.)_

> **MAD vs Variance**
>
> Mean absolute deviation is more robust to outliers than variance because squaring amplifies large gaps. Finance teams often prefer MAD on cost data precisely because a single misclassified mega-invoice would swamp a stddev metric.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/cost_efficiency_variance)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.