# Median Cloud Cost by Service

> The median cloud bill, service by service.

Canonical URL: <https://datadriven.io/problems/median_cloud_cost_by_service>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Compute the median cloud cost amount for each service. Show each service and its median amount, from highest median to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

Median is the canonical robust statistic for skewed distributions, and SQLite has no PERCENTILE_CONT. Interviewers use this prompt to check whether the candidate knows the portable rank pair pattern: ROW_NUMBER over the partition, COUNT over the partition, then average rows at positions (cnt+1)/2 and (cnt+2)/2. This integer arithmetic handles odd and even counts in one expression.

---

### Break down the requirements

#### Step 1: Tag every row with its rank and the partition size

ROW_NUMBER() OVER (PARTITION BY svc_name ORDER BY amount) gives a 1 indexed position. COUNT(*) OVER (PARTITION BY svc_name) attaches the partition total to every row so the rank pair formula has a denominator without a self join.

#### Step 2: The (cnt+1)/2 and (cnt+2)/2 trick

For odd cnt, both expressions equal the middle row. For even cnt, they pick the two middle rows. Integer division in SQLite rounds toward zero, which matches the convention. Picking BOTH positions and averaging handles the two cases uniformly.

#### Step 3: Average the picked rows per service

WHERE rn IN ((cnt+1)/2, (cnt+2)/2) keeps one or two rows per svc_name. AVG(amount) GROUP BY svc_name produces the median. ORDER BY median DESC then svc_name ASC for the deterministic tie break the prompt requires.

---

### The solution

**Rank pair median in pure SQLite**

```sql
WITH ranked AS (
  SELECT svc_name, amount,
         ROW_NUMBER() OVER (PARTITION BY svc_name ORDER BY amount) AS rn,
         COUNT(*) OVER (PARTITION BY svc_name) AS cnt
  FROM cloud_costs
)
SELECT svc_name, AVG(amount) AS median_amount
FROM ranked
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY svc_name
ORDER BY median_amount DESC, svc_name ASC
```

> **Cost Analysis**
>
> The window functions sort 18M rows by (svc_name, amount), which is the dominant cost. After windowing, the WHERE keeps at most 2 rows per service (400 services), so the final aggregation operates on 800 rows or fewer. If cloud_costs were partitioned on svc_name, the sort would be partition local and the plan would parallelize naturally.

> **Interviewers Watch For**
>
> Whether you reach for PERCENTILE_CONT (which fails on SQLite) or NTILE (which only approximates median), and whether you remember integer arithmetic semantics: (cnt+1)/2 with cnt=4 is 2 and (cnt+2)/2 is 3, picking rows 2 and 3 of 4. Strong candidates explain the formula before writing it.

> **Common Pitfall**
>
> Reaching for PERCENTILE_CONT(0.5) WITHIN GROUP works in Postgres but raises a syntax error in SQLite. Stick to the canonical (cnt+1)/2 and (cnt+2)/2 rank pair which is provably correct for all cnt >= 1 and runs on every engine that supports window functions.

---

## Common follow-up questions

- Walk through why (cnt+1)/2 and (cnt+2)/2 work for both odd and even counts. _(Tests algorithm understanding. For odd cnt=5, both are 3. For even cnt=4, they are 2 and 3. Integer division rounds down so the formula collapses to the middle row(s) cleanly.)_
- How would you compute the 90th percentile instead? _(Tests pattern generalization. Use rn equal to CEIL(0.9 times cnt) with the same windowing, or in SQLite the equivalent integer expression (cnt times 9 plus 9) divided by 10.)_
- Why not use NTILE(2) and pick the boundary? _(Tests awareness of NTILE semantics. NTILE distributes rows into N buckets but does not guarantee the boundary row is the median; for small partitions the result diverges from the rank pair median.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/median_cloud_cost_by_service)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.