# Bottom 2% Services by Spend

> The bottom 2% of spenders. Who are they?

Canonical URL: <https://datadriven.io/problems/bottom_2_services_by_spend>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

FinOps is hunting for zombie services that cost almost nothing but still consume billing line items. Identify the bottom 2% of services by total cloud spend in May 2026 by bucketing them into 50 equal-sized percentile groups and returning those in the lowest bucket. Show the service name and total spend, from lowest up.

## Worked solution and explanation

### Why this problem exists in real interviews

FinOps interviewers hand you this to see whether you treat 'bottom 2%' as a precise statistical statement or a loose phrase. The probe: do you stop and ask whether they want equal-sized buckets (NTILE), equal-width value ranges (PERCENT_RANK), or strict tail count (ROW_NUMBER / total). Pick the wrong tool and your zombie-service list quietly excludes real zombies.

---

### Break down the requirements

#### Step 1: Filter to May of the target year

Use `strftime('%Y', bill_date)` and `strftime('%m', bill_date)` in WHERE. Note for the interviewer that this defeats the `bill_date` partition prune on a 20M-row table; in prod you'd use a range predicate.

#### Step 2: Roll up to service grain

`SUM(amount) AS total_spend GROUP BY svc_name`. The grain shifts from billing line item to service, so any later ranking is over service-level totals, not raw rows.

#### Step 3: Bucket into 50 equal-sized groups

`NTILE(50) OVER (ORDER BY total_spend)` in a second CTE. NTILE splits rows evenly across 50 buckets, so bucket 1 holds the cheapest ceil(N/50) services regardless of how compressed their spend values are.

#### Step 4: Return bucket 1 ordered ascending

`WHERE percentile_bucket = 1 ORDER BY total_spend`. Cheapest first matches the FinOps reading order: kill the smallest line items first.

---

### The solution

**BOTTOM 2 PERCENT BY NTILE**

```sql
WITH monthly_totals AS (
  SELECT svc_name, SUM(amount) AS total_spend
  FROM cloud_costs
  WHERE strftime('%Y', bill_date) = '2026'
    AND strftime('%m', bill_date) = '05'
  GROUP BY svc_name
),
ranked AS (
  SELECT svc_name,
         total_spend,
         NTILE(50) OVER (ORDER BY total_spend) AS percentile_bucket
  FROM monthly_totals
)
SELECT svc_name, total_spend
FROM ranked
WHERE percentile_bucket = 1
ORDER BY total_spend;
```

> **Cost Analysis**
>
> `cloud_costs` is 20M rows partitioned by `bill_date`. Wrapping `bill_date` in `strftime` blocks partition pruning, forcing a full scan. The aggregation collapses 20M to a few thousand `svc_name` rows, then NTILE sorts that small set cheaply. Real fix: `bill_date >= DATE('2026-05-01') AND bill_date < DATE('2026-06-01')`.

> **Interviewers Watch For**
>
> Ask out loud: 'When you say bottom 2%, do you mean the 2% of services with the lowest spend (NTILE), or services whose spend is in the bottom 2% of the value range (PERCENT_RANK)?' The prompt says bucket into 50 groups, so NTILE wins, but naming the alternative is the signal.

> **Common Pitfall**
>
> Reaching for `total_spend < (SELECT 0.02 * SUM(total_spend) ...)`. That filters by dollar threshold, not by service count, and it lumps in services with normal spend if your distribution is long-tailed. The prompt explicitly asks for percentile groups; NTILE is the contract.

> **NTILE distributes rows, not values**
>
> With 487 services and NTILE(50), bucket 1 holds 10 services (`ceil(487/50)`), not 'services whose spend is below the 2nd percentile of spend amount'. If two services tie on the boundary, NTILE assigns them deterministically by ORDER BY; ties don't expand a bucket.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you rewrite this with PERCENT_RANK and when would you prefer it? _(Tests whether you can distinguish quantile-by-count vs quantile-by-value semantics.)_
- What changes if a service has zero spend in May but still appears in `cloud_costs` with `amount = 0`? _(Probes whether you understand that GROUP BY keeps zero-sum services and they will dominate bucket 1.)_
- Make this incremental: only recompute buckets for services whose May spend changed today. _(Pushes you toward materialized monthly_totals and a watermark on `bill_date`, the FinOps production pattern.)_
- How would you rewrite the date filter so the partition prune actually fires? _(Confirms you know function-on-column kills partition elimination.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/bottom_2_services_by_spend)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.