# Services Hitting Cost Threshold

> The budget line is here. How many crossed it?

Canonical URL: <https://datadriven.io/problems/services_hitting_cost_threshold>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

For each month, calculate what percentage of services reached at least $100 in monthly cloud spend. Exclude rows where bill_date is null (cancelled billing entries). Return the month and percentage hitting the threshold.

## Worked solution and explanation

### Why this problem exists in real interviews

Interviewers use this cloud cost scenario to test conditional aggregation with `CASE` against the `cloud_costs` table. It checks two things about `bill_date`: that you drop rows where it is NULL, and that you extract the month from it to bucket spend over time. It also checks that you total spend per service before applying the $100 threshold.

> **Trick to Solving**
>
> When the prompt asks what share of a group meets a condition (here, services with at least $100 of spend in a month), conditional aggregation computes the numerator and the denominator in a single pass.
>
> 1. Spot the split: rows that meet the condition versus rows that do not
> 2. Use `SUM(CASE WHEN condition THEN 1 ELSE 0 END)` to count the rows that meet it
> 3. Group by the common dimension and divide by `COUNT(*)`

---

### Break down the requirements

#### Step 1: Isolate the intermediate result in a CTE

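The `monthly_svc` CTE totals `amount` per service per month, producing one row per (`svc_name`, month) pair. The threshold applies to a service's monthly total, not to individual bills, so this aggregation has to happen before the outer query compares anything to $100. Keeping it in a CTE also keeps each layer focused on a single task.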
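To see why the per-service total has to come first, here is a minimal sketch that runs the CTE's `SELECT` on its own using Python's built-in `sqlite3`. The sample rows are hypothetical; only the table and column names come from the solution query.

```python
import sqlite3

# Hypothetical sample rows: (svc_name, bill_date, amount).
rows = [
    ("api", "2024-01-05", 60.0),
    ("api", "2024-01-20", 50.0),
    ("db", "2024-01-11", 40.0),
    ("api", "2024-02-03", 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cloud_costs (svc_name TEXT, bill_date TEXT, amount REAL)")
conn.executemany("INSERT INTO cloud_costs VALUES (?, ?, ?)", rows)

# One row per (service, month) with that service's total spend for the month.
monthly_svc = conn.execute(
    """
    SELECT svc_name, strftime('%Y-%m', bill_date) AS month, SUM(amount) AS total_spend
    FROM cloud_costs
    GROUP BY svc_name, strftime('%Y-%m', bill_date)
    ORDER BY month, svc_name
    """
).fetchall()

for row in monthly_svc:
    print(row)
```

In this sample, neither January bill for `api` reaches $100 on its own, but their sum does, so comparing individual rows to the threshold would miss it.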

#### Step 2: Filter out null values

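Exclude rows where `bill_date` is NULL, which the problem defines as cancelled billing entries. Without this filter, the month expression evaluates to NULL for those rows and they collect into a spurious NULL-month group that shows up in the output.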
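The effect of the filter can be checked with a small sketch, again using `sqlite3` and hypothetical rows. It groups by month with and without the `IS NOT NULL` predicate and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cloud_costs (svc_name TEXT, bill_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO cloud_costs VALUES (?, ?, ?)",
    [
        ("api", "2024-01-05", 120.0),
        ("db", None, 500.0),  # hypothetical cancelled entry: no bill_date
    ],
)

base = "SELECT strftime('%Y-%m', bill_date) AS month, COUNT(*) FROM cloud_costs"
tail = " GROUP BY strftime('%Y-%m', bill_date) ORDER BY month"

# Without the filter, the NULL bill_date forms its own group.
unfiltered = conn.execute(base + tail).fetchall()
# With the filter, only real months remain.
filtered = conn.execute(base + " WHERE bill_date IS NOT NULL" + tail).fetchall()

print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

In SQLite the unfiltered query returns an extra row whose month is `None`; other engines may order or display that group differently.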

#### Step 3: Use conditional aggregation with CASE

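A `CASE` expression inside `SUM` counts the services whose `total_spend` is at least 100, and `COUNT(*)` counts every service in the month, so the numerator and the denominator both come from a single pass over the data. The cast and the `100.0` multiplier force floating-point division; with integer operands, many engines truncate the ratio to 0.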
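As an illustration of the counting and the division, the sketch below applies the same `CASE` expression to a hypothetical table shaped like the `monthly_svc` CTE. The `int_ratio` column is included only to show what integer division does in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-service monthly totals, shaped like the monthly_svc CTE.
conn.execute("CREATE TABLE monthly_svc (svc_name TEXT, month TEXT, total_spend REAL)")
conn.executemany(
    "INSERT INTO monthly_svc VALUES (?, ?, ?)",
    [
        ("api", "2024-01", 110.0),
        ("db", "2024-01", 40.0),
        ("cache", "2024-01", 100.0),   # exactly at the threshold counts as a hit
        ("queue", "2024-01", 99.99),
    ],
)

hits, total, pct, int_ratio = conn.execute(
    """
    SELECT
        SUM(CASE WHEN total_spend >= 100 THEN 1 ELSE 0 END) AS hits,
        COUNT(*) AS total,
        SUM(CASE WHEN total_spend >= 100 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct,
        SUM(CASE WHEN total_spend >= 100 THEN 1 ELSE 0 END) / COUNT(*) AS int_ratio
    FROM monthly_svc
    WHERE month = '2024-01'
    """
).fetchone()

print(hits, total, pct, int_ratio)
```

Two of the four services qualify, so `pct` is 50.0, while the all-integer `int_ratio` truncates to 0 in SQLite.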

---

### The solution

**Case pivot for services hitting cost threshold**

```sql
-- Total spend per service per month, skipping cancelled entries (NULL bill_date)
WITH monthly_svc AS (
    SELECT
        svc_name,
        strftime('%Y-%m', bill_date) AS month,
        SUM(amount) AS total_spend
    FROM cloud_costs
    WHERE bill_date IS NOT NULL
    GROUP BY svc_name, strftime('%Y-%m', bill_date)
)
-- Share of services in each month whose total reached the $100 threshold
SELECT
    month,
    CAST(SUM(CASE WHEN total_spend >= 100 THEN 1 ELSE 0 END) AS DOUBLE) * 100.0
        / COUNT(*) AS pct_hitting_threshold
FROM monthly_svc
GROUP BY month
ORDER BY month
```
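For a quick end-to-end check, the sketch below runs the solution query against a handful of hypothetical rows with Python's `sqlite3`. The query works there because `strftime('%Y-%m', ...)` is SQLite-style syntax; in Postgres, which has no `strftime`, `to_char(bill_date, 'YYYY-MM')` and `DOUBLE PRECISION` would be the usual substitutes.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the solution.
rows = [
    ("api", "2024-01-05", 60.0),
    ("api", "2024-01-20", 50.0),    # api totals 110 in January
    ("db", "2024-01-11", 40.0),
    ("api", "2024-02-03", 30.0),
    ("db", "2024-02-09", 100.0),    # exactly at the threshold
    ("cache", "2024-02-15", 250.0),
    ("queue", None, 999.0),         # cancelled entry, must be ignored
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cloud_costs (svc_name TEXT, bill_date TEXT, amount REAL)")
conn.executemany("INSERT INTO cloud_costs VALUES (?, ?, ?)", rows)

result = conn.execute(
    """
    WITH monthly_svc AS (
        SELECT svc_name, strftime('%Y-%m', bill_date) AS month, SUM(amount) AS total_spend
        FROM cloud_costs
        WHERE bill_date IS NOT NULL
        GROUP BY svc_name, strftime('%Y-%m', bill_date)
    )
    SELECT month,
           CAST(SUM(CASE WHEN total_spend >= 100 THEN 1 ELSE 0 END) AS DOUBLE) * 100.0
               / COUNT(*) AS pct_hitting_threshold
    FROM monthly_svc
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for month, pct in result:
    print(month, round(pct, 2))
```

With this data, one of two services qualifies in January and two of three in February, and the cancelled `queue` row does not appear in either month.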

> **Cost Analysis**
>
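> With ~18M rows, the `GROUP BY` inside the CTE collapses the input to one row per service per month, so the outer query typically works on a much smaller set. Whether the CTE is materialized or inlined depends on the engine; Postgres 12 and later, for example, inlines a non-recursive CTE that is referenced only once. The query has no join and reads every row with a non-NULL `bill_date`, so an index on `bill_date` alone is unlikely to avoid a full scan. A covering index on (`svc_name`, `bill_date`, `amount`) may let the engine answer from the index instead of the table.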

> **Interviewers Watch For**
>
> - Whether you decompose the problem into named, testable stages rather than nesting everything
> - How you handle NULL values, and whether you account for them in filters and aggregations
> - Whether you can pivot data with conditional aggregation in a single pass instead of multiple queries

> **Common Pitfall**
>
> Forgetting to filter NULLs creates phantom groups or inflated counts. Always check `null_fraction` in the schema before assuming columns are clean.

---

## Common follow-up questions

- If `bill_date` in `cloud_costs` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `cloud_costs.bill_date`.)_
- `cloud_costs.amount` has roughly 3,000,000 distinct values. What index strategy would you use to avoid a full scan on `cloud_costs`? _(Tests indexing knowledge specific to the high-cardinality `amount` column in `cloud_costs`.)_
- If `cloud_costs` contained late-arriving rows that were inserted after your query ran, how would you design an incremental update instead of re-aggregating? _(Tests understanding of incremental aggregation patterns.)_
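One possible direction for the incremental-update question is sketched below: keep running per-service monthly totals in a summary table and fold each new batch into it with an upsert. The `monthly_svc_totals` table and the `apply_batch` helper are hypothetical names for this sketch, and it assumes every source row is delivered exactly once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical summary table; the name and layout are assumptions for this sketch.
conn.execute(
    """
    CREATE TABLE monthly_svc_totals (
        svc_name TEXT NOT NULL,
        month TEXT NOT NULL,
        total_spend REAL NOT NULL,
        PRIMARY KEY (svc_name, month)
    )
    """
)

def apply_batch(batch):
    """Fold a batch of (svc_name, bill_date, amount) rows into the running totals."""
    conn.executemany(
        """
        INSERT INTO monthly_svc_totals (svc_name, month, total_spend)
        VALUES (?, strftime('%Y-%m', ?), ?)
        ON CONFLICT (svc_name, month)
        DO UPDATE SET total_spend = total_spend + excluded.total_spend
        """,
        [row for row in batch if row[1] is not None],  # skip cancelled entries
    )

apply_batch([("api", "2024-01-05", 60.0), ("db", "2024-01-11", 40.0)])
apply_batch([("api", "2024-01-20", 50.0)])  # late-arriving row for January

totals = conn.execute(
    "SELECT svc_name, month, total_spend FROM monthly_svc_totals ORDER BY svc_name"
).fetchall()
print(totals)
```

The percentage query can then read from the summary table instead of re-aggregating `cloud_costs`; handling duplicates or corrections to already-applied rows would need extra bookkeeping that this sketch leaves out.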

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.