# Provider Spend Variance Between Halves

> Two time windows. Did the cloud bill go up or down?

Canonical URL: <https://datadriven.io/problems/provider_spend_variance_between_halves>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Compare cloud spend between the first half (Jan through Jun) and second half (Jul through Dec) of 2026. For each provider, compute the total amount in each half, then show the difference (second half minus first half). Rank providers from biggest increase to smallest. Return the provider and amount difference.

## Worked solution and explanation

### Why this problem exists in real interviews

This cloud cost problem uses the `cloud_costs` table to evaluate conditional aggregation via CASE. Candidates must also demonstrate date extraction for time bucketing. Watch how the `bill_date` and `provider` columns interact in the grouping and filtering logic.

---

### Break down the requirements

#### Step 1: Split the year into halves

Use `CASE WHEN month BETWEEN 1 AND 6 THEN 'H1' ELSE 'H2' END` after extracting the month from `bill_date`.

#### Step 2: Sum per provider and half

`GROUP BY provider` with conditional `SUM` for each half.

#### Step 3: Compute and rank the difference

Subtract H1 from H2 and `ORDER BY difference DESC`.

---

### The solution

**Case pivot for provider spend variance between halves**

```sql
SELECT
    provider,
    SUM(CASE WHEN CAST(strftime('%m', bill_date) AS INTEGER) BETWEEN 7 AND 12 THEN amount ELSE 0 END)
    - SUM(CASE WHEN CAST(strftime('%m', bill_date) AS INTEGER) BETWEEN 1 AND 6 THEN amount ELSE 0 END) AS spend_difference
FROM cloud_costs
WHERE strftime('%Y', bill_date) = '2026'
GROUP BY provider
ORDER BY spend_difference DESC
```

> **Cost Analysis**
>
> With `cloud_costs` (12,000,000 rows), the full scan reads significant data. A composite index on the filter columns pushes the predicate into the index layer. Pre-aggregation in a materialized view is worth considering at this scale.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What would happen to your result if `cloud_costs.amount` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `amount` and uses DISTINCT or deduplication where needed.)_
- With 2,000,000 distinct values in `cloud_costs.amount`, how would a composite index on the GROUP BY columns change the execution plan? _(Probes understanding of how cardinality in `amount` affects grouping and sort operations.)_
- If `cloud_costs` contained late-arriving rows that were inserted after your query ran, how would you design an incremental update instead of re-aggregating? _(Tests understanding of incremental aggregation patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/provider_spend_variance_between_halves)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.