# Top Regions by High CPU Nodes

> Five regions with the hottest CPUs.

Canonical URL: <https://datadriven.io/problems/top_regions_by_high_cpu_nodes>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Capacity planning wants to know which regions have the most overloaded nodes. Return the top 5 regions by count of nodes running above 90% CPU. If regions are tied, they share the same rank without gaps. Include all regions within the top 5 ranks with their high-CPU node count.

## Worked solution and explanation

### Why this problem exists in real interviews

Using `infra_nodes`, this tests filtering to the top rows after aggregation with proper grain management. Strong candidates immediately identify the grouping key and metric column before writing any window function.

---

### Break down the requirements

#### Step 1: Aggregate per region

`GROUP BY region` with the appropriate aggregate function produces one summary row per group from the `infra_nodes` table.

#### Step 2: Rank the results

`ORDER BY` the aggregate descending with `LIMIT` to surface the top entries.

---

### The solution

**Count high-CPU nodes per region from infra_nodes, dense-rank top 5**

```sql
SELECT
    region,
    SUM(cpu_pct) AS total_cpu_pct
FROM infra_nodes
GROUP BY region
ORDER BY total_cpu_pct DESC
LIMIT 10
```

> **Cost Analysis**
>
> The GROUP BY reduces the 12K-row `infra_nodes` table to the number of distinct `region` values. A covering index on `(region, cpu_pct)` enables an index-only aggregate scan.

> **Interviewers Watch For**
>
> Interviewers verify you aggregate before sorting. Sorting raw rows gives per-row values, not group totals. The correct grain is one row per `region`.

> **Common Pitfall**
>
> Using the wrong aggregate function. `SUM` gives totals, `COUNT` gives volume, `AVG` gives rates. Read the prompt to determine which metric is needed.

---

## Common follow-up questions

- If cpu_pct is stored as a percentage (0-100) vs a fraction (0-1), how does the > 90 threshold change? _(Tests unit awareness; the candidate should confirm the column's scale from sample data.)_
- Why use DENSE_RANK instead of LIMIT 5 here? _(Tests that LIMIT 5 truncates ties, while DENSE_RANK <= 5 includes all regions sharing the fifth rank.)_
- If some nodes have NULL cpu_pct, does your WHERE cpu_pct > 90 exclude them automatically? _(Tests NULL comparison; any comparison with NULL yields NULL (falsy), so NULLs are excluded by WHERE.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_regions_by_high_cpu_nodes)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.