# Top 10 CPU-Heavy Nodes

> The ten hungriest nodes.

Canonical URL: <https://datadriven.io/problems/top_10_cpu_heavy_nodes>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Our infrastructure metrics table tracks CPU usage per node. Show the ten nodes consuming the most CPU, with their node ID, hostname, and CPU percentage, sorted from highest utilization to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

Top-N over a single snapshot is one of the highest-frequency screening questions because it bundles ORDER BY, multi-key tie-breaking, and LIMIT in one query. Reading the schema is also part of the test: `infra_nodes.cpu_pct` is already the per-node value (one row per node), so candidates who reach for AVG and GROUP BY signal they did not look at the table grain.

---

### Break down the requirements

#### Step 1: Confirm the grain is already per-node

`infra_nodes` is one row per node (12,000 rows, `node_id` cardinality 12,000). No aggregation needed. Just project `node_id, hostname, cpu_pct`.

#### Step 2: Sort with explicit tie-break

`ORDER BY cpu_pct DESC` puts the hungriest CPU first. Add `, node_id ASC` as the deterministic tie-breaker. Without the second key, ties at the cutoff produce non-deterministic top-10 sets across runs.

#### Step 3: Cap at 10 rows

`LIMIT 10` cuts the sorted result to exactly the 10 requested rows.

---

### The solution

**Sort by cpu_pct, tie-break by node_id, take 10**

```sql
SELECT node_id, hostname, cpu_pct FROM infra_nodes ORDER BY cpu_pct DESC, node_id ASC LIMIT 10
```

> **Cost Analysis**
>
> Across 12,000 rows the planner does a single sequential scan and a top-K sort that holds at most 10 rows in a heap. An index on `(cpu_pct DESC, node_id ASC)` would let the planner walk the index and stop after 10 reads, but at this row count the sequential scan is already trivial.

> **Interviewers Watch For**
>
> Interviewers watch for unnecessary GROUP BY (the table is already at the right grain), missing tie-breaker (`cpu_pct` is DOUBLE with cardinality 10K so collisions are rare but not zero), and over-projection (selecting `region` or `status` when the prompt asked for three columns).

> **Common Pitfall**
>
> Adding `WHERE status = 'active'` or any filter the prompt did not request. The prompt says 'the 10 nodes with the highest cpu_pct' with no status restriction. Filtering hides rows that should compete for the top 10 and the grader rejects the result.

---

## Common follow-up questions

- How would you return the top 10 CPU-heavy nodes per `region` instead of overall? _(Tests window-function ranking. `ROW_NUMBER() OVER (PARTITION BY region ORDER BY cpu_pct DESC, node_id ASC)` in a CTE with an outer `WHERE rn <= 10` filter delivers per-region top-10s.)_
- What if the prompt asked for the top 10 by combined CPU+memory pressure rather than CPU alone? _(Tests composite ranking. The candidate should propose either weighting (e.g. `cpu_pct + mem_pct` in the ORDER BY) or a normalized blend, and call out that the two metrics live on different scales if they are not both percentages.)_
- If you needed to exclude nodes whose `status` is anything other than 'active', where in the query does that filter go? _(Tests filter ordering. The candidate should add `WHERE status = 'active'` (or whichever value indicates 'in service') BEFORE the ORDER BY so retired nodes never compete for the top 10.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_10_cpu_heavy_nodes)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.