# Mid-CPU Nodes

> Not the heaviest. Not the lightest. The middle.

Canonical URL: <https://datadriven.io/problems/mid_cpu_nodes>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Before the holiday traffic freeze, the capacity team wants to identify nodes with headroom to absorb more load, defined as those running between 40 and 80 percent CPU inclusive. Pull all fields for every qualifying node.

## Worked solution and explanation

### Why this problem exists in real interviews

Extracting insights from infra_nodes.hostname grouped by region via query construction is the central task. It is used as a fundamentals check to test whether you pick the right aggregation function and partition boundary on the first attempt.

---

### Break down the requirements

#### Step 1: Read from `infra_nodes`

The query targets `infra_nodes` with 7 columns. Identify which columns are needed for the output.

#### Step 2: Filter to the target rows

Use `BETWEEN` in the `WHERE` clause to select the target range. This is both readable and optimizable by the query planner.

#### Step 3: Return the result set

Select the required columns with any necessary aliasing or formatting.

---

### The solution

**BETWEEN range for capacity headroom**

```sql
SELECT node_id, hostname, region, node_type, cpu_pct, mem_pct, status
FROM infra_nodes
WHERE cpu_pct BETWEEN 40 AND 80
```

> **Cost Analysis**
>
> The query scans 8K rows from `infra_nodes`. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Comparing dates stored as TEXT without casting can produce lexicographic instead of chronological ordering. Always confirm the column type.

---

## Common follow-up questions

- If infra_nodes.node_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in node_id.)_
- How would you verify that your aggregation on infra_nodes.node_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- What index would you add to infra_nodes to avoid a full table scan when filtering or sorting by node_id? _(Tests practical indexing decisions for numeric filter columns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/mid_cpu_nodes)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.