# Average Node Utilization

> CPU and memory, region by region.

Canonical URL: <https://datadriven.io/problems/average_node_utilization>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The quarterly capacity review is tomorrow and the VP wants a utilization heatmap by region and node type. Show average CPU and memory utilization for each combination so the team can spot over- and under-provisioned segments.

## Worked solution and explanation

### What this is really asking

`infra_nodes` carries `cpu_pct` and `mem_pct` per host, and the VP wants one cell per (region, node_type). The whole problem is grouping by both keys at once, not separately.

---

### Break down the requirements

#### Step 1: Group by the composite key

GROUP BY region, node_type creates one row per cell of the heatmap. Grouping by either column alone collapses the other dimension and the VP loses the segment view.

#### Step 2: Two averages, one pass

AVG(cpu_pct) AS avg_cpu_pct and AVG(mem_pct) AS avg_mem_pct in the same SELECT. Both aggregates share the same group, so this is a single scan over 10k rows.

---

### The solution

**AVG UTILIZATION BY REGION AND TYPE**

```sql
SELECT
    region,
    node_type,
    AVG(cpu_pct) AS avg_cpu_pct,
    AVG(mem_pct) AS avg_mem_pct
FROM infra_nodes
GROUP BY region, node_type
```

> **Cost Analysis**
>
> 10k rows is trivial: one scan, hash aggregate keyed on (region, node_type). Output rows equal the number of populated cells. No filter, no join, no window function. The plan should be a single GroupAggregate node.

> **Interviewers Watch For**
>
> Whether you grouped by both keys or only one. Grouping by region alone gives a regional average that hides over-provisioned node types inside healthy regions, which is exactly what the VP is trying to find.

> **Common Pitfall**
>
> Filtering `WHERE status = 'active'` without being asked. Decommissioned nodes still count toward provisioned capacity, so dropping them quietly tilts the heatmap. Confirm scope before adding a filter the prompt did not request.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you flag cells that are over-provisioned? _(Wrap the query in a CTE and filter where avg_cpu_pct < 20 AND avg_mem_pct < 20, or threshold against a target utilization band.)_
- What if some regions have only one node of a type? _(AVG over one row equals that row. Add COUNT(*) to the projection so the VP can see sample size before trusting outliers.)_
- How would you compare this quarter to last quarter? _(You need a snapshot column like captured_at. Self-join the aggregated CTE on (region, node_type) across two periods and diff the averages.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_node_utilization)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.