# Average Node CPU by Region

> Average infrastructure node CPU usage broken down by region

Canonical URL: <https://datadriven.io/problems/average_node_cpu_by_region>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Capacity planning is evaluating whether to add nodes in specific regions, but they only care about regions that are actually running workloads. Using the pod assignment records, show the average CPU utilization of nodes in each region where at least one pod has been scheduled.

## Worked solution and explanation

### Why this problem exists in real interviews

The `pod_assignments`, `infra_nodes` schema makes this a clean test of grouped aggregation. Columns like `svc_name`, `assigned_at`, `status` introduce enough ambiguity that only candidates who clarify assumptions produce correct results.

---

### Break down the requirements

#### Step 1: Join `pod_assignments` with `infra_nodes`

Use `JOIN` on `node_id` to link fact rows to the dimension table. INNER JOIN keeps only matched rows.

#### Step 2: Group by `hostname`

`GROUP BY b.hostname` produces one row per distinct value of the grouping column.

#### Step 3: Aggregate with AVG

`AVG(cpu_pct)` computes the metric at the grouped grain.

#### Step 4: Order by the metric

Sort descending to surface the top values first.

---

### The solution

**Group-aggregate for average node cpu region**

```sql
SELECT
    b.hostname,
    AVG(cpu_pct) AS avg_cpu_pct
FROM pod_assignments a
JOIN infra_nodes b ON a.node_id = b.node_id
GROUP BY b.hostname
ORDER BY avg_cpu_pct DESC
```

> **Cost Analysis**
>
> The main table has 500K rows. The GROUP BY reduces the row count early, keeping downstream operations cheap. The smaller dimension table keeps the join selective.

> **Interviewers Watch For**
>
> Strong candidates state the correct `GROUP BY` grain before writing any SQL, showing they think about the output shape first. Join type selection (INNER vs LEFT) reveals whether the candidate read the requirements carefully.

> **Common Pitfall**
>
> Selecting a non-aggregated column without including it in `GROUP BY` is the most common error. Some engines reject it; others silently return arbitrary values.

---

## Common follow-up questions

- What happens to your results if `svc_name` in `pod_assignments` contains trailing whitespace or mixed casing? _(Tests awareness of text normalization issues that silently fragment GROUP BY results.)_
- If `pod_assignments` and `infra_nodes` have a one-to-many relationship, how does that affect the COUNT in your GROUP BY? _(Tests understanding of fan-out: joining before grouping can inflate counts.)_
- The `node_id` column in `pod_assignments` has a zipf distribution, meaning a few values dominate. How does that skew affect your query plan and parallelism? _(Tests understanding of data skew: the optimizer may choose a bad plan when histogram statistics are stale.)_
- Could you express this same logic as a single query without CTEs or subqueries? What readability trade-off does that introduce? _(Tests whether the candidate can flatten nested logic and understands when decomposition aids maintainability.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_node_cpu_by_region)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.