# Region With Most Nodes

> Which region hosts the most?

Canonical URL: <https://datadriven.io/problems/region_with_most_nodes>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Capacity planning wants to know which region has the densest infrastructure footprint. Show the single region with the most nodes alongside its count.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets grouped COUNT aggregation across the `infra_nodes` table. You need to work with the `region` column to satisfy the requirements.

---

### Break down the requirements

#### Step 1: Aggregate by `region`

`GROUP BY region` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 2: Order and limit the output

`ORDER BY` with `LIMIT` returns only the top result. The sort must be deterministic; add a tiebreaker column if needed.

---

### The solution

**Aggregate by `region` to find region with most nodes**

```sql
SELECT region, COUNT(*) AS node_count
FROM infra_nodes
GROUP BY region
ORDER BY node_count DESC
LIMIT 1
```

> **Cost Analysis**
>
> With ~10,000 rows, the GROUP BY reduces the working set before any downstream operations. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether the query returns exactly the columns and ordering the prompt specifies; how quickly you identify the core operation and write clean, minimal code.

> **Common Pitfall**
>
> Using LIMIT without ORDER BY returns an arbitrary subset. Always pair LIMIT with a deterministic ORDER BY.

---

## Common follow-up questions

- What would happen to your result if `infra_nodes.hostname` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `hostname` and uses DISTINCT or deduplication where needed.)_
- The `node_type` column in `infra_nodes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `infra_nodes.node_type` and its impact on distributed query performance.)_
- `infra_nodes.status` only has 3 distinct values. If a new category were added, would your query automatically include it? _(Tests whether the query hard-codes values or dynamically adapts to `status` changes.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/region_with_most_nodes)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.