# Regions With 5+ Nodes

> Regions with five or more nodes.

Canonical URL: <https://datadriven.io/problems/regions_with_5_nodes>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The infrastructure team is enforcing a minimum cluster size policy: any region with fewer than 5 running nodes needs additional provisioning. Show the regions that already meet the threshold.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `infra_nodes` table, this challenge probes your ability to apply HAVING for post-aggregation filtering in a infrastructure setting. Correctly referencing the `region` column is essential to a working solution.

---

### Break down the requirements

#### Step 1: Apply the WHERE filter

Filter rows before any aggregation. This ensures only qualifying data enters the computation, keeping the result correct and the scan minimal.

#### Step 2: Aggregate by `region`

`GROUP BY region` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Filter groups with HAVING

HAVING applies after GROUP BY, filtering out groups that do not meet the threshold. This cannot be done in WHERE because the aggregate has not been computed yet.

---

### The solution

**Having filter for regions with 5+ nodes**

```sql
SELECT region, COUNT(*) AS node_count
FROM infra_nodes
WHERE LOWER(status) = 'running'
GROUP BY region
HAVING COUNT(*) >= 5
```

> **Cost Analysis**
>
> With ~8,000 rows, the GROUP BY reduces the working set before any downstream operations. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you use HAVING (not WHERE) to filter after aggregation.

> **Common Pitfall**
>
> Putting the aggregate condition in WHERE instead of HAVING causes a syntax error. WHERE runs before GROUP BY; HAVING runs after.

---

## Common follow-up questions

- What would happen to your result if `infra_nodes.node_type` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `node_type` and uses DISTINCT or deduplication where needed.)_
- The `node_type` column in `infra_nodes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `infra_nodes.node_type` and its impact on distributed query performance.)_
- If the HAVING threshold in your query changed from a fixed number to a percentile, how would you restructure the query? _(Tests ability to replace static HAVING filters with dynamic subquery-based thresholds.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/regions_with_5_nodes)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.