# Multi-Host Regions by Node Type

> Some regions are quietly building empires.

Canonical URL: <https://datadriven.io/problems/multi_host_regions_by_node_type>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The capacity planning team is assessing regional infrastructure density. Which regions have more than 2 unique hostnames across compute, storage, network, and GPU node types?

## Worked solution and explanation

### Why this problem exists in real interviews

This probes your ability to combine `GROUP BY`, `COUNT(DISTINCT ...)`, and `HAVING` in a single query. Interviewers want to see whether you can aggregate at the right grain (region plus node type) and then filter on a distinct count, which requires understanding that `HAVING` operates on grouped results.

---

### Break down the requirements

#### Step 1: Filter to the four target node types

Use `WHERE node_type IN ('compute', 'storage', 'network', 'GPU')` to restrict the query to the four categories mentioned in the prompt.

#### Step 2: Group by region and node type

`GROUP BY region, node_type` collapses the 10K rows into roughly 48 groups (12 regions x 4 types). This is the correct grain for counting distinct hostnames per region-type combination.

#### Step 3: Count distinct hostnames per group

`COUNT(DISTINCT hostname)` within each group tells you how many unique hosts serve that node type in that region.

#### Step 4: Filter groups with HAVING

`HAVING COUNT(DISTINCT hostname) > 2` retains only region-type combinations exceeding the threshold. Return `region` and `node_type`.

---

### The solution

**Filter, group, and count distinct**

```sql
SELECT region, node_type
FROM infra_nodes
WHERE node_type IN ('compute', 'storage', 'network', 'GPU')
GROUP BY region, node_type
HAVING COUNT(DISTINCT hostname) > 2
```

> **Cost Analysis**
>
> With `infra_nodes` (10,000 rows), this query scans a small dataset. No indexing is needed for this volume. At production scale, an index on the primary filter column would improve performance.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What happens if the table volume increases by 10x? _(Tests whether the candidate considers scan cost and indexing strategies at scale.)_
- How would you modify this query to handle ties differently? _(Tests understanding of RANK vs DENSE_RANK vs ROW_NUMBER.)_
- Could you rewrite this using a CTE for readability? _(Tests CTE fluency and whether the candidate can restructure without changing semantics.)_
- What if the grouping column had very high cardinality (millions of distinct values)? _(Tests awareness of GROUP BY memory pressure and spill-to-disk behavior.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/multi_host_regions_by_node_type)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.