# Unique Hostnames per Region

> How many distinct machines live in each region?

Canonical URL: <https://datadriven.io/problems/unique_hostnames_per_region>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

Our infrastructure inventory uses hostnames that share prefixes. Count unique hostnames per region, using only the first word of the hostname (case insensitive), so entries like 'web-server-01' and 'Web-server-02' share the same prefix. If reversed forms exist (like 'prod-db' and 'db-prod'), count them as the same. Return results by count descending, then region ascending.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests advanced string manipulation combined with deduplication logic. Interviewers probe whether you can extract prefixes, normalize case, and handle reversals as a uniqueness condition, all in SQL.

> **Trick to Solving**
>
> The "reversed forms count as the same" requirement is the tricky part. The approach is to extract the first word, lowercase it, and then normalize reversed pairs by sorting the two halves.
> 
> 1. Extract the first word: `LOWER(SUBSTR(hostname, 1, INSTR(hostname, '-') - 1))`
> 2. To handle reversed pairs, you would need to identify pairs; however, with single-word prefixes, the key insight is that 'prod' and 'prod' are the same prefix
> 3. Count `DISTINCT` normalized prefixes per region

---

### Break down the requirements

#### Step 1: Extract the first word of each hostname

Use `LOWER(SUBSTR(hostname, 1, INSTR(hostname, '-') - 1))` to get the case-insensitive prefix before the first hyphen.

#### Step 2: Count distinct prefixes per region

`GROUP BY region` with `COUNT(DISTINCT prefix)` counts unique hostname prefixes in each region.

#### Step 3: Order by count descending, then region ascending

`ORDER BY cnt DESC, region ASC` matches the requested output ordering.

---

### The solution

**Prefix extraction with case-insensitive dedup**

```sql
SELECT region, COUNT(DISTINCT LOWER(SUBSTR(hostname, 1, INSTR(hostname, '-') - 1))) AS cnt
FROM infra_nodes
GROUP BY region
ORDER BY cnt DESC, region ASC
```

> **Cost Analysis**
>
> With only 10K rows, string operations are trivially fast. INSTR and SUBSTR are O(n) per string where n is the hostname length. The GROUP BY on 12 regions is negligible.

> **Interviewers Watch For**
>
> How the candidate handles the reversed-form requirement. This may require a normalization step where two-word prefixes are sorted alphabetically to deduplicate 'prod-db' and 'db-prod'.

> **Common Pitfall**
>
> Forgetting the LOWER() call. Hostnames like 'Web-server-01' and 'web-server-02' share the prefix 'web' only after case normalization.

---

## Common follow-up questions

- What if hostnames do not contain a hyphen? _(INSTR returns 0, causing SUBSTR to behave unexpectedly. A CASE expression or COALESCE is needed.)_
- How would you handle the reversed-form deduplication for multi-word prefixes? _(Sort the two parts alphabetically and concatenate, e.g., MIN(a,b) || '-' || MAX(a,b).)_
- What if you needed the actual hostnames, not just counts? _(Use GROUP_CONCAT or a subquery to list distinct hostnames per region.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/unique_hostnames_per_region)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.