# Running Node Pairs

> Two servers, same region, both alive.

Canonical URL: <https://datadriven.io/problems/running_node_pairs>

Domain: SQL · Difficulty: easy · Seniority: L4

## Problem

Find pairs of servers in the same region that are both running. Show unique pairs where the lower node ID appears in the first column.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is self-join, applied to the `infra_nodes` table in a infrastructure context. Getting columns like `node_id`, `region`, and `status` right is where most candidates slip.

---

### Break down the requirements

#### Step 1: Self-join the table to pair rows

Join `infra_nodes` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 2: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

---

### The solution

**Self-join the table to pair rows to find running node pairs**

```sql
SELECT DISTINCT a.node_id AS node_id_1, b.node_id AS node_id_2, a.region
FROM infra_nodes a
INNER JOIN infra_nodes b ON a.region = b.region AND a.node_id < b.node_id
WHERE LOWER(a.status) = 'running' AND LOWER(b.status) = 'running'
ORDER BY a.node_id, b.node_id
```

> **Cost Analysis**
>
> With ~8,000 rows, the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you prevent duplicate pairs and ensure the join condition is correct.

> **Common Pitfall**
>
> A self-join without an inequality condition (`a.id < b.id`) produces duplicate mirrored pairs and a self-pair row for each entry.

---

## Common follow-up questions

- What would happen to your result if `infra_nodes.region` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `region` and uses DISTINCT or deduplication where needed.)_
- The `cpu_pct` column in `infra_nodes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `infra_nodes.cpu_pct` and its impact on distributed query performance.)_
- If this query ran as a scheduled job, how would you add monitoring to detect when the result set is suspiciously empty? _(Tests operational awareness around scheduled query jobs.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/running_node_pairs)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.