# Servers Returning to Origin

> Servers that migrated back home.

Canonical URL: <https://datadriven.io/problems/servers_returning_to_origin>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

How many servers ended up back in their original region? 'Original' means the region of their first recorded migration, and 'ended up' means the region of their last. Return a single count.

## Worked solution and explanation

### Why this problem exists in real interviews

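This problem tests whether you can pick the first and last value of an ordered partition with window functions, which depends on specifying the window frame explicitly. It uses the `infra_nodes` table, mainly the `node_id` and `region` columns, plus whichever column orders each server's migrations in time.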

> **Trick to Solving**
>
> First-and-last-per-group problems require an explicit frame clause. The default frame stops at the current row, which is rarely what you want for `LAST_VALUE`.
>
> 1. Identify the two endpoints from the prompt (here, each server's first and last recorded migration)
> 2. Use `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` so both functions see the whole partition
> 3. Partition by the grouping key, order by the time column

---

### Break down the requirements

#### Step 1: Find each server's first and last region

The derived table computes `FIRST_VALUE(region)` and `LAST_VALUE(region)` for each `node_id` partition. Every row of a server carries the same pair of values, so the outer query can compare them directly. This avoids a self-join.

#### Step 2: Keep servers whose first and last regions match

The `WHERE first_region = last_region` filter sits in the outer query because window function results cannot be referenced in the `WHERE` clause of the query that computes them.

#### Step 3: Count each server once

The window step returns one row per migration, not one row per server, so `COUNT(DISTINCT node_id)` is needed to count each server once. The prompt asks for a single count, so the final output needs no `ORDER BY`.
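
To see what the derived table hands to the outer query, the sketch below runs the window step on a few inlined rows. It is an illustration under assumptions: the sample rows are invented, and `migrated_at` is a hypothetical name for the column that orders a server's migrations (the problem text does not name one).

```sql
-- Sample rows are invented; migrated_at is an assumed column name.
WITH infra_nodes (node_id, region, migrated_at) AS (
    VALUES
        (1, 'us-east',  DATE '2024-01-01'),
        (1, 'eu-west',  DATE '2024-02-01'),
        (1, 'us-east',  DATE '2024-03-01'),
        (2, 'us-east',  DATE '2024-01-05'),
        (2, 'ap-south', DATE '2024-02-10'),
        (3, 'eu-west',  DATE '2024-01-10')
)
SELECT
    node_id,
    region,
    migrated_at,
    FIRST_VALUE(region) OVER w AS first_region,
    LAST_VALUE(region)  OVER w AS last_region
FROM infra_nodes
-- One named window keeps both functions on the same partition, order, and frame.
WINDOW w AS (
    PARTITION BY node_id
    ORDER BY migrated_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY node_id, migrated_at;
```

With these rows, all three rows of server 1 carry `us-east` in both columns, both rows of server 2 carry `us-east` and `ap-south`, and the single row of server 3 carries `eu-west` in both. Because the pair repeats on every row of a server, counting rows instead of distinct servers would overcount.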

---

### The solution

**Whole-partition window for servers returning to origin**

```sql
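-- Assumption: migrated_at stands in for the migration timestamp column of
-- infra_nodes; substitute the real column name. Ordering by node_id would tie
-- every row within a partition, leaving "first" and "last" arbitrary.
SELECT COUNT(DISTINCT node_id) AS servers_returning
FROM (
    SELECT
        node_id,
        FIRST_VALUE(region) OVER w AS first_region,
        LAST_VALUE(region)  OVER w AS last_region
    FROM infra_nodes
    -- The frame spans the whole partition so LAST_VALUE sees the final row.
    WINDOW w AS (
        PARTITION BY node_id
        ORDER BY migrated_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
) sub
WHERE first_region = last_region;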
```
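
To try the approach end to end without a real table, the following sketch inlines a handful of rows and runs the same count. The data is invented, and `migrated_at` is an assumed name for the migration timestamp column, since the problem statement does not name it.

```sql
-- Sample rows are invented; migrated_at is an assumed column name.
WITH infra_nodes (node_id, region, migrated_at) AS (
    VALUES
        (1, 'us-east',  DATE '2024-01-01'),
        (1, 'eu-west',  DATE '2024-02-01'),
        (1, 'us-east',  DATE '2024-03-01'),
        (2, 'us-east',  DATE '2024-01-05'),
        (2, 'ap-south', DATE '2024-02-10'),
        (3, 'eu-west',  DATE '2024-01-10')
),
sub AS (
    SELECT
        node_id,
        FIRST_VALUE(region) OVER w AS first_region,
        LAST_VALUE(region)  OVER w AS last_region
    FROM infra_nodes
    WINDOW w AS (
        PARTITION BY node_id
        ORDER BY migrated_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
)
SELECT
    COUNT(DISTINCT node_id) AS servers_returning,  -- 2 (servers 1 and 3)
    COUNT(*)                AS matching_rows       -- 4 (one per migration row)
FROM sub
WHERE first_region = last_region;
```

Server 3 has a single migration, so its first and last regions are trivially equal and it counts as returning under the prompt's definitions. Whether such servers should be included is worth confirming before finalizing the query.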

> **Cost Analysis**
>
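> With ~10,000 rows, the window functions run over every row of `infra_nodes`, because the only filter is applied afterwards in the outer query. The main cost is sorting each `node_id` partition; an index on the partition and ordering columns may let the planner read rows in sorted order instead of sorting them.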

> **Interviewers Watch For**
>
> Interviewers watch for whether you explicitly define the window frame or rely on defaults that may not match the requirement; whether you use a subquery or self-join, and can explain the tradeoffs; whether you know when DISTINCT is needed and when it masks a logic error.
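
One way to discuss the tradeoffs is to compare the window version with an aggregate version that yields one row per server, so no `DISTINCT` is needed. The sketch below is an alternative technique, not the solution given on this page. It relies on PostgreSQL's ordered `ARRAY_AGG`, uses invented sample rows, and assumes a hypothetical `migrated_at` timestamp column.

```sql
-- Sample rows are invented; migrated_at is an assumed column name.
WITH infra_nodes (node_id, region, migrated_at) AS (
    VALUES
        (1, 'us-east',  DATE '2024-01-01'),
        (1, 'eu-west',  DATE '2024-02-01'),
        (1, 'us-east',  DATE '2024-03-01'),
        (2, 'us-east',  DATE '2024-01-05'),
        (2, 'ap-south', DATE '2024-02-10'),
        (3, 'eu-west',  DATE '2024-01-10')
),
per_server AS (
    SELECT
        node_id,
        -- Element 1 of the ascending array is the earliest region.
        (ARRAY_AGG(region ORDER BY migrated_at ASC))[1]  AS first_region,
        -- Element 1 of the descending array is the latest region.
        (ARRAY_AGG(region ORDER BY migrated_at DESC))[1] AS last_region
    FROM infra_nodes
    GROUP BY node_id
)
SELECT COUNT(*) AS servers_returning  -- 2 (servers 1 and 3)
FROM per_server
WHERE first_region = last_region;
```

Because `GROUP BY node_id` already collapses each server to one row, a plain `COUNT(*)` is enough here, which makes the role of `DISTINCT` in the window version easier to explain.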

> **Common Pitfall**
>
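> Omitting the explicit frame clause (`ROWS BETWEEN ...`) relies on the default, which is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` when the window has an `ORDER BY`. Under that frame, `LAST_VALUE` returns the value of the current row (or of its last peer) rather than the final value in the partition.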
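
The effect of the default frame can be checked directly by computing `LAST_VALUE` both ways and comparing the resulting counts. This is a sketch on invented rows, with `migrated_at` as an assumed name for the migration timestamp column.

```sql
-- Sample rows are invented; migrated_at is an assumed column name.
WITH infra_nodes (node_id, region, migrated_at) AS (
    VALUES
        (1, 'us-east',  DATE '2024-01-01'),
        (1, 'eu-west',  DATE '2024-02-01'),
        (1, 'us-east',  DATE '2024-03-01'),
        (2, 'us-east',  DATE '2024-01-05'),
        (2, 'ap-south', DATE '2024-02-10'),
        (3, 'eu-west',  DATE '2024-01-10')
),
framed AS (
    SELECT
        node_id,
        FIRST_VALUE(region) OVER (
            PARTITION BY node_id ORDER BY migrated_at
        ) AS first_region,
        -- Default frame: ends at the current row.
        LAST_VALUE(region) OVER (
            PARTITION BY node_id ORDER BY migrated_at
        ) AS last_default,
        -- Explicit frame: spans the whole partition.
        LAST_VALUE(region) OVER (
            PARTITION BY node_id ORDER BY migrated_at
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS last_explicit
    FROM infra_nodes
)
SELECT
    COUNT(DISTINCT node_id) FILTER (WHERE first_region = last_default)  AS count_default_frame,   -- 3
    COUNT(DISTINCT node_id) FILTER (WHERE first_region = last_explicit) AS count_explicit_frame   -- 2
FROM framed;
```

With the default frame, the first row of every server has `LAST_VALUE` equal to its own region, which is also the first region, so every server passes the filter, including server 2, which ended in a different region.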

---

## Common follow-up questions

- What would happen to your result if `infra_nodes.hostname` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `hostname` and uses DISTINCT or deduplication where needed.)_
- The `cpu_pct` column in `infra_nodes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `infra_nodes.cpu_pct` and its impact on distributed query performance.)_
- How would you modify this query if the business logic required grouping by both `node_id` and `hostname` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/servers_returning_to_origin)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.