# The Legacy Hunt

> Old data. Still matters.

Canonical URL: <https://datadriven.io/problems/the_legacy_hunt>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The data engineering team is sunsetting pipelines that predate a recent infrastructure migration. Surface every unique pipeline name that had at least one run before July 1, 2026.

## Worked solution and explanation

### The mental model

The real lesson hiding inside this trivial-looking query is the phrase **"at least one"**. It signals existence, not aggregation. Three idioms compute the same answer over `data_pipes`: `SELECT DISTINCT pipe_name WHERE start_at < cutoff`, `GROUP BY pipe_name HAVING COUNT(*) >= 1`, and `WHERE EXISTS (subquery)`. For a sunset audit you want a yes/no per pipeline, so `DISTINCT` is the cleanest fit. The sunset team does not want counts; they want the set of names to retire.

---

### The two traps

#### Step 1: Trap 1: row count is not pipeline count

Forgetting `DISTINCT`. Each row in `data_pipes` is a run, not a pipeline. A pipeline that ran nightly for six months before the cutoff would show up roughly 180 times. The sunset team gets a 30k-row dump that looks like garbage, opens a ticket, and assumes the ingestion is broken. The shape of the answer (set of names) must match the shape of the question (which pipelines).

#### Step 2: Trap 2: half-open intervals

Date boundary off-by-one. `start_at < '2026-07-01'` means strictly before midnight on July 1, which is exactly what the prompt asks. If you write `start_at <= '2026-07-01'` against a TIMESTAMP, runs at 00:00:00 on July 1 sneak in. If you write `start_at <= '2026-06-30'` against a TIMESTAMP, you silently drop every run during the day of June 30. Strict-less-than against the start of the day after the window is the bug-free pattern.

#### Step 3: Trap 3: smuggling in a status filter

Assuming "at least one run" means "at least one successful run." The prompt never mentions `status`. If you add `AND status = 'success'`, pipelines that exist but only ever failed disappear from the sunset list, and dead-but-noisy pipelines stay in production forever. Read the prompt literally; the question is about existence of any run, not quality of runs.

---

### The solution

**Existence question, DISTINCT answer**

```sql
SELECT DISTINCT pipe_name
FROM data_pipes
WHERE start_at < '2026-07-01';
```

> **Why this stays fast as it scales**
>
> At 60k rows this is instant, but the shape generalizes. `DISTINCT pipe_name` is a hash aggregation over a low-cardinality column (a few dozen pipelines), so the dedup itself is near-free. The cost is the filter. A composite index on `(start_at, pipe_name)` lets the planner range-seek by date and pull only matching pipe_names from the index, skipping a heap fetch. Without it: full scan plus hash dedup. At 60M rows that is seconds versus minutes.

> **What to surface before writing SQL**
>
> A senior candidate raises two questions before writing code. First: does "any run" mean any status, or only successful runs? Second: does "had at least one run before July 1" include pipelines whose first run was before the cutoff but that are still running today? The past tense plus "at least one" implies a pure existence check, no status or recency filter, but stating the assumption aloud is what earns the signal.

> **DISTINCT vs GROUP BY: pick by intent**
>
> The seductive wrong answer: `SELECT pipe_name FROM data_pipes WHERE start_at < '2026-07-01' GROUP BY pipe_name`. It returns the correct rows, but the reader has to mentally execute the `GROUP BY` to realize it is deduplicating. `DISTINCT` signals existence; `GROUP BY` signals aggregation. Use the form that matches your intent, especially in code that other engineers will read at 2am during a sunset rollback.

> **When the two forms diverge**
>
> Postgres and SQLite plan `DISTINCT col` and `GROUP BY col` identically; the optimizer rewrites one into the other. Older MySQL (pre-8.0) sometimes chose worse plans for `DISTINCT` and was faster with `GROUP BY`. If a teammate insists `GROUP BY` is faster, ask which engine and which version before rewriting.

---

## Common follow-up questions

- Rewrite this for pipelines that ran ONLY before July 1 and never after. What changes? _(Pushes you toward an anti-join or NOT EXISTS: pipelines whose max(start_at) is still under the cutoff. This is the real sunset candidate list.)_
- What is the difference between DISTINCT and GROUP BY here, in terms of both performance and reader clarity? _(Tests whether you can articulate intent versus plan. On Postgres and SQLite the plans are identical; the difference is readability. On older MySQL, plans diverge.)_
- If `pipe_name` has NULLs, does DISTINCT collapse them into one row or treat each NULL as its own group? _(Standard SQL collapses all NULLs into one DISTINCT row, treating NULL = NULL for set semantics even though it is NULL in WHERE. This trips people up.)_
- How would you write this with `WHERE EXISTS`, and when is EXISTS the better choice than DISTINCT? _(EXISTS shines when you are filtering one table by membership in another. Here, with a single table, DISTINCT is simpler. Discuss when each is the right tool.)_
- Production wants the true sunset list: pipelines that ran before the cutoff and have had zero activity since. Write that query. _(Forces the candidate to think about the negative case: pipelines that exist in the table but have no runs at or after the cutoff. NOT EXISTS or anti-join.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_legacy_hunt)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.