# Pipeline Recovery by Priority

> Recovery time, priority by priority.

Canonical URL: <https://datadriven.io/problems/pipeline_recovery_by_priority>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

For each pipeline run that completed successfully after January 1, 2026, find every batch job that started on the same calendar day. Show the pipeline name and the batch job priority, along with the minimum, average, and maximum pipeline duration in seconds for each pipeline name and priority combination.

## Worked solution and explanation

### The mental model

Infra teams ask this to see if you spot a missing foreign key and name it. `data_pipes` and `batch_jobs` share no real relationship, just a calendar day. They want to hear you say 'there's no FK here, I'm inferring a same-day join' out loud, then handle the many-to-many cartesian explosion without inflating your aggregates. Silent assumption = silent rejection.

---

### What the aggregates actually measure

Here is the part that catches everyone. If 1 pipe run on Monday matches 4 batch jobs at priority HIGH, your join produces 4 copies of that pipe row, each tagged HIGH. Then `MIN/AVG/MAX(dur_secs)` runs over 4 copies of the same dur_secs value. MIN and MAX are unaffected (the dup gives the same number), but AVG is unaffected too because it averages the same value 4 times. So the aggregates are accidentally correct, but if you ever added `COUNT(*)` or `SUM(dur_secs)` you would inflate by the per-day fanout factor. Knowing why this works here teaches you the trap for next time.

---

### The solution

**Same-day calendar join, grouped by priority**

```sql
SELECT
    dp.pipe_name,
    bj.priority,
    MIN(dp.dur_secs) AS min_duration,
    AVG(dp.dur_secs) AS avg_duration,
    MAX(dp.dur_secs) AS max_duration
FROM data_pipes dp
INNER JOIN batch_jobs bj
    ON SUBSTR(dp.start_at, 1, 10) = SUBSTR(bj.started, 1, 10)
WHERE dp.start_at > '2026-01-01'
  AND dp.status = 'success'
GROUP BY dp.pipe_name, bj.priority
```

> **DATE() vs SUBSTR**
>
> `SUBSTR(start_at, 1, 10)` extracts `YYYY-MM-DD` from the ISO-8601 timestamp prefix. `DATE(start_at)` would do the same in SQLite. Either works; SUBSTR is portable to engines without a DATE function but breaks if your timestamp format is not ISO. **Prefer `DATE()` when available**, fall back to SUBSTR for raw text columns.

> **Why the planner can't use indexes here**
>
> Both sides have a function on the join column, which kills index seeks. The planner will full-scan both tables, hash on the derived date, and join. With 80k pipes and 400k jobs that is fine. At 100x scale it is not: the right fix is materializing `start_date AS DATE(start_at)` as a generated column on each table and indexing it. Then the join becomes a plain equality on indexed columns.

> **What interviewers actually score**
>
> Strong candidates ask three things up front. (1) 'Is there a real foreign key I am missing?' If yes, use it; same-day is a smell. (2) 'What does same-day mean for jobs that span midnight?' SUBSTR uses wall-clock date at the storage timezone, which can split a single logical run across two days. (3) 'If priority is on the job and duration is on the pipe, am I really measuring pipeline cost by job priority, or just correlating two unrelated things?' That last question is the senior-engineer move.

> **Status value gotcha**
>
> The status filter is `dp.status = 'success'`, not `'completed'`. Get the wrong value and your query silently returns zero rows. **Always inspect the distinct values of any column you filter on before writing the WHERE clause** (`SELECT DISTINCT status FROM data_pipes`). The dataset uses 'success', 'failed', 'running', 'cancelled'.

---

## Common follow-up questions

- Two pipes ran successfully on the same Tuesday. Five batch jobs ran that Tuesday: 3 at HIGH, 2 at LOW. How many rows does the join produce, and how many output rows after GROUP BY? _(Tests cardinality understanding. Answer: 2 pipes x 5 jobs = 10 rows from the join, then grouped to 2 (pipe_name) x 2 (priority) = up to 4 output rows. Each pipe's dur_secs appears in the aggregate as many times as jobs at that priority that day.)_
- How would you sanity-check whether the same-day join is inflating any of your aggregates? _(Tests the candidate's debugging instinct. Right answer: try SUM(dur_secs) too. If SUM is wildly off but AVG looks fine, you are catching cardinality inflation. The fix is usually deduplicating the right side before the join.)_
- Production runs in three timezones. A pipeline starts at 11:55pm PT and a batch job starts at 12:05am ET. Do they match? Should they? _(Tests timezone awareness. Real-world answer: store timestamps in UTC, do the SUBSTR there, and surface the user's local date in the presentation layer. Doing date math in mixed timezones causes off-by-one bugs that are nearly impossible to debug.)_
- Your VP says this query is slow and asks you to make it 10x faster without scaling hardware. What do you change? _(Tests practical schema design. Best answer: add a run_id or schedule_id linking pipes and jobs at the application layer. Failing that, materialize a generated column run_date on both tables and index it. SUBSTR-on-string is the worst option in production.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/pipeline_recovery_by_priority)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.