# Longest Running Pipeline

> One pipeline outlasted them all.

Canonical URL: <https://datadriven.io/problems/longest_running_pipeline>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Which data pipeline ran the longest? Return just that pipeline's name.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem is a top-1 selection on the `data_pipes` table: order the rows by run duration and return the `pipe_name` of the first one. Interviewers favor it in mid-level screens because it exposes whether candidates handle ties, NULLs, and ordering edge cases correctly.

> **Trick to Solving**
>
> Read the prompt carefully for implicit constraints. "Return just that pipeline's name" fixes the grain of the output: a single row with a single column, `pipe_name`.
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: innermost subquery first, then layer on filters and aggregates
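
As one illustration of building inside-out, the same question can be answered by writing the innermost piece first: a scalar subquery that finds the maximum duration, wrapped by an outer filter. This is a sketch rather than the reference solution. It assumes `dur_secs` holds each run's duration, and unlike a `LIMIT 1` query it returns every pipeline tied for the maximum.

```sql
-- Inner query first: the largest recorded duration (MAX ignores NULLs).
-- Outer query second: keep the pipeline(s) whose duration equals it.
SELECT pipe_name
FROM data_pipes
WHERE dur_secs = (
    SELECT MAX(dur_secs)
    FROM data_pipes
);
```
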

---

### Break down the requirements

#### Step 1: Read from `data_pipes`

The query reads from `data_pipes`, which has 7 columns. Only two of them matter here: `pipe_name` for the output and `dur_secs` as the sort key.

#### Step 2: Order and limit the output

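Sort by `dur_secs` in descending order and apply `LIMIT 1` to keep only the longest run. Make the sort deterministic so the result is reproducible: add a tie-breaker column so equal durations always resolve the same way, and remember that Postgres places NULLs first in a descending sort unless you specify `NULLS LAST`.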
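
A minimal sketch of a deterministic version of this step, written for Postgres. The choice of `pipe_name` as the tie-breaker is an assumption for illustration; the prompt does not say how ties should be resolved, and any unique column would serve.

```sql
-- Assumption: pipe_name is an acceptable tie-breaker (not stated in the prompt).
SELECT pipe_name
FROM data_pipes
ORDER BY dur_secs DESC NULLS LAST,  -- longest first; rows with no duration go last
         pipe_name ASC              -- same winner every time when durations tie
LIMIT 1;
```

Without `NULLS LAST`, a single row with a NULL `dur_secs` would sort to the top in Postgres and be returned as the "longest" pipeline.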

#### Step 3: Return the result set

Select only `pipe_name`. The prompt asks for nothing else, so no aliasing or formatting is needed.

---

### The solution

**Direct sort for longest pipeline**

```sql
SELECT pipe_name
FROM data_pipes
ORDER BY dur_secs DESC
LIMIT 1
```

> **Cost Analysis**
>
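> The query scans 80K rows from `data_pipes`. Because of `LIMIT 1`, Postgres can keep just the current top row while it reads (a top-N sort) instead of fully sorting the table, so the dominant cost is the single pass over the rows.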
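
To check the cost yourself, one option is to look at the plan and then try an index on the sort column. The sketch below assumes Postgres; the index name `idx_data_pipes_dur_secs` is hypothetical, and the actual plan depends on your data and settings.

```sql
-- Show the executed plan; with no index on dur_secs you would typically
-- see a sequential scan feeding a sort node.
EXPLAIN ANALYZE
SELECT pipe_name
FROM data_pipes
ORDER BY dur_secs DESC
LIMIT 1;

-- Hypothetical index (the name is illustrative only). A default b-tree can be
-- read backward, which matches ORDER BY dur_secs DESC with default NULL placement.
CREATE INDEX idx_data_pipes_dur_secs ON data_pipes (dur_secs);
```

If the query orders with `DESC NULLS LAST` instead, the index would need the matching definition, `(dur_secs DESC NULLS LAST)`, for the planner to use it for ordering.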

> **Interviewers Watch For**
>
> Candidates who verbalize their approach before typing, naming the output columns and expected row count, consistently perform better.

> **Common Pitfall**
>
> Returning more columns than the prompt asks for can trigger a "wrong schema" failure in automated grading. Match the output specification exactly.

---

## Common follow-up questions

- What happens to your result if `data_pipes.start_at` contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on `start_at`.)_
- How would you verify that your aggregation on `data_pipes.pipe_id` is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- If `data_pipes` grows to hundreds of millions of rows, how would you partition or index on `start_at` to maintain performance? _(Tests partitioning strategy for time-series data in `start_at`.)_
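
The first two follow-ups can be explored with quick diagnostic queries. This is a sketch under two assumptions that the problem statement does not confirm: that `pipe_id` is meant to identify a row uniquely, and that `start_at` and `dur_secs` are nullable.

```sql
-- How many rows are missing a start time or a duration?
SELECT COUNT(*)                                 AS total_rows,
       COUNT(*) FILTER (WHERE start_at IS NULL) AS null_start_at,
       COUNT(*) FILTER (WHERE dur_secs IS NULL) AS null_dur_secs
FROM data_pipes;

-- Which pipe_id values appear more than once?
-- (Assumes pipe_id should be unique per row.)
SELECT pipe_id, COUNT(*) AS row_count
FROM data_pipes
GROUP BY pipe_id
HAVING COUNT(*) > 1
ORDER BY row_count DESC;
```

An empty result from the second query suggests there are no duplicate `pipe_id` rows to inflate an aggregate.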

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.