# Pipeline Run History

> The lineage trail.

Canonical URL: <https://datadriven.io/problems/pipeline_run_history>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

For each pipeline in data_pipes, show the first run date, the most recent run date, and the total number of separate runs, with the most recently started pipelines first.

## Worked solution and explanation

### Why this problem exists in real interviews

Drawn from a data pipeline domain, this question centers on grouped COUNT aggregation over the `data_pipes` table. The tricky part is handling the `pipe_name` and `start_at` columns correctly under the given constraints.

---

### Break down the requirements

#### Step 1: Group by pipeline name

`GROUP BY pipe_name` collapses all runs into one summary row per pipeline.

#### Step 2: Compute first run, last run, and count

`MIN(start_at)`, `MAX(start_at)`, and `COUNT(*)` give the timeline and frequency.

#### Step 3: Order by most recent first

`ORDER BY MAX(start_at) DESC` surfaces the most recently active pipelines.

---

### The solution

**Group by pipeline name to find pipeline run history**

```sql
SELECT
    pipe_name,
    MIN(start_at) AS first_run,
    MAX(start_at) AS last_run,
    COUNT(*) AS total_runs
FROM data_pipes
GROUP BY pipe_name
ORDER BY MAX(start_at) DESC
```

> **Cost Analysis**
>
> With `data_pipes` (60,000 rows), this query scans a small dataset. No indexing is needed for this volume. At production scale, an index on the primary filter column would improve performance.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- If `start_at` in `data_pipes` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `data_pipes.start_at`.)_
- The `pipe_name` column in `data_pipes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `data_pipes.pipe_name` and its impact on distributed query performance.)_
- `data_pipes.status` only has 4 distinct values. If a new category were added, would your query automatically include it? _(Tests whether the query hard-codes values or dynamically adapts to `status` changes.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/pipeline_run_history)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.