# Highest Throughput Pipelines

> The pipes that carry the most water.

Canonical URL: <https://datadriven.io/problems/highest_throughput_pipelines>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The data platform team is sizing infrastructure for next year based on 2025 peak loads. For each pipeline, show its maximum rows output from any single run that year, sorted from highest throughput to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

Capacity planning teams budget infrastructure around peak load, not average load. Interviewers ask this kind of question to see whether a candidate reaches for `MAX` per group rather than `SUM` or `AVG`, and whether they correctly extract the year from a TEXT timestamp. Getting those two choices right separates engineers who reason about the metric from those who pattern-match on 'group by name'.

---

### Break down the requirements

#### Step 1: Filter to the reference year

`start_at` is stored as TEXT, so use `strftime('%Y', start_at) = '2026'` to keep only runs that started in 2025. This sits in WHERE because it is a row predicate that can be evaluated before aggregation.

#### Step 2: Aggregate peak rows_out per pipeline

GROUP BY `pipe_name` and compute `MAX(rows_out)`. MAX (not SUM, not AVG) is what answers 'peak throughput from any single run'. Each row in `data_pipes` is one run, so MAX picks the largest output across that pipeline's runs in 2026.

#### Step 3: Sort highest to lowest

`ORDER BY max_throughput DESC` lines pipelines up so the heaviest hitters surface first. Project just two columns (`pipe_name` and `max_throughput`) since the prompt asks for the pipeline and its peak.

---

### The solution

**Year filter, MAX per pipeline, ranked**

```sql
SELECT pipe_name, MAX(rows_out) AS max_throughput FROM data_pipes WHERE strftime('%Y', start_at) = '2026' GROUP BY pipe_name ORDER BY max_throughput DESC
```

> **Cost Analysis**
>
> `data_pipes` is 120K rows (~46 MB), so even a full scan is cheap. The bigger concern is that `strftime('%Y', start_at)` is a function on a column, which prevents index use on `start_at`. If this table grew to billions of rows you would store `start_year` as a precomputed column or partition by year so the filter can prune partitions instead of scanning every row.

> **Interviewers Watch For**
>
> Interviewers look for two things: did you pick MAX (peak run) instead of SUM (cumulative throughput) or AVG (typical run), and did you handle the TEXT timestamp correctly? Strong candidates also note that `pipe_name` has only 50 distinct values, so the GROUP BY produces a tiny result set even though the input scan is full-table.

> **Common Pitfall**
>
> Writing `WHERE start_at LIKE '2026%'` works for ISO-formatted strings but breaks if the format ever changes (e.g., to `MM/DD/YYYY`). Using `strftime` is format-aware. Another trap: using `SUM(rows_out)` instead of `MAX(rows_out)` answers a different question, total annual throughput rather than peak single-run throughput, which is what capacity planning actually needs.

---

## Common follow-up questions

- `start_at` has roughly 1% NULLs. What happens to those rows under your filter, and is that the behavior you want? _(Tests NULL handling. `strftime('%Y', NULL)` returns NULL, and `NULL = '2026'` is NULL (not TRUE), so NULL-`start_at` runs drop out of the filter. That is usually correct here, since a run with no start time cannot be attributed to a year.)_
- If product wanted both peak throughput and the date that peak occurred, how would the query change? _(Tests whether the candidate knows when MAX is no longer enough. They need a window function or a self-join: `ROW_NUMBER() OVER (PARTITION BY pipe_name ORDER BY rows_out DESC)` keeps the peak row's `start_at` accessible, where a plain `GROUP BY` discards it.)_
- `status` has 4 distinct values. Should you filter to only `status = 'success'` runs before computing peak throughput, and why? _(Tests product reasoning. A failed run might have written partial output, so its `rows_out` may not represent real throughput. Most capacity planners filter to successful runs only. The candidate should ask the question rather than assume.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/highest_throughput_pipelines)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.