# Efficient Pipeline Throughput

> Throughput per pipeline. The benchmark.

Canonical URL: <https://datadriven.io/problems/efficient_pipeline_throughput>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The data platform team is benchmarking throughput for fast-finishing pipelines. For pipelines that complete in 45 minutes or less (2700 seconds), show the average rows output per pipeline, sorted from highest throughput to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

Pipeline throughput analysis is a core data engineering concern. This tests derived metrics (rows per unit time), division safety with `NULLIF`, and aggregation with a qualifying filter for fast-finishing pipelines.

---

### Break down the requirements

#### Step 1: Filter for fast-finishing pipelines

The prompt specifies pipelines that complete in 45 minutes or less. Calculate duration from `start_at` and `end_at` columns in `data_pipes`.

#### Step 2: Compute throughput

`rows_out / duration` gives rows per second. Use `NULLIF` to guard against zero-duration runs.

#### Step 3: Aggregate per pipeline

`GROUP BY pipe_name` with the throughput computation.

---

### The solution

**Throughput with duration filter and zero-guard**

```sql
SELECT pipe_name,
       SUM(rows_out) AS total_rows,
       ROUND(SUM(rows_out) * 1.0 / NULLIF(SUM(rows_in), 0), 2) AS throughput_ratio
FROM data_pipes
WHERE status = 'completed'
GROUP BY pipe_name
ORDER BY throughput_ratio DESC
```

> **Cost Analysis**
>
> Single-pass aggregation over `data_pipes`. A partial index on `(pipe_name, status)` with included columns makes this an index-only scan.

> **Interviewers Watch For**
>
> The division-by-zero guard with `NULLIF` is the critical signal. Candidates who skip it demonstrate a gap in production awareness.

> **Common Pitfall**
>
> Dividing by zero produces an error in most dialects. `COALESCE(..., 0)` as a fallback just hides the problem; `NULLIF` is the correct guard.

---

## Common follow-up questions

- How would you identify pipelines whose throughput degraded over time? _(Tests LAG window function for trend comparison.)_
- What if some pipelines process zero rows intentionally? _(Tests whether to filter them or include them with zero throughput.)_
- How would you define throughput differently for streaming vs. batch? _(Tests conceptual understanding beyond SQL.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/efficient_pipeline_throughput)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.