# Pipeline Duration vs Throughput

> Does throughput correlate with duration?

Canonical URL: <https://datadriven.io/problems/pipeline_duration_vs_throughput>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Does pipeline throughput correlate with duration? For each pipeline, calculate the average duration and the average net row output (rows_out minus 10% of rows_in as overhead). Then compute the correlation between average duration and average net output across all pipelines, rounded to 2 decimals. Return a single correlation value.

## Worked solution and explanation

### Why this problem exists in real interviews

Observability dashboards over pipeline runs often ask 'do slower pipelines ship fewer rows?' The honest answer is a correlation coefficient, and the test is whether the candidate can spell out Pearson's formula in SQL without a stats library. It also tests two-stage aggregation: the correlation is between pipelines, not between individual runs, so the candidate has to collapse to the per-pipeline grain first.

> **Trick to Solving**
>
> The word "correlation" with no library implies manual Pearson. Pearson needs three sums: `SUM((x-x_mean)(y-y_mean))`, `SUM((x-x_mean)^2)`, `SUM((y-y_mean)^2)`. Before you can compute them, collapse each pipeline to one (x, y) point by averaging its runs.

---

### Break down the requirements

#### Step 1: Collapse runs to per-pipeline averages

In a CTE, `GROUP BY pipe_name` and compute `AVG(dur_secs)` as the x and `AVG(rows_out - 0.1 * rows_in)` as the y. Each pipeline becomes one point.

#### Step 2: Compute the global means of x and y

Take `AVG(avg_dur)` and `AVG(avg_net_output)` over the CTE. These are the centroids Pearson needs. Cross-join them back to the CTE so every row has access to both means.

#### Step 3: Apply the Pearson formula

Numerator: `SUM((x - x_mean) * (y - y_mean))`. Denominator: `SQRT(SUM((x - x_mean)^2) * SUM((y - y_mean)^2))`. Divide and round to two decimals.

#### Step 4: Return a single row

The answer is one scalar named `correlation`. No GROUP BY on the outer query.

---

### The solution

**Manual Pearson correlation across pipelines**

```sql
WITH pipe_stats AS (
    SELECT pipe_name,
           AVG(dur_secs)                      AS avg_dur,
           AVG(rows_out - 0.1 * rows_in)      AS avg_net_output
    FROM data_pipes
    GROUP BY pipe_name
),
means AS (
    SELECT AVG(avg_dur) AS dur_mean,
           AVG(avg_net_output) AS out_mean
    FROM pipe_stats
)
SELECT ROUND(
    SUM((ps.avg_dur - m.dur_mean) * (ps.avg_net_output - m.out_mean))
    / (SQRT(SUM((ps.avg_dur - m.dur_mean) * (ps.avg_dur - m.dur_mean)))
       * SQRT(SUM((ps.avg_net_output - m.out_mean) * (ps.avg_net_output - m.out_mean)))),
    2
) AS correlation
FROM pipe_stats ps
CROSS JOIN means m
```

> **Cost Analysis**
>
> `data_pipes` has 150,000 runs across ~55 pipelines. The inner aggregate scans once to produce 55 rows, and every downstream operation works on those 55 rows. The outer Pearson math is free. At 10x scale the inner aggregate is still the only expensive step.

> **Interviewers Watch For**
>
> Two signals. First, the candidate collapses to per-pipeline before computing correlation, rather than correlating raw runs. Second, the candidate spells out Pearson from first principles without asking for a library function. A strong candidate also mentions the division-by-zero risk when variance in either variable is zero.

> **Common Pitfall**
>
> Integer arithmetic silently truncates `0.1 * rows_in` if `rows_in` is an INTEGER and the engine evaluates the literal as INTEGER. Writing `0.1` (already a DOUBLE) avoids the trap, but `1/10 * rows_in` would truncate to zero. The other common trap: computing correlation over raw runs, which answers a different question.

---

## Common follow-up questions

- What if some runs have NULL `dur_secs`? _(`AVG` ignores NULLs, so per-pipeline averages still work, but pipelines where every run has NULL drop out. Decide whether to exclude those pipelines or treat them as zero.)_
- How would you handle a pipeline whose variance in duration is zero? _(Pearson's denominator becomes zero and the query errors or returns NULL. Guard with `NULLIF` around the denominator or filter such pipelines out.)_
- How would you express this as Spearman correlation instead? _(Replace the raw averages with `RANK()` over the two variables and apply the Pearson formula to the ranks.)_
- Does your engine materialize the CTE or inline it? _(SQLite materializes CTEs by default. In Postgres, `WITH pipe_stats AS ...` was an optimization fence pre-12 and is inlined by default from 12+. Matters when `pipe_stats` is referenced twice.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/pipeline_duration_vs_throughput)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.