# Pipeline Throughput Ratio

> Compute current-to-initial value ratio per period.

Canonical URL: <https://datadriven.io/problems/pipeline_throughput_ratio>

Domain: SQL · Difficulty: easy · Seniority: L5

## Problem

The data platform team is measuring pipeline efficiency as the ratio of rows output to rows input per run, rounded to 4 decimal places. Show each run's pipeline name, start time, and throughput ratio, ordered by pipeline name and start time.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets filtering and projection across the `data_pipes` table. You need to work with columns like `rows_in`, `rows_out`, and `pipe_name` to satisfy the requirements.

---

### Break down the requirements

#### Step 1: Compute the ratio

`ROUND(rows_out * 1.0 / rows_in, 4)` calculates throughput as a decimal to 4 places. The `* 1.0` prevents integer division.

#### Step 2: Select and order

Return `pipe_name`, `start_at`, and the computed ratio, ordered by `pipe_name, start_at`.

---

### The solution

**Compute the ratio to find pipeline throughput ratio**

```sql
SELECT
    pipe_name,
    start_at,
    ROUND(rows_out * 1.0 / rows_in, 4) AS throughput_ratio
FROM data_pipes
ORDER BY pipe_name, start_at
```

> **Cost Analysis**
>
> With `data_pipes` (50,000 rows), this query scans a small dataset. No indexing is needed for this volume. At production scale, an index on the primary filter column would improve performance.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- If `start_at` in `data_pipes` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `data_pipes.start_at`.)_
- The `rows_in` column in `data_pipes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `data_pipes.rows_in` and its impact on distributed query performance.)_
- How would you modify this query if the business logic required grouping by both `pipe_id` and `pipe_name` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/pipeline_throughput_ratio)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.