# Pipeline Completion Rate

> How far do users get through the flow?

Canonical URL: <https://datadriven.io/problems/pipeline_completion_rate>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

For each pipeline, calculate the average completion percentage, where completion is the ratio of rows out to rows in. Exclude runs where rows in is zero. Show the pipeline name and its average completion percentage.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `data_pipes` table, this challenge probes your ability to apply grouped AVG aggregation in a data pipeline setting. Correctly referencing columns like `rows_in`, `pipe_name`, and `rows_out` is essential to a working solution.

---

### Break down the requirements

#### Step 1: Exclude zero-input runs

`WHERE rows_in > 0` prevents division by zero.

#### Step 2: Group by pipeline name

`GROUP BY pipe_name` aggregates all runs of the same pipeline.

#### Step 3: Compute average completion percentage

`AVG(rows_out * 100.0 / rows_in)` gives the average completion rate per pipeline.

---

### The solution

**Exclude zero-input runs to find pipeline completion rate**

```sql
SELECT
    pipe_name,
    AVG(rows_out * 100.0 / rows_in) AS avg_completion_pct
FROM data_pipes
WHERE rows_in > 0
GROUP BY pipe_name
```

> **Cost Analysis**
>
> The query scans `data_pipes` (100,000 rows). A covering index on the filter and group columns would reduce I/O. At this scale, the full scan is acceptable but becomes costly if the table grows 10x.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- The `start_at` column in `data_pipes` has a 1% null rate. How does your query handle rows where `start_at` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `data_pipes.start_at` and understands how aggregates skip NULL values.)_
- The `pipe_name` column in `data_pipes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `data_pipes.pipe_name` and its impact on distributed query performance.)_
- If this query ran as a scheduled job, how would you add monitoring to detect when the result set is suspiciously empty? _(Tests operational awareness around scheduled query jobs.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/pipeline_completion_rate)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.