# Pipeline Overhead by Environment

> Production overhead versus staging.

Canonical URL: <https://datadriven.io/problems/pipeline_overhead_by_environment>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The data platform team is comparing throughput overhead across pipeline states. Calculate the average absolute difference between rows in and rows out for successful pipelines and for running pipelines separately, then return the absolute gap between those two averages as a single value. Treat 'success' case-insensitively (so 'success' and 'Success' both count).

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `data_pipes` table, this challenge probes your ability to apply filtering and projection in a data pipeline setting. Correctly referencing columns like `status`, `rows_in`, and `rows_out` is essential to a working solution.

---

### Break down the requirements

#### Step 1: Compute average overhead for completed pipelines

`AVG(ABS(rows_in - rows_out))` where `status = 'completed'`.

#### Step 2: Compute average overhead for running pipelines

`AVG(ABS(rows_in - rows_out))` where `status = 'running'`.

#### Step 3: Take the difference

Subtract one from the other to get the overhead gap.

---

### The solution

**Compute average overhead for completed pipelines to find pipeline o...**

```sql
SELECT
    AVG(CASE WHEN status = 'running' THEN ABS(rows_in - rows_out) END)
    - AVG(CASE WHEN status = 'completed' THEN ABS(rows_in - rows_out) END) AS overhead_diff
FROM data_pipes
WHERE status IN ('completed', 'running')
```

> **Cost Analysis**
>
> With `data_pipes` (90,000 rows), this query scans a small dataset. No indexing is needed for this volume. At production scale, an index on the primary filter column would improve performance.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- If `dur_secs` in `data_pipes` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `data_pipes.dur_secs`.)_
- The `dur_secs` column in `data_pipes` is heavily skewed toward a few popular values. How would data skew affect parallel execution of your query? _(Tests understanding of skew in `data_pipes.dur_secs` and its impact on distributed query performance.)_
- Your conditional CASE logic assumes the categories are exhaustive. What happens if a row in `data_pipes` falls into none of the branches? _(Tests awareness of the implicit ELSE NULL in CASE expressions.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/pipeline_overhead_by_environment)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.