# Total Rows by Pipeline Status

> Row counts alongside pipeline aggregates.

Canonical URL: <https://datadriven.io/problems/total_rows_by_pipeline_status>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The data platform team wants each pipeline run row enriched with a status-level benchmark: the total rows ingested across all pipelines sharing the same status. Preserve every original column and add the status-wide total as an additional column.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests whether you know when to use a window function instead of GROUP BY. The prompt asks for a status-level aggregate appended to every row, which is exactly what `SUM() OVER (PARTITION BY ...)` does.

> **Trick to Solving**
>
> "Preserve every original column and add the status-wide total" is the signal for a window function instead of GROUP BY. GROUP BY would collapse rows; a window function enriches each row.
> 
> 1. Spot that the output has the same row count as the input
> 2. Use `SUM(rows_in) OVER (PARTITION BY status)` to compute the status-level total
> 3. Select all original columns plus the new window column

---

### Break down the requirements

#### Step 1: Identify the enrichment pattern

The output must have all original columns from `data_pipes` (80K rows) plus one new column. This means no GROUP BY, just a window function.

#### Step 2: Compute status-level total as a window

`SUM(rows_in) OVER (PARTITION BY status)` computes the total `rows_in` for each status and attaches it to every row sharing that status.

---

### The solution

**Window function for row-level enrichment**

```sql
SELECT
    pipe_id,
    pipe_name,
    status,
    rows_in,
    rows_out,
    start_at,
    dur_secs,
    SUM(rows_in) OVER (PARTITION BY status) AS status_total_rows_in
FROM data_pipes
```

> **Cost Analysis**
>
> With only 80K rows and 4 distinct statuses, the window function partitions into 4 groups of ~20K rows each. This is very fast with no sorting required (SUM does not need ORDER BY).

> **Interviewers Watch For**
>
> Candidates who use GROUP BY and then join back to the original table. While correct, this is more complex and slower than a direct window function.

> **Common Pitfall**
>
> Adding ORDER BY inside the OVER clause. `SUM(...) OVER (PARTITION BY status ORDER BY ...)` produces a running sum, not a partition-level total. Omit ORDER BY for the full partition aggregate.

---

## Common follow-up questions

- What if you needed a running total within each status ordered by start time? _(Tests adding ORDER BY start_at to the window frame for cumulative sums.)_
- How would you also include the overall total across all statuses? _(Add a second window: `SUM(rows_in) OVER ()` with no partition.)_
- What if rows_in had NULL values? _(SUM ignores NULLs, so the total would exclude them. COALESCE might be needed.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/total_rows_by_pipeline_status)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.