# Low Throughput Pipelines

> Pipelines barely moving data.

Canonical URL: <https://datadriven.io/problems/low_throughput_pipelines>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The data platform team suspects some pipeline runs are producing suspiciously low output. Surface every individual run with fewer than 2000 rows output, showing the pipeline name and row count from highest to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

The data_pipes table contains pipe_name and status values that must be processed with query construction. This appears as a fundamentals check to probe whether you reason about the correct aggregation grain before writing any window or GROUP BY clause.

---

### Break down the requirements

#### Step 1: Filter to the target rows

Apply the `WHERE` filter to restrict the working set before aggregation. Filtering early reduces the number of rows that downstream operations process.

#### Step 2: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Below-threshold filter with descending sort**

```sql
SELECT pipe_name, rows_out
FROM data_pipes
WHERE rows_out < 2000
ORDER BY rows_out DESC
```

> **Cost Analysis**
>
> The query scans 70K rows from `data_pipes`. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Returning more columns than the prompt asks for can trigger a "wrong schema" failure in automated grading. Match the output specification exactly.

---

## Common follow-up questions

- What happens to your result if data_pipes.start_at contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on start_at.)_
- How would you verify that your aggregation on data_pipes.pipe_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- If data_pipes grows to hundreds of millions of rows, how would you partition or index on start_at to maintain performance? _(Tests partitioning strategy for time-series data in start_at.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/low_throughput_pipelines)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.