# Annual Pipeline Failures

> How many pipelines broke this year?

Canonical URL: <https://datadriven.io/problems/annual_pipeline_failures>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The data engineering team is reviewing historical reliability for the 'etl_users' pipeline. Count the number of failed runs per year, excluding any runs without a recorded start time. Present results from the earliest year to the latest.

## Worked solution and explanation

### Why this problem exists in real interviews

Interviewers use the `data_pipes` table here to probe grouped aggregation. The columns `pipe_name`, `status`, and `start_at` each map to one requirement (the pipeline filter, the failure filter, and the yearly grain), so candidates have to reason about the correct grain before writing any aggregation.

---

### Break down the requirements

#### Step 1: Filter to qualifying rows

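The `WHERE` clause narrows to relevant rows before grouping: `pipe_name = 'etl_users'` selects the pipeline under review, `status = 'failed'` keeps only failed runs, and `start_at IS NOT NULL` drops runs without a recorded start time.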
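The problem statement implies three filter conditions: the pipeline name, the failed status, and a non-NULL start time. The sketch below checks them against a few made-up rows. It assumes SQLite (via Python's built-in `sqlite3`) and a hypothetical three-column version of `data_pipes`; the real table likely has more columns.

```python
import sqlite3

# Hypothetical sample rows; the real data_pipes schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_pipes (pipe_name TEXT, status TEXT, start_at TEXT)")
conn.executemany(
    "INSERT INTO data_pipes VALUES (?, ?, ?)",
    [
        ("etl_users", "failed", "2022-03-01 02:00:00"),   # qualifies
        ("etl_users", "success", "2022-03-02 02:00:00"),  # wrong status
        ("etl_orders", "failed", "2022-03-03 02:00:00"),  # wrong pipeline
        ("etl_users", "failed", None),                    # no start time
    ],
)

qualifying = conn.execute(
    """
    SELECT pipe_name, status, start_at
    FROM data_pipes
    WHERE pipe_name = 'etl_users'
      AND status = 'failed'
      AND start_at IS NOT NULL
    """
).fetchall()
print(qualifying)
```

Only the first row survives, since each of the other three rows violates exactly one condition.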

#### Step 2: Group by `STRFTIME('%Y', start_at)`

`STRFTIME('%Y', start_at)` extracts the four-digit year from `start_at`, so grouping by it produces one row per calendar year.
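To see what the grouping key looks like, the following sketch evaluates `STRFTIME('%Y', ...)` on literal timestamps. It assumes SQLite, where `STRFTIME` is a built-in function; the timestamp values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# STRFTIME('%Y', ...) returns the year as a four-character string,
# and returns NULL when its input is NULL.
years = conn.execute(
    """
    SELECT
        STRFTIME('%Y', '2022-03-01 02:00:00'),
        STRFTIME('%Y', '2023-11-15 08:30:00'),
        STRFTIME('%Y', NULL)
    """
).fetchone()
print(years)
```

Because the key is a string of fixed width, sorting it as text also sorts it chronologically.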

#### Step 3: Compute `COUNT(*)`

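`COUNT(*)` counts the rows in each group. Each row that survives the filter is one failed run, so the count is the number of failures for that year.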
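One detail worth knowing here is that `COUNT(*)` counts rows, while `COUNT(column)` skips rows where that column is NULL. A minimal sketch, assuming SQLite and a hypothetical two-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_pipes (status TEXT, start_at TEXT)")
conn.executemany(
    "INSERT INTO data_pipes VALUES (?, ?)",
    [
        ("failed", "2022-01-05 01:00:00"),
        ("failed", "2022-06-09 01:00:00"),
        ("failed", None),  # failed run with no recorded start time
    ],
)

# COUNT(*) counts every row; COUNT(start_at) ignores the NULL start_at.
count_all, count_started = conn.execute(
    "SELECT COUNT(*), COUNT(start_at) FROM data_pipes WHERE status = 'failed'"
).fetchone()
print(count_all, count_started)
```

With an explicit `start_at IS NOT NULL` filter in place, the two forms return the same number.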

#### Step 4: Order by year

Sort by `year` ascending so the results run from the earliest year to the latest, as the problem requires.

---

### The solution

**Group-aggregate for annual pipeline failures**

```sql
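SELECT
    STRFTIME('%Y', start_at) AS year,
    COUNT(*) AS cnt
FROM data_pipes
WHERE pipe_name = 'etl_users'      -- pipeline under review
  AND status = 'failed'            -- failed runs only
  AND start_at IS NOT NULL         -- exclude runs with no recorded start time
GROUP BY STRFTIME('%Y', start_at)
ORDER BY year ASC                  -- earliest year first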
```
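As a sanity check, here is a runnable sketch of a query that meets the problem statement (only `etl_users`, only failed runs, no NULL start times, earliest year first), executed against invented sample rows. It assumes SQLite through Python's `sqlite3` module and a hypothetical minimal schema for `data_pipes`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_pipes (pipe_name TEXT, status TEXT, start_at TEXT)")
conn.executemany(
    "INSERT INTO data_pipes VALUES (?, ?, ?)",
    [
        ("etl_users", "failed", "2023-02-02 02:00:00"),
        ("etl_users", "failed", "2022-01-10 02:00:00"),
        ("etl_users", "failed", "2022-07-20 02:00:00"),
        ("etl_users", "failed", "2021-05-01 02:00:00"),
        ("etl_users", "success", "2022-03-03 02:00:00"),  # not a failure
        ("etl_orders", "failed", "2021-09-09 02:00:00"),  # other pipeline
        ("etl_users", "failed", None),                    # no start time
    ],
)

result = conn.execute(
    """
    SELECT
        STRFTIME('%Y', start_at) AS year,
        COUNT(*) AS cnt
    FROM data_pipes
    WHERE pipe_name = 'etl_users'
      AND status = 'failed'
      AND start_at IS NOT NULL
    GROUP BY STRFTIME('%Y', start_at)
    ORDER BY year ASC
    """
).fetchall()
print(result)
```

The three excluded rows each fail one condition, leaving one failure in 2021, two in 2022, and one in 2023.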

> **Cost Analysis**
>
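> The main table has 50K rows. The `WHERE` filter discards non-qualifying rows before grouping, and the `GROUP BY` then collapses the remainder to one row per year, so the final sort touches only a handful of rows.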

> **Interviewers Watch For**
>
> Strong candidates state the correct `GROUP BY` grain before writing any SQL, showing they think about the output shape first.

> **Common Pitfall**
>
> Selecting a non-aggregated column without including it in `GROUP BY` is the most common error. Some engines reject it; others silently return arbitrary values.
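The pitfall can be reproduced in SQLite, which is one of the engines that accepts a bare column in an aggregate query. The sketch below uses invented rows and a hypothetical minimal schema; which `pipe_name` the loose query returns is not something to rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_pipes (pipe_name TEXT, status TEXT, start_at TEXT)")
conn.executemany(
    "INSERT INTO data_pipes VALUES (?, ?, ?)",
    [
        ("etl_users", "failed", "2022-01-10 02:00:00"),
        ("etl_orders", "failed", "2022-07-20 02:00:00"),
    ],
)

# pipe_name is neither aggregated nor grouped. SQLite accepts the query
# and takes pipe_name from an arbitrary row of each group.
loose = conn.execute(
    """
    SELECT STRFTIME('%Y', start_at) AS year, pipe_name, COUNT(*) AS cnt
    FROM data_pipes
    GROUP BY STRFTIME('%Y', start_at)
    """
).fetchall()

# Adding pipe_name to GROUP BY makes the grain explicit: one row per
# (year, pipe_name) pair.
strict = conn.execute(
    """
    SELECT STRFTIME('%Y', start_at) AS year, pipe_name, COUNT(*) AS cnt
    FROM data_pipes
    GROUP BY STRFTIME('%Y', start_at), pipe_name
    ORDER BY pipe_name
    """
).fetchall()
print(loose)
print(strict)
```

The loose query reports a count of 2 next to a single pipeline name, which misleadingly suggests both failures belong to that pipeline.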

---

## Common follow-up questions

- The `start_at` column in `data_pipes` has roughly 2% NULLs. How does your query handle those rows, and would the result change if NULLs were replaced with zeros? _(Tests whether the candidate understands how NULLs propagate through aggregation functions and whether their WHERE/JOIN conditions implicitly filter them out.)_
- Your query counts rows per year from `data_pipes`. If the output were ordered by the count and two years had the same count, how would the ties be ordered, and is that deterministic? _(Tests awareness that ORDER BY on a non-unique value produces non-deterministic row order without a tiebreaker.)_
- The `pipe_name` column in `data_pipes` follows a Zipf distribution, meaning a few values dominate. How does that skew affect your query plan and parallelism? _(Tests understanding of data skew: the optimizer may choose a bad plan when histogram statistics are stale.)_
- If `pipe_id` in `data_pipes` contained negative values, would your query still produce correct results? _(Tests whether the candidate validated assumptions about the domain of numeric columns.)_
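The NULL and tiebreaker questions above can be explored with a small experiment. This is a sketch under assumptions: it uses SQLite, invented rows, and a hypothetical minimal schema, and other engines may sort NULLs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_pipes (pipe_name TEXT, status TEXT, start_at TEXT)")
conn.executemany(
    "INSERT INTO data_pipes VALUES (?, ?, ?)",
    [
        ("etl_users", "failed", "2022-01-10 02:00:00"),
        ("etl_users", "failed", "2021-05-01 02:00:00"),
        ("etl_users", "failed", None),  # no recorded start time
    ],
)

# Without the NULL filter, the NULL start_at forms its own group whose
# year is NULL; SQLite sorts NULL before other values in ascending order.
without_filter = conn.execute(
    """
    SELECT STRFTIME('%Y', start_at) AS year, COUNT(*) AS cnt
    FROM data_pipes
    WHERE pipe_name = 'etl_users' AND status = 'failed'
    GROUP BY STRFTIME('%Y', start_at)
    ORDER BY year ASC
    """
).fetchall()

# Ordering by count alone leaves ties unresolved; adding year as a
# secondary key makes the order deterministic.
with_tiebreaker = conn.execute(
    """
    SELECT STRFTIME('%Y', start_at) AS year, COUNT(*) AS cnt
    FROM data_pipes
    WHERE pipe_name = 'etl_users' AND status = 'failed'
      AND start_at IS NOT NULL
    GROUP BY STRFTIME('%Y', start_at)
    ORDER BY cnt DESC, year ASC
    """
).fetchall()
print(without_filter)
print(with_tiebreaker)
```

Both years tie on a count of 1 here, so only the secondary sort key guarantees that 2021 comes before 2022.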

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/annual_pipeline_failures)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.