# Build Success vs Failure by Repo

> Green versus red, repo by repo.

Canonical URL: <https://datadriven.io/problems/build_success_vs_failure_by_repo>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The engineering lead needs a per-repo build health snapshot before freezing deploys. For each repository, show how many builds have a status of 'success', how many have a status of 'failure', and the average build duration, ordered alphabetically by repo.

## Worked solution and explanation

### What this is really asking

`status` carries more than two values (success, failure, plus running, canceled, timeout), but `avg_duration` is over every build. The conditional aggregation has to count two values and average across all of them in one pass.

---

### Break down the requirements

#### Step 1: Group once per repo

Single `GROUP BY repo_name` over the 3M row table. Every metric lives in the same aggregate, no joins or subqueries.

#### Step 2: Branch the counts with CASE

`SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END)` and the mirror for 'failure'. The ELSE 0 keeps in-flight and canceled builds in the table without inflating either count.

#### Step 3: Average duration over everything

`AVG(dur_secs)` runs across all statuses for the repo. No `WHERE` clause filters them out; the engineering lead wants total throughput, not just terminal builds.

---

### The solution

**BUILD SUCCESS VS FAILURE BY REPO**

```sql
SELECT repo_name,
       SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS success_count,
       SUM(CASE WHEN status = 'failure' THEN 1 ELSE 0 END) AS failure_count,
       AVG(dur_secs) AS avg_duration
FROM ci_builds
GROUP BY repo_name
ORDER BY repo_name;
```

> **Cost Analysis**
>
> One sequential scan, one hash aggregate keyed by repo_name. CASE evaluates per row but allocates nothing extra. Sort for `ORDER BY` runs on the small grouped result, not the 3M row input.

> **Interviewers Watch For**
>
> Whether you reach for conditional aggregation or write two separate queries and join them. Joining on repo_name produces the same numbers but doubles the scan and drops repos missing either status.

> **Common Pitfall**
>
> Using `COUNT(CASE WHEN status = 'success' THEN 1 END)` works because COUNT ignores NULL, but switching to `COUNT(CASE WHEN ... THEN 1 ELSE 0 END)` quietly counts every row since 0 is non-null. Pick `SUM` with `ELSE 0` or `COUNT` with no ELSE.

> **The False Start**
>
> First instinct is `WHERE status IN ('success', 'failure')` to narrow down before aggregating. That gives correct counts but changes `AVG(dur_secs)` to exclude canceled and running builds, shifting the duration mean. Pivot to no `WHERE` clause and let CASE branch the counts.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you add a success rate column? _(Divide success_count by `COUNT(*)` or by `(success_count + failure_count)` depending on whether in-flight builds belong in the denominator; cast to a decimal to avoid integer division.)_
- What if the lead only cares about the main branch? _(Add `WHERE branch = 'main'` before the GROUP BY; the conditional aggregation pattern stays unchanged and still handles every status.)_
- How would you flag repos where the failure count tripled week over week? _(Add `DATE_TRUNC('week', built_at)` to the grouping, then `LAG(failure_count)` partitioned by repo ordered by week and filter where the ratio exceeds three.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/build_success_vs_failure_by_repo)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.