# Successful Build Duration by Repository

> CI throughput, repo by repo.

Canonical URL: <https://datadriven.io/problems/successful_build_duration_by_repository>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

For each repository, sum the total duration of successful builds in January 2026. Repositories with no successful builds should still appear with a total of zero. Return the repository name and total duration in seconds.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests whether a candidate can demonstrate recognizing when a self-join is the right pattern, preserving all rows with outer joins, and handling nulls safely in expressions. This pattern appears frequently in mid-level SQL rounds where interviewers want to see structured thinking.

> **Trick to Solving**
>
> When the prompt asks for pairs or combinations within the same table, a self-join is the pattern.
> 
> 1. Alias the table twice (e.g., `a` and `b`)
> 2. Join on the shared attribute (region, user, etc.)
> 3. Add `a.id < b.id` to avoid duplicate and self-pairs

---

### Break down the requirements

#### Step 1: Self-join the table to pair rows

Join `ci_builds` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 2: Aggregate by `repos.repo_name`

`GROUP BY repos.repo_name` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

#### Step 4: Use a subquery to find the reference value

The scalar subquery computes a single value (like the maximum) that the outer query filters against. This avoids a self-join.

---

### The solution

**Left join with conditional sum to retain zero-success repos**

```sql
SELECT repos.repo_name, COALESCE(SUM(cb.dur_secs), 0) AS total_dur_secs
FROM (SELECT DISTINCT repo_name FROM ci_builds) repos
LEFT
JOIN ci_builds cb ON repos.repo_name = cb.repo_name AND cb.status = 'success' AND cb.built_at BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY repos.repo_name
```

> **Cost Analysis**
>
> With ~2M rows, the GROUP BY reduces the working set before any downstream operations; the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you choose the correct join type to avoid silently dropping rows; how you handle NULL values and whether you account for them in filters and aggregations; how you prevent duplicate pairs and ensure the join condition is correct.

> **Common Pitfall**
>
> Using INNER JOIN instead of LEFT JOIN drops rows with no match, producing an incomplete result. The prompt usually hints at this with 'all' or 'even if no'.

---

## Common follow-up questions

- If you use SUM(CASE WHEN status = 'success' THEN dur_secs END), what value appears for a repo that had no successful builds, and how do you coerce it to zero? _(Tests knowledge that SUM over no matching rows returns NULL, requiring COALESCE.)_
- How would you restrict to January of a specific year when built_at is stored as TEXT rather than a proper date type? _(Tests string-based date filtering and awareness of format assumptions.)_
- If two builds for the same repo overlap in time, does summing dur_secs overcount actual wall-clock duration? _(Tests understanding of the difference between summed durations and elapsed time for parallel builds.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/successful_build_duration_by_repository)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.