# Average Build Duration by Repo

> Some repos build fast. Others don't.

Canonical URL: <https://datadriven.io/problems/average_build_duration_by_repo>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Slow builds are blocking deploys and the CI/CD team suspects a few repos are dragging down the pipeline. Show the average build duration in seconds for each repository so the team can identify the worst offenders.

## Worked solution and explanation

### Why this problem exists in real interviews

`ci_builds.status` is sitting right there in the schema and the prompt never tells you what to do with it. A cancelled build that died at second 4 and a timeout-killed build that ran 90 minutes both live in the same `dur_secs` column. Whoever wrote this prompt wants to hear you flag that before you blame any repo.

---

### Break down the requirements

#### Step 1: Group by repo, not by branch

`repo_name` is the grain. `branch` and `trigger` are noise for this question, though splitting by `trigger` (push vs cron vs PR) is the natural follow up if main is slow.

#### Step 2: Pick the duration aggregate

`AVG(dur_secs)` matches the prompt verbatim. It silently drops NULL durations (in-flight builds) which is fine here, but build times are long-tailed so the mean lands above the median.

#### Step 3: Decide what counts as a build

If `status` includes `cancelled` and `failed`, your average mixes two populations: real work and abandoned work. Either filter to `status = 'success'` or report the average per status.

---

### The solution

**AVERAGE BUILD DURATION BY REPO**

```sql
SELECT repo_name, AVG(dur_secs) AS avg_duration
FROM ci_builds
GROUP BY repo_name
```

> **Cost Analysis**
>
> Single hash aggregate over 2M rows, one pass, no join. Modern engines finish this in seconds. If `repo_name` cardinality is low (hundreds of repos), the hash table fits in memory trivially and no spill is needed.

> **Interviewers Watch For**
>
> Whether you ask about `status` before writing the query. The CI/CD team wants to find slow repos, not repos with flaky cancellations. Naming the long tail (p95 vs mean) and offering `PERCENTILE_CONT` as a follow up reads as production experience.

> **Common Pitfall**
>
> Writing `SUM(dur_secs) / COUNT(*)` and assuming it equals `AVG`. It does only when no `dur_secs` is NULL. `AVG` divides by non-null count; manual division divides by row count and quietly pulls the mean down when builds are still running.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you report the 95th percentile build duration per repo instead of the mean? _(Tests whether you reach for `PERCENTILE_CONT` or `APPROX_PERCENTILE` and understand why p95 matters for CI tail latency.)_
- How would you split the average by `status` so we can compare successful vs failed build times? _(Probes conditional aggregation with `AVG(CASE WHEN status = 'success' THEN dur_secs END)`.)_
- How would you flag repos whose average duration jumped week over week? _(Tests window functions over a `DATE_TRUNC('week', built_at)` grouping and `LAG` for the delta.)_

> **In practice**
>
> CI dashboards almost always show median plus p95, not the mean. Means hide bimodal distributions where half the builds finish in 90 seconds and the other half hit a 30 minute integration suite.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_build_duration_by_repo)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.