# Yearly Build Duration by Repo

> Build times by repo, year by year.

Canonical URL: <https://datadriven.io/problems/yearly_build_duration_by_repo>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

For each repository and year, show the average CI build duration, sorted from slowest average to fastest.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests date extraction combined with multi-column GROUP BY and ordering. Interviewers check whether you can aggregate metrics across two dimensions (repo and year) and sort by the aggregate.

---

### Break down the requirements

#### Step 1: Extract year from build timestamp

`STRFTIME('%Y', built_at)` extracts the calendar year from each build's timestamp.

#### Step 2: Compute average duration per repo per year

`GROUP BY repo_name, year` with `AVG(dur_secs)` computes the mean build duration for each combination.

#### Step 3: Sort slowest to fastest

`ORDER BY avg_duration DESC` surfaces the slowest repo-year combinations first.

---

### The solution

**Multi-dimension aggregation sorted by performance**

```sql
SELECT
    repo_name,
    STRFTIME('%Y', built_at) AS year,
    AVG(dur_secs) AS avg_duration
FROM ci_builds
GROUP BY repo_name, STRFTIME('%Y', built_at)
ORDER BY avg_duration DESC
```

> **Cost Analysis**
>
> Full scan of 3M rows. The GROUP BY produces roughly 60 repos x N years rows (maybe 180 total). The AVG and sort on this small result set are trivial.

> **Interviewers Watch For**
>
> Handling NULL dur_secs values (2% null fraction). AVG ignores NULLs by default, which is usually the desired behavior for build duration.

> **Common Pitfall**
>
> Using SUM(dur_secs) / COUNT(*) instead of AVG(dur_secs). If dur_secs has NULLs, COUNT(*) counts all rows but SUM ignores NULLs, giving a lower average than AVG which only divides by non-NULL count.

---

## Common follow-up questions

- How would you add the P95 build duration? _(Use PERCENTILE_CONT(0.95) or approximate with a subquery and NTILE.)_
- What if you only wanted repos that got slower year over year? _(Compare consecutive years with LAG on the avg_duration.)_
- How would you filter to only successful builds? _(Add WHERE status = 'success' before grouping.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/yearly_build_duration_by_repo)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.