# Latest Commit Build Cost

> The latest commit came with a build cost.

Canonical URL: <https://datadriven.io/problems/latest_commit_build_cost>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

We need a CI cost attribution model. For each commit author, identify the repo they most recently committed to that also has CI build history. Then total up all build durations for that repo. Return each author and their total build seconds, alphabetically by author.

## Worked solution and explanation

### What this is really asking

Per author, pick one repo (the one they touched most recently that also appears in ci_builds), then sum every dur_secs row for that repo. The author barely matters in the SUM; the repo carries the cost.

---

### Break down the requirements

#### Step 1: Filter to repos with CI history

Restrict repo_commits to rows where repo_name appears in ci_builds. A subquery IN list reads cleanly here, but a semi-join (EXISTS or INNER JOIN on distinct repos) does the same work.

#### Step 2: Pick latest commit per author

ROW_NUMBER() OVER (PARTITION BY author ORDER BY commit_at DESC). Keep rn = 1. You need repo_name on the same row, so a windowed dedup, not a GROUP BY MAX, is the right shape.

#### Step 3: Sum build durations for that repo

Join the (author, repo_name) winner back to ci_builds on repo_name, SUM(dur_secs), GROUP BY author, ORDER BY author.

---

### The solution

**LATEST COMMIT BUILD COST**

```sql
WITH latest_commits AS (
  SELECT author, repo_name,
         ROW_NUMBER() OVER (PARTITION BY author ORDER BY commit_at DESC) AS rn
  FROM repo_commits
  WHERE repo_name IN (SELECT repo_name FROM ci_builds)
)
SELECT lc.author, SUM(cb.dur_secs) AS total_build_seconds
FROM latest_commits lc
JOIN ci_builds cb ON lc.repo_name = cb.repo_name
WHERE lc.rn = 1
GROUP BY lc.author
ORDER BY lc.author;
```

> **Cost Analysis**
>
> 5M commits partitioned by author and sorted by commit_at, so the window scan dominates. The semi-join to ci_builds prunes early. Final join hits ci_builds (3M) once per winning repo. Index commit_at and repo_name.

> **Interviewers Watch For**
>
> Whether you notice that total_build_seconds is per repo, not per author. Two authors whose latest repo is the same repo will both report identical totals. Call that out before they ask.

> **Common Pitfall**
>
> Putting WHERE rn = 1 in the same SELECT as the ROW_NUMBER call. Window functions resolve after WHERE, so the predicate sees nothing. Wrap in a CTE (or subquery) and filter outside, as shown.

> **The False Start**
>
> First instinct is GROUP BY author with MAX(commit_at), then re-join to repo_commits to pull repo_name. Symptom: ties on commit_at duplicate authors and the join fans out. Pivot to ROW_NUMBER which keeps repo_name on the winning row and breaks ties deterministically.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you break ties when an author has two commits at the exact same commit_at on different repos? _(ROW_NUMBER picks arbitrarily; add a secondary ORDER BY (repo_name, commit_id) to make it deterministic.)_
- What if we only want builds triggered after that latest commit? _(Pass commit_at through the CTE and add cb.built_at >= lc.commit_at to the join predicate.)_
- How would you attribute cost when the author also committed to other CI repos earlier? _(Switch from picking one repo to summing dur_secs across every (author, repo) pair the author touched, weighted by commit share.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/latest_commit_build_cost)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.