# Top Commit Authors by Repo

> Three authors per repo. The top committers.

Canonical URL: <https://datadriven.io/problems/top_commit_authors_by_repo>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Find the top 3 commit authors in each repo by average lines added. In case of a tie, break by author name alphabetically. Show each repo with its best, second-best, and third-best authors. If a position is unfilled, label it accordingly (e.g., 'No second author'). Format each entry as 'author (avg_added)'. Round to 1 decimal place.

## Worked solution and explanation

### Why this problem exists in real interviews

Engineering-productivity teams use this to probe whether you reach for window functions instead of correlated subqueries for top-N-per-group. The other signal: did you add a deterministic tiebreaker? Without one, a 'top author' query returns different authors between runs when two engineers tie on average lines added, and code review hates non-determinism.

---

### Break down the requirements

#### Step 1: Aggregate to (repo, author)

`GROUP BY repo_name, author` with `AVG(added) AS avg_added`. This is the expensive step. Everything downstream runs on the grouped result, not raw rows.

#### Step 2: Rank inside each repo

`ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY AVG(added) DESC, author ASC)`. The `author ASC` tiebreak is in the prompt; without it the winner shifts between runs.

#### Step 3: Pivot ranks 1, 2, 3 to columns

Conditional aggregation: `MAX(CASE WHEN rn = 1 THEN ... END)`. One row per repo, three named columns. Wrap each in `COALESCE` for the unfilled-slot label.

#### Step 4: Format the cell

Concatenate `author || ' (' || CAST(ROUND(avg_added, 1) AS TEXT) || ')'`. Round before cast so the text already carries one decimal.

---

### The solution

**TOP-N-PER-GROUP WITH PIVOT**

```sql
WITH author_stats AS (
  SELECT repo_name, author, AVG(added) AS avg_added,
         ROW_NUMBER() OVER (PARTITION BY repo_name
                            ORDER BY AVG(added) DESC, author ASC) AS rn
  FROM repo_commits
  GROUP BY repo_name, author
)
SELECT repo_name,
       COALESCE(MAX(CASE WHEN rn = 1 THEN author || ' (' || CAST(ROUND(avg_added, 1) AS TEXT) || ')' END), 'No first author')  AS best_author,
       COALESCE(MAX(CASE WHEN rn = 2 THEN author || ' (' || CAST(ROUND(avg_added, 1) AS TEXT) || ')' END), 'No second author') AS second_best_author,
       COALESCE(MAX(CASE WHEN rn = 3 THEN author || ' (' || CAST(ROUND(avg_added, 1) AS TEXT) || ')' END), 'No third author') AS third_best_author
FROM author_stats
WHERE rn <= 3
GROUP BY repo_name
```

> **Cost Analysis**
>
> The 5M-row scan plus hash aggregate on `(repo_name, author)` is the whole bill. The window function runs over the grouped output (orders of magnitude smaller), so it is effectively free. A composite index on `(repo_name, author, added)` lets the planner stream the aggregate without a sort.

> **Interviewers Watch For**
>
> Two questions before you write a clause: what is the AVG over (all commits or a window like last quarter), and how do you want ties handled? Saying `ROW_NUMBER` picks one arbitrarily, `DENSE_RANK` keeps all tied authors at the same slot. The prompt picked alphabetic tiebreak, which forces `ROW_NUMBER`.

> **Common Pitfall**
>
> Writing `ORDER BY AVG(added) DESC` with no secondary key. Tests pass intermittently because the engine returns tied authors in physical-row order. Always add a deterministic tiebreak (here `author ASC`) when the answer is a single name per slot.

**Wrong (correlated subquery)**

`SELECT ... WHERE 3 > (SELECT COUNT(*) FROM author_stats a2 WHERE a2.repo_name = a.repo_name AND a2.avg_added > a.avg_added)` runs the inner query per outer row. Quadratic in authors-per-repo.

**Right (window function)**

`ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY AVG(added) DESC, author ASC)` computes all ranks in one pass over the grouped result. Linear, and the planner can use a single sort.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- What changes if two authors tie on `avg_added` and the spec says keep both? _(Forces a `DENSE_RANK` vs `ROW_NUMBER` discussion and how the pivot breaks when a slot has two names.)_
- How would you restrict the average to commits in the last 90 days? _(Probes whether the candidate filters in the CTE `WHERE` (before the aggregate) versus in the outer query (wrong).)_
- What if you need top 10 per repo, not top 3? _(Tests whether the pivot pattern scales, or whether they pivot to a long format with one row per (repo, rank).)_
- How would you exclude bot authors like `dependabot[bot]` without a hardcoded list? _(Opens up `NOT LIKE '%[bot]'`, a `dim_author` join, or a filter on commit message patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_commit_authors_by_repo)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.