# Top Repos by Commit Volume

> The most active repos in the org. No ties left behind.

Canonical URL: <https://datadriven.io/problems/top_repos_by_commit_volume>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Our CI system tracks builds per repository. For each repo that has CI build history, count total commits and rank them. Return repos in the top 5 tiers by commit volume, where tied repos share the same rank. Show the repository name and commit count.

## Worked solution and explanation

### Why this problem exists in real interviews

They're checking whether you read the prompt carefully enough to spot the qualifying clause. The phrase `each repo that has CI build history` is the trap: it forces a join to `ci_builds` even though the metric (commit count) lives entirely in `repo_commits`. If you skip the join, you'll count commits from repos that never ran CI. They also want to see RANK vs DENSE_RANK reasoning for tie tiers.

---

### Break down the requirements

#### Step 1: Filter to repos with CI history

Join `repo_commits` to `ci_builds` on `repo_name`. The join is a filter, not a metric source. EXISTS would also work and dodges the fan-out concern (more on that in pitfalls).

#### Step 2: Count commits per repo

`COUNT(rc.commit_id) AS commit_count` grouped by `rc.repo_name`. Count the PK, not `*`, so you can defend it if the interviewer asks why.

#### Step 3: Rank with ties

`RANK() OVER (ORDER BY COUNT(rc.commit_id) DESC)`. The prompt says tied repos share a rank, so RANK, not ROW_NUMBER. Compute it in the same SELECT as the GROUP BY.

#### Step 4: Cap to top 5 tiers

Wrap in a CTE, then `WHERE rnk <= 5`. Top 5 tiers means up to 5 distinct ranks, which can be more than 5 rows under ties. That's the point of using RANK.

---

### The solution

**TOP REPOS BY COMMIT VOLUME WITH TIE TIERS**

```sql
WITH repo_counts AS (
  SELECT rc.repo_name,
         COUNT(rc.commit_id) AS commit_count,
         RANK() OVER (ORDER BY COUNT(rc.commit_id) DESC) AS rnk
  FROM repo_commits rc
  INNER JOIN ci_builds cb
    ON rc.repo_name = cb.repo_name
  GROUP BY rc.repo_name
)
SELECT repo_name, commit_count
FROM repo_counts
WHERE rnk <= 5
```

> **Cost Analysis**
>
> 15M commits joined to 10M builds on `repo_name` is the expensive step and it fans out. Each commit row matches every build for that repo. Aggregating after the fan-out still gives the right answer because we GROUP BY repo and COUNT commits, but cost scales with commits times avg builds per repo. EXISTS avoids the blow-up.

> **Interviewers Watch For**
>
> Ask aloud: do they want the count of distinct commits, or count weighted by builds? Then ask whether `top 5 tiers` means top 5 ranks or top 5 rows. The prompt says tiers with ties sharing rank, so RANK and `<= 5`. Saying it back to them earns the point.

> **Common Pitfall**
>
> The fan-out trap: writing `COUNT(*)` after the join inflates every commit by the number of builds for that repo. Use `COUNT(rc.commit_id)` or, better, `COUNT(DISTINCT rc.commit_id)`. Or rewrite the join as `WHERE EXISTS (SELECT 1 FROM ci_builds cb WHERE cb.repo_name = rc.repo_name)`.

**Wrong (fan-out inflates count)**

`SELECT rc.repo_name, COUNT(*) FROM repo_commits rc JOIN ci_builds cb ON rc.repo_name = cb.repo_name GROUP BY rc.repo_name` returns commits times builds per repo.

**Right (EXISTS, no fan-out)**

`SELECT rc.repo_name, COUNT(*) FROM repo_commits rc WHERE EXISTS (SELECT 1 FROM ci_builds cb WHERE cb.repo_name = rc.repo_name) GROUP BY rc.repo_name`. One row per commit, filtered.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Switch to DENSE_RANK. What changes about the result? _(Probes whether you know RANK skips after ties (1,1,3) and DENSE_RANK doesn't (1,1,2), so `<= 5` returns more rows under DENSE_RANK.)_
- Rewrite without the JOIN using EXISTS. Why might that be faster here? _(Tests whether you spot the fan-out cost and know EXISTS short-circuits per repo.)_
- Add a 90-day window on `commit_at` and `built_at`. Where do the predicates go? _(Checks that you put each predicate on its own table to keep partition pruning effective.)_
- What if `repo_name` is not unique across orgs? _(Probes whether you'd join on a composite key (`org_id, repo_name`) and surface the data model gap.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_repos_by_commit_volume)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.