# Repository Commit Ranking

> Lines added tell the story of a repo's ambition.

Canonical URL: <https://datadriven.io/problems/repository_commit_ranking>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Sum the lines added per repository across all commits, then rank them with no gaps in ranking. Show repo name, total lines added, and rank from highest to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

This challenge asks you to apply DENSE_RANK for gapless ranking to the `repo_commits` table, simulating a real analytics workflow. Pay attention to the `repo_name` column as they drive the aggregation and output.

> **Trick to Solving**
>
> When the prompt mentions 'top N' or 'rank', check whether ties should be included. If yes, use `DENSE_RANK()` instead of `ROW_NUMBER()` or `LIMIT`.
> 
> 1. Spot tie-inclusion language in the prompt
> 2. Aggregate to the correct grain first
> 3. Apply `DENSE_RANK()` over the aggregated result

---

### Break down the requirements

#### Step 1: Aggregate by `repo_name`

`GROUP BY repo_name` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 2: Rank results with a window function

`DENSE_RANK()` assigns consecutive ranks without gaps. Unlike `ROW_NUMBER()`, tied values share the same rank, which matters when the prompt asks for ties.

#### Step 3: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

---

### The solution

**Dense-rank for repository commit ranking**

```sql
SELECT repo_name, SUM(added) AS total_added, DENSE_RANK() OVER (ORDER BY SUM(added) DESC) AS rank
FROM repo_commits
GROUP BY repo_name
ORDER BY rank
```

> **Cost Analysis**
>
> With ~4M rows, the GROUP BY reduces the working set before any downstream operations; the window function runs on the reduced set after filtering and grouping. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for which ranking function you choose and whether you can explain the difference between `ROW_NUMBER`, `RANK`, and `DENSE_RANK`.

> **Common Pitfall**
>
> Using `ROW_NUMBER()` when the prompt requires tie inclusion silently drops tied rows. Read the prompt for language about ties or 'include all'.

---

## Common follow-up questions

- What would happen to your result if `repo_commits.commit_at` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `commit_at` and uses DISTINCT or deduplication where needed.)_
- With 4,000,000 distinct values in `repo_commits.commit_id`, how would a composite index on the GROUP BY columns change the execution plan? _(Probes understanding of how cardinality in `commit_id` affects grouping and sort operations.)_
- If you switched from DENSE_RANK to ROW_NUMBER, how would the output change for tied values in the ORDER BY? _(Tests understanding of the behavioral differences between ranking window functions.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/repository_commit_ranking)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.