# Reviewers Per Repo Per Year

> Reviewers per repo, year by year.

Canonical URL: <https://datadriven.io/problems/reviewers_per_repo_per_year>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Some repos share names across orgs, so treat each repo-year combination separately. Show the repo name, year, and reviewer count per group, sorted from most reviewers to least.

## Worked solution and explanation

### Why this problem exists in real interviews

Drawn from a analytics domain, this question centers on date extraction for time bucketing over the `code_reviews` table. The tricky part is handling the `repo_name` column correctly under the given constraints.

---

### Break down the requirements

#### Step 1: Aggregate by `repo_name`

`GROUP BY repo_name, review_year` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 2: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

#### Step 3: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

---

### The solution

**Aggregate by `repo_name` to find reviewers per repo per year**

```sql
SELECT repo_name, CAST(strftime('%Y', opened_at) AS INTEGER) AS review_year, COUNT(DISTINCT reviewer) AS reviewer_count
FROM code_reviews
GROUP BY repo_name, review_year
ORDER BY reviewer_count DESC
```

> **Cost Analysis**
>
> With ~500K rows, the GROUP BY reduces the working set before any downstream operations. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you handle date arithmetic and whether you account for edge cases like month boundaries; whether you know when DISTINCT is needed and when it masks a logic error.

> **Common Pitfall**
>
> Integer division truncates the result silently. Cast at least one operand to DOUBLE before dividing to get a decimal result.

---

## Common follow-up questions

- If `merged` in `code_reviews` is NULL for some rows, how would your aggregation or join logic be affected? _(Probes understanding of NULL propagation through joins and aggregate functions on `code_reviews.merged`.)_
- If `code_reviews` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `opened_at`? _(Tests ability to identify performance hotspots related to `code_reviews.opened_at` at scale.)_
- If the date column in `code_reviews` spans multiple years, does your date extraction logic still produce correct time buckets? _(Tests whether the candidate accounts for year boundaries in date bucketing.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/reviewers_per_repo_per_year)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.