# Most Active Recent Committers

> Who has been writing the most code lately?

Canonical URL: <https://datadriven.io/problems/most_active_recent_committers>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

We keep a commits table with author and commit date. Who are the top 10 most active committers over the past 2 years relative to the latest commit in the dataset? Show each author and their commit count, sorted from most active to least.

## Worked solution and explanation

### Why this problem exists in real interviews

Engineering productivity dashboards constantly ask 'who shipped the most recently?' but anchored to the data, not to wall-clock time, because backfills and stale snapshots make `CURRENT_DATE` lie. Interviewers use this prompt to test whether you can build a dynamic cutoff from `MAX(commit_at)` inside a scalar subquery, then apply it as a WHERE filter and finish with the standard top-N aggregation.

---

### Break down the requirements

#### Step 1: Anchor the window to the data, not to today

`commit_at` is TEXT, so use `strftime('%Y', commit_at)` to extract the year and `CAST(... AS INTEGER)` to compare arithmetically. The cutoff is `MAX(year) - 2` computed in a scalar subquery against `repo_commits` itself, which makes the window slide with the data instead of breaking on a stale snapshot.

#### Step 2: Filter rows whose year is in the window

`WHERE CAST(strftime('%Y', commit_at) AS INTEGER) >= (cutoff)` keeps every commit at or after the cutoff year. Because the comparison wraps `commit_at` in `strftime`, the planner cannot use a B-tree index on `commit_at` directly; that is acceptable here because the prompt is correctness-first, not throughput-tuned.

#### Step 3: Group, count, sort, limit

`GROUP BY author` collapses to one row per of the 200 authors with the count of commits in the window. `ORDER BY commit_count DESC` puts the most active first, and `LIMIT 10` returns the top ten. Two columns out: `author` and `commit_count`.

---

### The solution

**Dynamic year cutoff via scalar subquery, then top-10 by author**

```sql
SELECT author, COUNT(*) AS commit_count FROM repo_commits WHERE CAST(strftime('%Y', commit_at) AS INTEGER) >= (SELECT CAST(strftime('%Y', MAX(commit_at)) AS INTEGER) - 2 FROM repo_commits) GROUP BY author ORDER BY commit_count DESC LIMIT 10
```

> **Cost Analysis**
>
> `repo_commits` is 5M rows with no partitioning. The scalar subquery runs once: a single MAX scan over `commit_at` (which the engine may shortcut with an index, otherwise a full scan). The outer query then does a second full scan applying the year filter, hash-aggregates over 200 distinct `author` values, and returns 10 rows. Two scans of a 5M-row table are acceptable; in production you would either materialize the cutoff or partition by year.

> **Interviewers Watch For**
>
> Interviewers note whether you anchor the cutoff to `MAX(commit_at)` from the data (correct) versus `CURRENT_DATE - INTERVAL '2 years'` (which silently breaks on stale dumps). They also watch whether you cast the year to INTEGER before subtracting, and whether you remember `LIMIT 10` rather than returning all 200 authors and assuming the caller will trim.

> **Common Pitfall**
>
> Comparing year strings (`strftime('%Y', commit_at) >= '2022'`) accidentally works for four-digit years but breaks the moment the engine decides to coerce one side to INTEGER. The deeper trap is using `date('now', '-2 years')`: the dataset's latest commit may be from 2024 while `now` is 2026, so the filter excludes everything and the result is empty.

---

## Common follow-up questions

- How would you change the query so the window is the last 24 months relative to `MAX(commit_at)` rather than the last 2 calendar years? _(Tests whether the candidate switches from `strftime('%Y', ...)` arithmetic to a date-shift on `commit_at` itself, e.g. `commit_at >= date(MAX(commit_at), '-24 months')`, and recognizes the difference in boundary semantics.)_
- How would you tie-break two authors with the same `commit_count` so the top-10 ordering is reproducible? _(Tests deterministic ordering. Add `, author ASC` to the `ORDER BY` so collisions break alphabetically; otherwise repeat runs may swap rows 9 and 10.)_
- How would you also report each author's `repo_name` count to surface contributors who span many repos versus those locked to one? _(Tests whether the candidate adds `COUNT(DISTINCT repo_name) AS repo_breadth` alongside `COUNT(*)` without breaking the GROUP BY, and reasons about the 70 distinct `repo_name` values.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/most_active_recent_committers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.