# Early Commit Velocity by Author

> How productive was each author during the first year of a repo's CI pipeline

Canonical URL: <https://datadriven.io/problems/early_commit_velocity_by_author>

Domain: SQL · Difficulty: medium · Seniority: L6

## Problem

Engineering management is studying contributor patterns during the first year of every repo's life on CI. The clock starts at a repo's very first CI build and runs for one year. For each author who committed inside that window, report the typical size of their commits (average lines added) and how long after the first build their commits typically landed.

## Worked solution and explanation

### Why this problem exists in real interviews

Two tables, a per-group anchor date, and a date-window join. The interviewer is checking that you can compute a per-repo anchor (`MIN(built_at)`), join it back to the commits, filter to a 365-day window, and aggregate two distinct metrics in the same `GROUP BY` pass.

> **Trick to Solving**
>
> The early window is **defined per repo**, not globally.
> 
> 1. Anchor: `MIN(built_at)` per `repo_name` from `ci_builds`
> 2. Window: commits whose `commit_at` is within 365 days of that repo's anchor
> 3. Output: per author, two AVGs (lines added, days after anchor)

---

### Break down the requirements

#### Step 1: Anchor: each repo's earliest build

Build a CTE `repo_start` that groups `ci_builds` by `repo_name` and takes `MIN(built_at)` as `first_build`. This is the per-repo anchor date.

#### Step 2: Filter to commits within the first year

Join `repo_commits` to `repo_start` on `repo_name`, then filter `commit_at <= datetime(first_build, '+365 days')`. Use `commit_at` (the actual timestamp), never `message` (the commit message text). SQLite's `datetime(..., '+365 days')` returns the anchor plus 365 days; `julianday(...)` returns a day number you can subtract.

#### Step 3: Aggregate per author

Group by `author` and project two AVGs: `AVG(rc.added)` for average lines added per commit, and `AVG(julianday(rc.commit_at) - julianday(rs.first_build))` for average days after the first build.

---

### The solution

**Per-repo anchor, 365-day window, two AVGs**

```sql
WITH repo_start AS (
    SELECT repo_name, MIN(built_at) AS first_build
    FROM ci_builds
    GROUP BY repo_name
)
SELECT
    rc.author,
    AVG(rc.added) AS avg_lines_added,
    AVG(julianday(rc.commit_at) - julianday(rs.first_build)) AS avg_days_after_first_build
FROM repo_commits rc
JOIN repo_start rs ON rc.repo_name = rs.repo_name
WHERE rc.commit_at <= datetime(rs.first_build, '+365 days')
GROUP BY rc.author
```

> **Time and Space Complexity**
>
> **Time:** O(b + c) where b is `ci_builds` rows and c is `repo_commits` rows. The CTE hash-aggregates `ci_builds` to ~400 rows (one per repo), then a hash join on `repo_name` against the much larger `repo_commits` is cheap.
> 
> **Space:** O(r) for the per-repo anchor table where r is the number of repos.

> **Interviewers Watch For**
>
> Strong candidates ask up front whether the early window is per-repo or global, and they read the column names carefully so they aggregate `commit_at` (timestamp) not `message` (text). They also use `built_at`, the actual build timestamp, rather than `trigger` (the trigger type string).

> **Common Pitfall**
>
> Using a global time window instead of a per-repo one. Repos started at different times, so a fixed calendar window mixes "early" commits from old repos with "late" commits from new repos. The CTE per `repo_name` is what makes the window per-repo.

---

## Common follow-up questions

- How would the query change if early meant the first 30 days instead of a year? _(Tests changing the window length without restructuring the query.)_
- What does this query do for an author who commits to two repos with different first_build dates? _(Tests whether the candidate notices that an author can commit to multiple repos and the per-author AVG mixes those windows.)_
- How would you produce a rolling 30-day commit count per author within their early window? _(Tests using window functions to track velocity over time rather than a single AVG.)_
- Which indexes on `repo_commits` and `ci_builds` would the planner pick up to avoid a full scan? _(Tests indexes for the join + range filter shape.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/early_commit_velocity_by_author)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.