# Repos with More Builds Than Commits

> More builds than commits. Something is off.

Canonical URL: <https://datadriven.io/problems/repos_with_more_builds_than_commits>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Looking at CI/CD health, find all repos where the number of builds meets or exceeds the number of commits. Show each repo with its build count, ranked from highest to lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is self-join, applied to the `ci_builds` and `repo_commits` tables in a analytics context. Getting columns like `build_id`, `repo_name`, and `commit_id` right is where most candidates slip. The problem layers in HAVING for post-aggregation filtering as well.

---

### Break down the requirements

#### Step 1: Join `ci_builds` to `repo_commits`

The join connects the two tables on their shared key. This brings the columns needed for filtering and aggregation into a single row set.

#### Step 2: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

---

### The solution

**Having filter for repos with more builds than**

```sql
SELECT cb.repo_name, COUNT(DISTINCT cb.build_id) AS build_count
FROM ci_builds cb
INNER JOIN repo_commits rc ON cb.repo_name = rc.repo_name
GROUP BY cb.repo_name
HAVING COUNT(DISTINCT cb.build_id) >= COUNT(DISTINCT rc.commit_id)
ORDER BY build_count DESC
```

> **Cost Analysis**
>
> With ~7M rows, the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you use HAVING (not WHERE) to filter after aggregation; whether you know when DISTINCT is needed and when it masks a logic error.

> **Common Pitfall**
>
> Putting the aggregate condition in WHERE instead of HAVING causes a syntax error. WHERE runs before GROUP BY; HAVING runs after.

---

## Common follow-up questions

- The `dur_secs` column in `ci_builds` has a 2% null rate. How does your query handle rows where `dur_secs` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `ci_builds.dur_secs` and understands how aggregates skip NULL values.)_
- If `repo_commits` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `message`? _(Tests ability to identify performance hotspots related to `repo_commits.message` at scale.)_
- If the HAVING threshold in your query changed from a fixed number to a percentile, how would you restructure the query? _(Tests ability to replace static HAVING filters with dynamic subquery-based thresholds.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/repos_with_more_builds_than_commits)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.