# Busy Authors

> Some developers spread their commits everywhere.

Canonical URL: <https://datadriven.io/problems/busy_authors>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The engineering manager values developers who contribute across multiple codebases rather than siloing in one repo. Find every author who has committed to more than one repository and show how many repos they've touched.

## Worked solution and explanation

### Why this problem exists in real interviews

Working with `repo_commits`, this problem isolates post-aggregation filtering with HAVING. The interviewer expects candidates to articulate why `repo_name`, `author`, `message` matter for correctness before touching the keyboard.

---

### Break down the requirements

#### Step 1: Group by `author`

`GROUP BY` at the correct grain produces one row per group.

#### Step 2: Compute `COUNT(*)`

The COUNT function counts rows per group.

#### Step 3: Apply the HAVING threshold

Post-aggregation filter keeps only groups meeting the minimum.

#### Step 4: Order by the metric

Sort by `cnt` desc for readability.

---

### The solution

**Having-filter for busy authors**

```sql
SELECT
    author,
    COUNT(*) AS cnt
FROM repo_commits
GROUP BY author
HAVING COUNT(*) >= 1
ORDER BY cnt DESC
```

> **Cost Analysis**
>
> The main table has 3M rows (1 GB). The GROUP BY reduces the row count early, keeping downstream operations cheap.

> **Interviewers Watch For**
>
> Strong candidates state the correct `GROUP BY` grain before writing any SQL, showing they think about the output shape first.

> **Common Pitfall**
>
> Selecting a non-aggregated column without including it in `GROUP BY` is the most common error. Some engines reject it; others silently return arbitrary values.

---

## Common follow-up questions

- What happens to your results if `commit_id` in `repo_commits` contains trailing whitespace or mixed casing? _(Tests awareness of text normalization issues that silently fragment GROUP BY results.)_
- Could you move the HAVING condition into a WHERE clause on a subquery instead? What are the trade-offs? _(Tests understanding of filter push-down: WHERE filters before grouping, HAVING after.)_
- `commit_id` in `repo_commits` has ~3M distinct values. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate can design indexes for high-cardinality columns and understands selectivity.)_
- Could you express this same logic as a single query without CTEs or subqueries? What readability trade-off does that introduce? _(Tests whether the candidate can flatten nested logic and understands when decomposition aids maintainability.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/busy_authors)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.