# Heavy Hitters

> Some repos never sleep.

Canonical URL: <https://datadriven.io/problems/heavy_hitters>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The eng director wants to spot repos with higher-than-average activity. Count commits per repo in repo_commits, compare against the overall average commits per repo across the table, and keep repos that exceed it. Return the repo_name and its commit count.

## Worked solution and explanation

### Why this problem exists in real interviews

Finding repos with above-average commit counts tests subquery-based thresholds. The prompt asks to compare each repo's count to the overall average, requiring a two-level query.

> **Trick to Solving**
>
> Compute the global average first, then filter groups that exceed it.
> 
> 1. Count commits per repo with GROUP BY
> 2. Compute the overall average in a subquery or CTE
> 3. Filter with HAVING COUNT(*) > (SELECT AVG ...)

---

### Break down the requirements

#### Step 1: Count commits per repo

`GROUP BY repo_name` with `COUNT(*)` gives per-repo commit counts.

#### Step 2: Compare to the global average

Use `HAVING COUNT(*) > (SELECT AVG(cnt) FROM (...))` or a CTE-based approach.

---

### The solution

**Above-average filtering with subquery threshold**

```sql
WITH repo_counts AS (
    SELECT repo_name, COUNT(*) AS commit_count
    FROM repo_commits
    GROUP BY repo_name
)
SELECT repo_name, commit_count
FROM repo_counts
WHERE commit_count > (SELECT AVG(commit_count) FROM repo_counts)
ORDER BY commit_count DESC
```

> **Cost Analysis**
>
> The CTE materializes once. The subquery for AVG scans the small CTE result. Total cost is one full scan of repo_commits.

> **Interviewers Watch For**
>
> The interviewer checks whether you correctly compute the threshold dynamically rather than hardcoding it.

> **Common Pitfall**
>
> Hardcoding a threshold instead of computing the average from the data fails when the data distribution changes.

---

## Common follow-up questions

- How would you show each repo's count alongside the average? _(Tests adding AVG() OVER() as a window function.)_
- What if you wanted repos above the median instead? _(Tests PERCENTILE_CONT or manual median computation.)_
- How would you compute the percentage of total commits each heavy hitter represents? _(Tests SUM() OVER() for percentage of total.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/heavy_hitters)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.