# Net Lines

> Some authors build. Others trim. The net tells the truth.

Canonical URL: <https://datadriven.io/problems/net_lines>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Engineering wants to see who is growing the codebase versus trimming it. Each commit records lines added and lines removed. For each author, compute their net line contribution (total added minus total removed).

## Worked solution and explanation

### Why this problem exists in real interviews

Engineering productivity dashboards routinely surface a per-author net contribution to the codebase. Interviewers use this prompt to confirm you can spot that the sign convention lives in the data already (added is positive, removed is positive but to be subtracted) and that you reach for SUM with arithmetic rather than overcomplicating with CASE. It also checks whether you GROUP BY the right grain on the first try.

---

### Break down the requirements

#### Step 1: Pick the right grouping grain

The grain is one row per author. `GROUP BY author` collapses every commit a person ever wrote into a single row before any arithmetic happens.

#### Step 2: Sum each side, then subtract

`SUM(added)` totals every line the author ever wrote and `SUM(removed)` totals every line they ever deleted. Subtracting the two SUMs (not summing the per-row difference) keeps the math obvious and lets the planner compute both totals in one aggregate pass.

---

### The solution

**Per-author net lines via SUM minus SUM**

```sql
SELECT author, SUM(added) - SUM(removed) FROM repo_commits GROUP BY author
```

> **Cost Analysis**
>
> One full scan of `repo_commits` (4,000,000 rows) feeds a hash aggregate keyed on `author` (180 distinct values), so the aggregate state stays tiny and never spills. There is no index that helps because every row has to be read to compute totals. If the table grew 10x and was partitioned by `commit_at`, partial aggregation per partition followed by a final merge would scale linearly.

> **Interviewers Watch For**
>
> Interviewers want to see SUM(added) - SUM(removed), not SUM(added - removed) wrapped in a CASE. Both happen to give the same answer here, but the first version reads as 'totals, then net' the way the product stakeholder asked for it. They also watch whether you group by `author` alone instead of accidentally dragging `repo_name` or `message` into the GROUP BY.

> **Common Pitfall**
>
> Adding extra columns to the SELECT list (like `repo_name` or `message`) without adding them to the GROUP BY raises an error in strict SQL and silently picks an arbitrary value in SQLite. Either GROUP BY them too (changing the grain to per-author-per-repo) or leave them out.

---

## Common follow-up questions

- How would you handle an author who commits under two different strings in `repo_commits.author`, like 'jane@x.com' and 'Jane Doe'? _(Tests whether the candidate recognizes that the same author may commit under multiple `author` strings (email casing, spelling variations) and that the upstream fix is in identity resolution, not in the SQL.)_
- How would you change the query to show net lines per author per `repo_name` instead of overall? _(Tests grouping at a different grain. The answer is `GROUP BY author, repo_name`, which changes the row count from 180 to up to 180 * 60.)_
- If leadership only wants net lines for the last 90 days, where in the query does the date filter go and why? _(Tests filter-then-aggregate ordering. The right answer is a `WHERE commit_at >= ...` filter before the GROUP BY so the aggregate only sees in-range rows.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/net_lines)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.