# Message Length

> Verbose commits. Risky changes?

Canonical URL: <https://datadriven.io/problems/message_length>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The engineering manager believes that longer commit messages correlate with riskier changes and wants data to back it up. For each author, show the average commit message length, the total number of commits, and the average lines of code added per commit. Skip any commits that have no message. Only include authors with more than two commits, listed from longest average message to shortest.

## Worked solution and explanation

### Why this problem exists in real interviews

Extracting insights from repo_commits.repo_name grouped by author via grouping and averaging is the central task. It is used as a fundamentals check to test whether you pick the right aggregation function and partition boundary on the first attempt.

---

### Break down the requirements

#### Step 1: Filter to the target rows

Filter for NULL or non-NULL values in the `WHERE` clause. This must happen before aggregation to avoid corrupted results.

#### Step 2: Aggregate with COUNT/AVG

Group by the output grain and apply `COUNT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 3: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Multi-aggregate with NULL-guarded string function**

```sql
SELECT author,
    ROUND(AVG(LENGTH(message)), 2) AS avg_msg_length,
    COUNT(*) AS total_commits,
    ROUND(AVG(added), 2) AS avg_lines_added
FROM repo_commits
WHERE message IS NOT NULL AND LENGTH(message) > 0
GROUP BY author
ORDER BY avg_msg_length DESC
```

> **Cost Analysis**
>
> The query scans 10M rows from `repo_commits`. The aggregation reduces the row count before any downstream processing, which is the key performance lever. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Walking through comparison logic step by step, rather than writing it in one pass, demonstrates structured thinking.

> **Common Pitfall**
>
> Placing a filter in `WHERE` instead of `HAVING` (or vice versa) is a common mistake. `WHERE` filters rows before aggregation; `HAVING` filters groups after.

---

## Common follow-up questions

- What happens to your result if repo_commits.message contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on message.)_
- How would you verify that your aggregation on repo_commits.commit_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in repo_commits.commit_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like commit_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/message_length)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.