# Long Messages

> Some commit messages tell a novel.

Canonical URL: <https://datadriven.io/problems/long_messages>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

The code quality team suspects some commits are low-effort one-liners. Find every commit whose message is longer than 10 characters, skipping rows with no message. Return the author, message, and its length.

## Worked solution and explanation

### Why this problem exists in real interviews

Filtering `repo_commits` on a computed value, the length of `message`, tests whether you can translate a business requirement into the right column references and filter sequence. It shows up in mid-level screens to verify practical fluency.

> **Trick to Solving**
>
> Read the prompt carefully for implicit constraints. Here, "skipping rows with no message" implies a NULL filter, and "every commit" sets the grain of the output: one row per commit.
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: start with the `FROM` and `WHERE` filters, then layer on the computed column and ordering

---

### Break down the requirements

#### Step 1: Filter to the target rows

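Keep only rows where `message` is not NULL and its length exceeds 10 characters. Both conditions go in the `WHERE` clause. `LENGTH(NULL)` evaluates to NULL, so the length comparison alone would already drop those rows, but the explicit `IS NOT NULL` guard makes the intent clear.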
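
To see how the length comparison treats a missing message, here is a minimal sketch that runs against an inline sample instead of the real table. The three rows and the `passes_filter` column are made up for illustration, and the syntax assumes Postgres.

```sql
-- Hypothetical sample rows standing in for repo_commits (not real data).
SELECT
    author,
    message,
    LENGTH(message)      AS msg_length,
    LENGTH(message) > 10 AS passes_filter  -- NULL (not false) when message is NULL
FROM (
    VALUES
        ('alice', 'Refactor the ingestion pipeline'),
        ('bob',   'fix'),
        ('carol', CAST(NULL AS text))
) AS sample(author, message);
```

The row with the NULL message yields NULL for both `msg_length` and `passes_filter`. A `WHERE` clause keeps only rows where the condition is true, so that row would be dropped either way.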

#### Step 2: Order the final output

The prompt does not specify an order, so the solution sorts by `msg_length` descending to surface the longest messages first. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**LENGTH filter with NULL guard**

```sql
SELECT author, message, LENGTH(message) AS msg_length
FROM repo_commits
WHERE message IS NOT NULL
    AND LENGTH(message) > 10
ORDER BY msg_length DESC
```
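
Step 2 suggests a secondary sort column when lengths tie. A possible variant is sketched below. It assumes `commit_id` (the column named in the follow-up questions) is unique per commit; that is an assumption, not something the prompt states.

```sql
SELECT author, message, LENGTH(message) AS msg_length
FROM repo_commits
WHERE message IS NOT NULL
    AND LENGTH(message) > 10
ORDER BY msg_length DESC,
         commit_id ASC;  -- assumed unique key; breaks ties between equal lengths
```

Without the tie-breaker, rows with the same length can come back in any order from one run to the next.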

> **Cost Analysis**
>
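> The query scans 3M rows from `repo_commits`. The filter applies `LENGTH()` to `message`, so a plain index on `message` cannot serve the predicate, and a sequential scan is the likely plan. The query uses no CTEs or subqueries, so there are no intermediate results to inline or materialize.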
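
One way to check how the engine handles the scan, assuming Postgres, is to inspect the plan with `EXPLAIN`. The expression index below is a sketch with a hypothetical name. It is only likely to help if few rows pass the length filter; otherwise the planner may still prefer a sequential scan.

```sql
-- Hypothetical index name; indexes the computed length rather than the raw text.
CREATE INDEX idx_repo_commits_msg_length
    ON repo_commits (LENGTH(message));

-- Show the chosen plan without running the query.
EXPLAIN
SELECT author, message, LENGTH(message) AS msg_length
FROM repo_commits
WHERE message IS NOT NULL
    AND LENGTH(message) > 10;
```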

> **Interviewers Watch For**
>
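> For a single-table filter like this, a short, direct query with an explicit NULL guard and a clearly aliased length column shows the interviewer you prioritize readability and debuggability. Save named CTEs for logic that is actually complex.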

> **Common Pitfall**
>
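> This query needs no pattern matching, but if a follow-up asks you to search the message text, remember that `LIKE` is case-sensitive in Postgres and several other dialects. If the prompt does not specify case, use `ILIKE` (Postgres) or `LOWER()` to avoid missing matches.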

---

## Common follow-up questions

- If repo_commits.commit_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in commit_id.)_
- How would you verify that your aggregation on repo_commits.added is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in repo_commits.commit_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like commit_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/long_messages)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.