# What's in a Name

> Group by the first letter, count the heads, show the share.

Canonical URL: <https://datadriven.io/problems/initial_count>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The support team noticed certain username patterns correlate with ticket volume and wants to verify a hunch. For each first letter of the username, show how many users share that initial and what percentage of the total user base they represent. Round the percentage to one decimal place and show the most common initials first.

## Worked solution and explanation

### Why this problem exists in real interviews

Three skills sit on top of each other in one query. You compute a derived value (the first character of the username), aggregate by it, and then express each group as a percentage of the whole. Interviewers use this exact shape because it forces you to think about what the GROUP BY key actually is when it is not a column you can point at directly, and to compute a ratio without breaking it into two queries.

---

### Break down the requirements

#### Step 1: Extract the initial

Pull the first character with `SUBSTR(username, 1, 1)`. SQLite's `SUBSTR(text, start, length)` is 1-indexed; `LEFT(username, 1)` works in Postgres or MySQL but not in SQLite.

#### Step 2: Group by the derived expression

GROUP BY the same expression. Aliases defined in the SELECT list are not visible inside GROUP BY in standard SQL, so repeat the `SUBSTR(...)` call. SQLite tolerates the alias, but most interview engines do not.

#### Step 3: Compute the share

Use a scalar subquery `(SELECT COUNT(*) FROM users)` as the denominator and multiply the numerator by `100.0` so the division is floating-point, not integer. `ROUND(..., 1)` enforces the one-decimal rule from the prompt.

#### Step 4: Sort most-common first

`ORDER BY user_count DESC` puts the most common initials first. Add a tiebreak (e.g. `, initial`) if you want a stable order across runs; here the prompt is silent so the simple ORDER BY is fine.

---

### The solution

**Scalar subquery for the denominator**

```sql
SELECT
    SUBSTR(username, 1, 1) AS initial,
    COUNT(*) AS user_count,
    ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM users), 1) AS pct
FROM users
GROUP BY SUBSTR(username, 1, 1)
ORDER BY user_count DESC
```

**Window function for the denominator**

```sql
SELECT
    SUBSTR(username, 1, 1) AS initial,
    COUNT(*) AS user_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM users
GROUP BY SUBSTR(username, 1, 1)
ORDER BY user_count DESC
```

> **Cost analysis**
>
> Two passes over `users`: one for the GROUP BY, one for the scalar subquery. The window-function variant does it in a single pass, which matters once the table is hundreds of millions of rows. Either form is acceptable in an interview; the window-function answer signals stronger fluency.

> **Interviewers watch for**
>
> Watch how the candidate handles the percentage. Two-query answers (running a separate `SELECT COUNT(*)` first and pasting the number into the main query) are a red flag for a senior role; the whole point of a scalar subquery or window function is to keep the result deterministic against a moving table.

> **Common pitfall**
>
> Integer division. `COUNT(*) / (SELECT COUNT(*) FROM users)` returns 0 for every group because both operands are integers. Force float math by multiplying by `100.0` (note the decimal) before dividing, or by casting one side with `CAST(... AS REAL)`.

---

## Common follow-up questions

- If `users.username` could be NULL, how does that affect your `SUBSTR` and your percentage calculation? _(Tests whether the candidate handles NULL usernames correctly. `COUNT(*)` counts NULL groups, `SUBSTR(NULL, 1, 1)` returns NULL, and the resulting NULL initial sorts at the end (or beginning) depending on engine.)_
- Some usernames are uppercase and some lowercase. Should `Alice` and `alice` count as the same initial, and how would you change the query to enforce that? _(Tests understanding of case sensitivity in GROUP BY and how locale/collation choices change the answer.)_
- Rewrite this without the scalar subquery in the SELECT list. What does that change about how many times the engine reads the table? _(Tests whether the candidate can swap a scalar subquery for an analytic function and explain the tradeoff.)_
- If you sum the `pct` column across all rows, are you guaranteed to get exactly 100.0? Why or why not? _(Tests numeric reasoning around rounding. Naive `ROUND(..., 1)` does not guarantee the column sums to 100.0 because of independent rounding per row.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/initial_count)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.