# Not From Around Here

> The data is mixed. Only some of it belongs.

Canonical URL: <https://datadriven.io/problems/filter_by_domain>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The security team flagged the '@example.com' domain for suspicious signup activity. Pull all users whose email ends with '@example.com', returning their user_id, username, and full email address.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is string pattern matching in a `WHERE` clause over `users`. Candidates must anchor the match to the end of `email` and return only the requested columns. No join, aggregation, or window function is needed.

---

### Break down the requirements

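#### Step 1: Filter on the email suffix

`WHERE email LIKE '%@example.com'` keeps rows whose `email` ends with the flagged domain. The leading `%` matches any local part, and the absence of a trailing `%` anchors the match to the end of the string.

#### Step 2: Return the requested columns

`SELECT user_id, username, email` returns exactly the three columns the prompt asks for, rather than `SELECT *`.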

---

### The solution

**Suffix match with LIKE**

```sql
-- Keep only rows whose email ends with the flagged domain.
-- The leading % matches any local part; the literal '@' anchors the domain.
SELECT user_id, username, email
FROM users
WHERE email LIKE '%@example.com'
```
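
To check a suffix filter for the problem as stated against edge cases, a small self-contained query can help. The sketch below assumes Postgres; the `sample_users` rows are hypothetical and stand in for the real `users` table, which the problem does not show.

```sql
-- Hypothetical sample rows; the real users table is not shown in the problem.
WITH sample_users (user_id, username, email) AS (
    VALUES
        (1, 'alice', 'alice@example.com'),          -- ends with the flagged domain
        (2, 'bob',   'bob@notexample.com'),         -- different domain that shares a suffix
        (3, 'carol', 'carol@example.com.evil.org'), -- flagged text is not at the end
        (4, 'dave',  'DAVE@EXAMPLE.COM')            -- differs only by case
)
SELECT user_id, username, email
FROM sample_users
WHERE email LIKE '%@example.com'
ORDER BY user_id;
```

In Postgres, where `LIKE` is case-sensitive, only the row for `alice` comes back. Engines whose `LIKE` is case-insensitive by default would also return `dave`.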

> **Cost Analysis**
>
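> A `LIKE` pattern with a leading `%` cannot use a plain B-tree index on `email`, so the filter scans every row of `users`. On a large table, an expression index on the reversed email or a stored domain column makes the suffix match indexable.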
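
For a suffix filter such as `email LIKE '%@example.com'`, one possible way to avoid a full scan is to index the reversed string so the suffix becomes a prefix. This is a sketch assuming Postgres; the index name is hypothetical, and whether the planner actually uses the index depends on table statistics and selectivity.

```sql
-- Assumption: Postgres. The index name is hypothetical.
-- Reversing the string turns the suffix match into a prefix match,
-- which a B-tree index with text_pattern_ops can serve.
CREATE INDEX idx_users_email_reversed
    ON users (reverse(email) text_pattern_ops);

-- 'moc.elpmaxe@' is '@example.com' reversed.
SELECT user_id, username, email
FROM users
WHERE reverse(email) LIKE 'moc.elpmaxe@%';
```

A trigram index from the `pg_trgm` extension is another option when patterns vary and a fixed reversed prefix is too narrow.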

> **Interviewers Watch For**
>
> The interviewer checks whether you anchor the pattern at the end of the string and include the `@`, and whether you return only `user_id`, `username`, and `email` instead of `SELECT *`.

> **Common Pitfall**
>
> Writing `LIKE '%example.com%'` or dropping the `@` also matches addresses such as `bob@notexample.com` and `carol@example.com.evil.org`. Anchor the pattern as `'%@example.com'` with no trailing `%`.

---

## Common follow-up questions

- What happens to your results if `email` in `users` contains trailing whitespace or mixed casing? _(Tests awareness of text normalization issues that silently drop rows from a pattern match.)_
- `LIKE` treats `%` and `_` as wildcards. How would you match a suffix that contains a literal underscore or percent sign? _(Tests knowledge of `LIKE` escaping and when it matters.)_
- `users` has ~10M rows. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate understands why a leading wildcard defeats a B-tree index and what alternatives exist.)_
- Could you express this same filter without `LIKE`, for example with `RIGHT`, `split_part`, or a regular expression? What readability trade-off does that introduce? _(Tests whether the candidate knows alternative string functions and can weigh clarity against flexibility.)_
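
The follow-up about trailing whitespace and mixed casing matters most for the `email` column that the problem filters on. A minimal sketch of a normalized filter, assuming Postgres and assuming the stored emails may carry stray spaces or uppercase letters:

```sql
-- Assumption: emails may contain leading/trailing spaces or mixed case.
-- TRIM removes surrounding spaces and LOWER folds case before the match;
-- the stored email is still returned unchanged.
SELECT user_id, username, email
FROM users
WHERE LOWER(TRIM(email)) LIKE '%@example.com';
```

Wrapping the column in functions keeps the comparison tolerant, but it also means a plain index on `email` cannot serve the filter, so normalizing the data at write time is often the cleaner fix.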

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/filter_by_domain)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.