# The Duplicate Detection Sprint

> Same email, different rows. Spot the repeats.

Canonical URL: <https://datadriven.io/problems/the_duplicate_detection_sprint>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Ahead of a CRM migration, the data quality team needs to flag duplicate email addresses. For each email that appears more than once in the user records, show the email, how many times it appears, and the earliest and most recent signup dates.

## Worked solution and explanation

### Why this problem exists in real interviews

Duplicate detection via `GROUP BY` and `HAVING` is a fundamental SQL pattern. Interviewers use this to confirm you understand aggregation-then-filter semantics, and that you can extract boundary values (`MIN`, `MAX`) alongside counts in a single pass.

---

### Break down the requirements

#### Step 1: Group by email

`GROUP BY email` collapses the user rows so each email address becomes one output row with aggregate values.

#### Step 2: Filter to duplicates

`HAVING COUNT(*) > 1` keeps only emails that appear more than once.

#### Step 3: Extract date boundaries

`MIN(signup_date)` and `MAX(signup_date)` in the SELECT show when the first and last accounts with that email were created.

---

### The solution

**Group, count, and filter for duplicates**

```sql
SELECT
    email,
    COUNT(*) AS occurrence_count,
    MIN(signup_date) AS earliest_signup,
    MAX(signup_date) AS latest_signup
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY occurrence_count DESC
```

> **Cost Analysis**
>
> A single sequential scan of the users table feeds the hash aggregate. If an index exists on `email`, the planner may use an index-only scan. The HAVING filter is applied after aggregation, so its cost is negligible relative to the scan.

> **Interviewers Watch For**
>
> This is a warm-up question, so interviewers watch for speed and precision. Hesitating on whether the count filter goes in WHERE vs. HAVING is a red flag at any level. Strong candidates write this in under two minutes.

> **Common Pitfall**
>
> Case sensitivity: if emails are stored with mixed case (`User@Example.com` vs `user@example.com`), you need `LOWER(email)` in the GROUP BY to catch true duplicates. The prompt does not mention this, but raising it proactively shows attention to data quality.

---

## Common follow-up questions

- How would you handle case-insensitive duplicate detection? _(Tests LOWER() in GROUP BY and awareness of collation settings.)_
- What if you also needed to return the user_ids of the duplicate accounts? _(Tests GROUP_CONCAT or a self-join approach to list IDs per duplicate group.)_
- How would you deduplicate and keep only the earliest account? _(Tests DELETE with a subquery or ROW_NUMBER partitioned by email.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_duplicate_detection_sprint)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.