# Keep Most Recent Record

> Carbon copies clutter the table. Only the latest matters.

Canonical URL: <https://datadriven.io/problems/keep_most_recent_record>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

The users table has duplicate entries from overlapping import jobs, causing downstream fan-outs. Write a query that deduplicates by keeping only the most recently updated record for each user.

## Worked solution and explanation

### What this is really asking

20M rows in `users` with duplicate `user_id` values from overlapping imports. Pick one row per `user_id`, the latest by `signup_date`, and drop the rest before downstream fan-out multiplies the damage.

---

### Break down the requirements

#### Step 1: Partition by user_id

Each `user_id` is its own dedup group. Within the group, rank rows so one row gets the winning slot.

#### Step 2: Order DESC on signup_date

Newest first inside each partition. Ties (same `signup_date`) are non-deterministic with `ROW_NUMBER`; add a tiebreaker if you need stability across reruns.

#### Step 3: Keep rn = 1

Filter to the single top-ranked row per user. `ROW_NUMBER` guarantees exactly one; `RANK` would let ties through.

---

### The solution

**DEDUPE BY LATEST SIGNUP**

```sql
WITH ranked AS (
  SELECT
    user_id,
    username,
    email,
    signup_date,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY signup_date DESC
    ) AS rn
  FROM users
)
SELECT user_id, username, email, signup_date
FROM ranked
WHERE rn = 1;
```

> **Cost Analysis**
>
> One full pass over 20M rows plus a sort per `user_id` partition. A covering index on `(user_id, signup_date DESC)` lets the planner stream the window without a global sort, which is the difference between minutes and seconds.

> **Interviewers Watch For**
>
> Are you using `ROW_NUMBER` over `RANK` or `DENSE_RANK`? Did you call out tie behavior on `signup_date`? Did you ask whether `signup_date` is the right recency column when the prompt says `most recently updated`?

> **Common Pitfall**
>
> Writing `WHERE rn = 1` in the same SELECT as the window function. The window has not yet been computed at that stage, so the filter errors or, worse on some engines, silently keeps every row. Wrap in a CTE or subquery.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Two rows share the same `user_id` and `signup_date`. Which one wins? _(`ROW_NUMBER` picks arbitrarily on ties. Add a deterministic tiebreaker like `ORDER BY signup_date DESC, account_status, username` or the physical ingest timestamp.)_
- How would you delete the losers in place instead of returning the survivors? _(Wrap the ranked CTE in `DELETE FROM users WHERE (user_id, signup_date) IN (SELECT ... WHERE rn > 1)`, or use `DELETE USING` syntax on Postgres. Run it in a transaction with a row count sanity check.)_
- Could `QUALIFY` simplify this? _(On Snowflake, BigQuery, or DuckDB: `SELECT ... FROM users QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY signup_date DESC) = 1`. One pass, no CTE, same plan.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/keep_most_recent_record)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.