# The Freshest Record

> Duplicates everywhere. Only the most recent version of the truth survives.

Canonical URL: <https://datadriven.io/problems/deduplicate_and_keep_latest>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

The observability team is cleaning up server logs ahead of a weekly review. Same server emitting the same message multiple times should appear only once, keeping the most recent occurrence. After that, narrow the result to entries from the last 7 days. Show every column from the surviving rows.

## Worked solution and explanation

### Why this problem exists in real interviews

Two operations stacked: deduplicate per (server_name, message) and recency-filter the result. The interviewer is checking that you reach for `ROW_NUMBER` (the portable choice over Postgres-only `DISTINCT ON`) and that you put the recency filter **after** dedup, not before. Order matters: filtering first can change which row wins the partition.

> **Trick to Solving**
>
> Two stages, in this order:
> 
> 1. CTE: `ROW_NUMBER() OVER (PARTITION BY server_name, message ORDER BY log_timestamp DESC)` to label rows; rn=1 is the latest row per group
> 2. Outer: `WHERE rn = 1 AND <recency predicate>` to keep only the latest-per-group AND only the last 7 days

---

### Break down the requirements

#### Step 1: Number rows within each (server_name, message) partition

Build a CTE `ranked` that wraps `server_logs` with `ROW_NUMBER() OVER (PARTITION BY server_name, message ORDER BY log_timestamp DESC) AS rn`. The latest row in each group gets `rn = 1`.

#### Step 2: Filter to rn=1 AND last 7 days

In the outer SELECT, keep `rn = 1` to drop the older duplicates, then apply the recency filter. SQLite has no `INTERVAL` keyword, so express the 7-day window with `julianday`: `julianday(date('now')) - julianday(log_timestamp) <= 7`.

#### Step 3: Project the original columns

Project the original 6 columns explicitly (skip `rn`): `log_id, server_name, log_level, message, response_time_ms, log_timestamp`.

---

### The solution

**ROW_NUMBER dedup, then SQLite-friendly recency filter**

```sql
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY server_name, message
            ORDER BY log_timestamp DESC
        ) AS rn
    FROM server_logs
)
SELECT log_id, server_name, log_level, message, response_time_ms, log_timestamp
FROM ranked
WHERE rn = 1
  AND julianday(date('now')) - julianday(log_timestamp) <= 7
```

> **Time and Space Complexity**
>
> **Time:** O(n log n) for the in-partition sort across 80M rows. The hash partitioner builds groups of (server_name, message), and within each, the sort is the dominant step. A composite index on (server_name, message, log_timestamp DESC) would reduce the sort to a single descending scan.
> 
> **Space:** O(g) for the partition state where g is the number of distinct (server_name, message) pairs.

> **Interviewers Watch For**
>
> Strong candidates dedup first, filter second. They also reach for `julianday` (or the engine's equivalent date arithmetic) instead of Postgres-only `INTERVAL`, since the grader runs on SQLite.

> **Common Pitfall**
>
> Writing `WHERE log_timestamp >= CURRENT_DATE - INTERVAL '7 days'`. That's Postgres syntax and the SQLite grader will reject it. Use `julianday(date('now')) - julianday(log_timestamp) <= 7` or `log_timestamp >= date('now', '-7 days')`. The other trap: applying that filter inside the CTE instead of the outer SELECT, which can change which row wins the partition.

---

## Common follow-up questions

- If you applied the 7-day filter inside the CTE, what could go wrong with the dedup result? _(Tests understanding of how partition contents shift when you filter before vs after dedup.)_
- How would you write this with `DISTINCT ON (server_name, message)` on Postgres? _(Tests Postgres-specific dedup syntax as a more concise alternative.)_
- What changes if you needed to keep the earliest row per group instead of the latest? _(Tests changing the rank ordering for the inverse problem.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/deduplicate_and_keep_latest)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.