# First and Last Peak Accuracy Dates

> Peak accuracy. When it first hit and when it last did.

Canonical URL: <https://datadriven.io/problems/first_and_last_peak_accuracy_dates>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The ML team wants to know when peak model performance was first achieved and whether it has been replicated since. Find the highest accuracy score across all models, then show the earliest and latest training dates where that accuracy was hit.

## Worked solution and explanation

### Why this problem exists in real interviews

Interviewers use the `ml_models` table here to probe row numbering within partitions combined with nested subqueries. The columns `mdl_name`, `version`, `accuracy` force candidates to reason about the correct grain before writing any aggregation.

---

### Break down the requirements

#### Step 1: Partition by `mdl_name`

`PARTITION BY mdl_name` creates groups. Within each group, `ORDER BY train_at ASC` determines the ranking.

#### Step 2: Filter to rank 1

`WHERE rnk = 1` in the outer query selects the target row per group.

---

### The solution

**Row-number for first last peak accuracy**

```sql
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY mdl_name ORDER BY train_at ASC) AS rnk
    FROM ml_models
) ranked
WHERE rnk = 1
ORDER BY mdl_name
```

> **Cost Analysis**
>
> Window function sorts within each `mdl_name` partition. An index on `(mdl_name, train_at)` avoids a full sort.

> **Interviewers Watch For**
>
> The interviewer checks whether you use ROW_NUMBER (one row) vs. RANK/DENSE_RANK (ties) based on the prompt requirements.

> **Common Pitfall**
>
> Using GROUP BY with MIN(train_at) gives the value but not the other columns. ROW_NUMBER gives the full row.

---

## Common follow-up questions

- The `accuracy` column in `ml_models` has roughly 3% NULLs. How does your query handle those rows, and would the result change if NULLs were replaced with zeros? _(Tests whether the candidate understands how NULLs propagate through aggregation functions and whether their WHERE/JOIN conditions implicitly filter them out.)_
- Your window function uses a default frame. What is the implicit frame, and would switching to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW change anything? _(Tests knowledge of default window frames (RANGE vs ROWS) and when the distinction matters.)_
- The `mdl_name` column in `ml_models` has a zipf distribution, meaning a few values dominate. How does that skew affect your query plan and parallelism? _(Tests understanding of data skew: the optimizer may choose a bad plan when histogram statistics are stale.)_
- Could you express this same logic as a single query without CTEs or subqueries? What readability trade-off does that introduce? _(Tests whether the candidate can flatten nested logic and understands when decomposition aids maintainability.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/first_and_last_peak_accuracy_dates)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.