# Duplicate Training Runs

> Same model, trained twice.

Canonical URL: <https://datadriven.io/problems/duplicate_training_runs>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The ML platform's training metadata table has duplicate runs inflating compute budgets. Surface models that were trained more than once on the same calendar day with the same framework, showing the model name, framework, training date, and run count.

## Worked solution and explanation

### Why this problem exists in real interviews

ML training runs are expensive, and duplicates waste compute. This tests the `GROUP BY` / `HAVING` pattern and whether you can identify which columns define a "duplicate" run versus a legitimate new version.

---

### Break down the requirements

#### Step 1: Define the duplicate key

In `ml_models`, a duplicate training run means the same `mdl_name` was trained multiple times. The `version` column distinguishes intentional re-training from duplicates.

#### Step 2: Group and count

`GROUP BY mdl_name` with `HAVING COUNT(*) > 1` surfaces models trained more than once.

---

### The solution

**Identify duplicate model training runs**

```sql
SELECT mdl_name, COUNT(*) AS run_count
FROM ml_models
GROUP BY mdl_name
HAVING COUNT(*) > 1
ORDER BY run_count DESC
```

> **Cost Analysis**
>
> Hash aggregation over the ML models table. Low cardinality on `mdl_name` makes this very fast.

> **Interviewers Watch For**
>
> The interviewer checks whether you ask "what defines a duplicate?" before writing the query. Blindly grouping by all columns hides duplicates.

> **Common Pitfall**
>
> Including `version` or `train_at` in the GROUP BY makes every row unique, hiding the duplicates entirely.

---

## Common follow-up questions

- What if two runs have the same name but different accuracy, are those duplicates? _(Tests that business context determines the duplicate definition.)_
- How would you keep only the highest-accuracy run per model? _(Tests ROW_NUMBER with PARTITION BY mdl_name ORDER BY accuracy DESC.)_
- How would you prevent duplicate submissions at the application level? _(Tests unique constraints or application-level deduplication.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/duplicate_training_runs)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.