# Zero Accuracy on First Training

> First run. Zero accuracy. How common?

Canonical URL: <https://datadriven.io/problems/zero_accuracy_on_first_training>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

What percentage of models had an accuracy of 0 on their very first training run? Only consider models with a non-null status, as null status means the training was cancelled. Return a single percentage.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests combining a per-group first-row identification with conditional aggregation. Interviewers check whether you can find each model's first training run, filter by a condition, and compute a percentage.

---

### Break down the requirements

#### Step 1: Find each model's first training run

Use `ROW_NUMBER() OVER (PARTITION BY model_id ORDER BY train_at)` to rank runs chronologically. Filter to `rn = 1` for the first run.

#### Step 2: Exclude cancelled trainings

`WHERE status IS NOT NULL` removes rows where the training was cancelled.

#### Step 3: Compute percentage with zero accuracy

Count rows where `accuracy = 0` divided by total first-run rows, multiplied by 100.

---

### The solution

**First-run identification with conditional percentage**

```sql
SELECT
    SUM(CASE WHEN accuracy = 0 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS zero_accuracy_pct
FROM (
    SELECT
        model_id,
        accuracy,
        ROW_NUMBER() OVER (PARTITION BY model_id ORDER BY train_at) AS rn
    FROM ml_models
    WHERE status IS NOT NULL
) first_runs
WHERE rn = 1
```

> **Cost Analysis**
>
> With 3K rows, the window function sort within ~120 model partitions is trivial. The entire query completes in milliseconds.

> **Interviewers Watch For**
>
> Whether the NULL status filter is applied before ROW_NUMBER. Cancelled runs should not be considered as the 'first' run. Filtering before ranking ensures the first non-cancelled run is selected.

> **Common Pitfall**
>
> Filtering `WHERE accuracy = 0` before finding the first run. This would only find models whose zero-accuracy run happens to be first, missing models whose first run had non-zero accuracy.

---

## Common follow-up questions

- What if two training runs have the same timestamp for a model? _(ROW_NUMBER is non-deterministic; add model_id or another column as a tiebreaker.)_
- How would you find which models improved from zero accuracy on their first run? _(Compare first run accuracy to the most recent run accuracy using window functions.)_
- What if accuracy is NULL for the first run? _(NULL != 0, so those models would not be counted as zero accuracy. Clarify business rules for NULL accuracy.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/zero_accuracy_on_first_training)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.