# Model Training Completion Rate

> How many models finished training?

Canonical URL: <https://datadriven.io/problems/model_training_completion_rate>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

For each model in the ML registry, calculate the average accuracy and the completion rate (percentage of training runs where accuracy is not null). Return the model name, average accuracy, and completion rate.

## Worked solution and explanation

### Why this problem exists in real interviews

Working against ml_models, this problem tests grouping and averaging on the mdl_name and version columns. Interviewers use it in mid-level screens because a subtle mis-grouping or filter placement changes the output without raising an error.

> **Trick to Solving**
>
> Mode requires finding the most frequent value. The trick is combining `COUNT` with `ORDER BY DESC LIMIT 1`.
> 
> 1. Group by the target column and count occurrences
> 2. Order by count descending
> 3. Handle ties if the prompt requires it

---

### Break down the requirements

#### Step 1: Read from `ml_models`

The query targets `ml_models` with 7 columns. Identify which columns are needed for the output.

#### Step 2: Aggregate with COUNT/AVG

Group by the output grain and apply `COUNT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 3: Return the result set

Select the required columns with any necessary aliasing or formatting.

---

### The solution

**COUNT(col) vs COUNT(*) for completion rate**

```sql
SELECT mdl_name,
    ROUND(AVG(accuracy), 4) AS avg_accuracy,
    ROUND(COUNT(accuracy) * 100.0 / COUNT(*), 2) AS completion_rate
FROM ml_models
GROUP BY mdl_name
```

> **Cost Analysis**
>
> The query scans 2K rows from `ml_models`.

> **Interviewers Watch For**
>
> Explicitly mentioning NULL handling before being asked signals production awareness. Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax.

> **Common Pitfall**
>
> NULL values are silently excluded from `COUNT(column)` but included in `COUNT(*)`. Mixing these up produces incorrect totals.

---

## Common follow-up questions

- What happens to your result if ml_models.accuracy contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on accuracy.)_
- How would you verify that your aggregation on ml_models.model_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- If ml_models grows to hundreds of millions of rows, how would you partition or index on train_at to maintain performance? _(Tests partitioning strategy for time-series data in train_at.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/model_training_completion_rate)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.