# Models With Variable Accuracy

> Accuracy should be stable. These models are not.

Canonical URL: <https://datadriven.io/problems/models_with_variable_accuracy>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

For each model in the ML registry, show the minimum and maximum accuracy, but only include models where the two values differ. Results should appear alphabetically by model name.

## Worked solution and explanation

### Why this problem exists in real interviews

The interviewer wants to see you apply HAVING filter and grouping to ml_models.mdl_name while accounting for the distribution of version. This surfaces in mid-level screens because small logic errors produce results that look correct at a glance.

> **Trick to Solving**
>
> Mode requires finding the most frequent value. The trick is combining `COUNT` with `ORDER BY DESC LIMIT 1`.
> 
> 1. Group by the target column and count occurrences
> 2. Order by count descending
> 3. Handle ties if the prompt requires it

---

### Break down the requirements

#### Step 1: Aggregate with MIN/MAX

Group by the output grain and apply `MIN()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 2: Filter groups with HAVING

The `HAVING` clause filters after aggregation, unlike `WHERE` which filters before. This is necessary when the condition depends on an aggregate result.

#### Step 3: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**HAVING inequality for non-constant groups**

```sql
SELECT mdl_name, MIN(accuracy) AS min_accuracy, MAX(accuracy) AS max_accuracy
FROM ml_models
GROUP BY mdl_name
HAVING MIN(accuracy) <> MAX(accuracy)
ORDER BY mdl_name
```

> **Cost Analysis**
>
> The query scans 3K rows from `ml_models`. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability. Familiarity with statistical functions beyond COUNT/SUM/AVG distinguishes senior candidates.

> **Common Pitfall**
>
> Placing a filter in `WHERE` instead of `HAVING` (or vice versa) is a common mistake. `WHERE` filters rows before aggregation; `HAVING` filters groups after.

---

## Common follow-up questions

- What happens to your result if ml_models.accuracy contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on accuracy.)_
- What is the difference between filtering in WHERE versus HAVING for this query against ml_models? _(Tests whether the candidate understands pre-aggregation vs post-aggregation filtering.)_
- If ml_models grows to hundreds of millions of rows, how would you partition or index on train_at to maintain performance? _(Tests partitioning strategy for time-series data in train_at.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/models_with_variable_accuracy)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.