# Median Model Accuracy

> The median accuracy. Not the mean.

Canonical URL: <https://datadriven.io/problems/median_model_accuracy>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Compute the median accuracy for each model (mdl_name). When a model has an even number of records, average the two middle values. Ignore records where accuracy is null. Return results ordered alphabetically by model name.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem tests row numbering and percentile calculation over the `ml_models` table, grouped by `mdl_name`. Interviewers use it in senior-level rounds because the edge cases (NULL accuracies, even versus odd row counts) reveal depth of understanding.

> **Trick to Solving**
>
> Most SQL engines have no built-in `MEDIAN` aggregate. The trick is to use `PERCENTILE_CONT(0.5)` where it is supported, or to number the rows with `ROW_NUMBER` and pick the middle one(s).
> 
> 1. Recognize that `AVG` will not give you the median
> 2. Use `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...)` if available
> 3. Alternatively, rank rows and pick the middle value(s)

---

### Break down the requirements

#### Step 1: Set up a CTE for the intermediate result

Wrap the first transformation in a `WITH` clause. This names the intermediate result set and keeps the outer query clean.

#### Step 2: Filter to the target rows

Exclude rows with a NULL accuracy using `WHERE accuracy IS NOT NULL`. This must happen inside the CTE, before rows are numbered and counted; otherwise NULL rows inflate the count and shift the middle positions.

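#### Step 3: Assign row numbers and group counts

`ROW_NUMBER() OVER (PARTITION BY mdl_name ORDER BY accuracy)` gives each row its position within its model, sorted by accuracy. `COUNT(*) OVER (PARTITION BY mdl_name)` attaches the model's row count to every row. The outer query uses both to locate the middle position(s).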
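
The numbering step can be inspected on its own before any aggregation. The sketch below assumes Postgres and uses a handful of hypothetical rows in place of the real `ml_models` table; the values are made up for illustration.

```sql
-- Hypothetical sample rows standing in for ml_models; real data will differ.
WITH ml_models (mdl_name, accuracy) AS (
    VALUES ('model_a', 0.25), ('model_a', 0.75), ('model_a', 0.50),
           ('model_b', 0.50), ('model_b', NULL)
)
SELECT mdl_name, accuracy,
    -- Position of the row within its model, lowest accuracy first
    ROW_NUMBER() OVER (PARTITION BY mdl_name ORDER BY accuracy) AS rn,
    -- Number of non-NULL rows for the model, repeated on every row
    COUNT(*) OVER (PARTITION BY mdl_name) AS cnt
FROM ml_models
WHERE accuracy IS NOT NULL
ORDER BY mdl_name, rn;
-- model_a gets rn 1..3 with cnt = 3; model_b keeps one row with rn = 1, cnt = 1.
```

With `rn` and `cnt` side by side, the middle of `model_a` is visibly the row where `rn = 2`.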

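#### Step 4: Aggregate with AVG

Keep only the middle row (odd count) or the two middle rows (even count) for each model, then group by `mdl_name` and apply `AVG(accuracy)`. Averaging a single row returns that row's value, so one expression covers both cases. The `GROUP BY` must match the output grain: one row per model.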

#### Step 5: Order the final output

Apply `ORDER BY mdl_name` to return the models alphabetically. Because the output has exactly one row per model, no secondary sort column is needed for determinism.

---

### The solution

**Row-counting approach for even/odd median**

```sql
WITH numbered AS (
    SELECT mdl_name, accuracy,
        ROW_NUMBER() OVER (PARTITION BY mdl_name ORDER BY accuracy) AS rn,
        COUNT(*) OVER (PARTITION BY mdl_name) AS cnt
    FROM ml_models
    WHERE accuracy IS NOT NULL
)
SELECT mdl_name,
    AVG(accuracy) AS median_accuracy
FROM numbered
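-- Odd cnt: both expressions give the single middle row.
-- Even cnt: they give the two middle rows.
-- Relies on integer division, as in Postgres.
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)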
GROUP BY mdl_name
ORDER BY mdl_name
```
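
Where the engine supports ordered-set aggregates, the same result can be sketched with `PERCENTILE_CONT(0.5)`, which interpolates between the two middle values when the count is even. The example below assumes Postgres and uses hypothetical sample rows in place of the real table; on this data the medians should work out to 0.5 for `model_a` (three rows) and 0.625 for `model_b` (four non-NULL rows).

```sql
-- Hypothetical sample rows standing in for ml_models; real data will differ.
WITH ml_models (mdl_name, accuracy) AS (
    VALUES ('model_a', 0.25), ('model_a', 0.50), ('model_a', 0.75),
           ('model_b', 0.25), ('model_b', 0.50), ('model_b', 0.75),
           ('model_b', 1.00), ('model_b', NULL)
)
SELECT mdl_name,
    -- Continuous percentile: averages the two middle values for even counts
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY accuracy) AS median_accuracy
FROM ml_models
WHERE accuracy IS NOT NULL   -- explicit, to mirror the problem statement
GROUP BY mdl_name
ORDER BY mdl_name;
```

The row-counting version remains useful on engines without `PERCENTILE_CONT`, and it makes the even/odd logic explicit for the interviewer.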

> **Cost Analysis**
>
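> The query scans 4K rows from `ml_models`. The window functions partition by `mdl_name`, and `ROW_NUMBER` needs the rows sorted by accuracy within each partition, which is O(n log n). Filtering out NULL accuracies before the window step reduces the sort input.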

> **Interviewers Watch For**
>
> Strong candidates explain their choice of window function (`ROW_NUMBER` vs `RANK` vs `DENSE_RANK`) and why it matches the tie semantics. For a median, tied accuracies must still occupy distinct positions, which is what `ROW_NUMBER` provides. Explicitly mentioning NULL handling before being asked signals production awareness.
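
To make the tie semantics concrete, here is a small sketch (Postgres syntax, hypothetical rows) that puts the three ranking functions side by side on a model with two identical accuracies.

```sql
-- Hypothetical sample rows: two tied accuracies and one higher value.
WITH ml_models (mdl_name, accuracy) AS (
    VALUES ('model_a', 0.50), ('model_a', 0.50), ('model_a', 0.75)
)
SELECT mdl_name, accuracy,
    ROW_NUMBER() OVER w AS row_num,    -- 1, 2, 3: every position exists
    RANK()       OVER w AS rnk,        -- 1, 1, 3: position 2 is skipped
    DENSE_RANK() OVER w AS dense_rnk   -- 1, 1, 2: ties collapse positions
FROM ml_models
WINDOW w AS (PARTITION BY mdl_name ORDER BY accuracy)
ORDER BY mdl_name, accuracy, row_num;
```

With three rows the median sits at position 2. Only `ROW_NUMBER` is guaranteed to produce a row at that position; which of the tied rows lands there does not matter, because tied rows carry the same accuracy.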

> **Common Pitfall**
>
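> NULL values are silently excluded from `COUNT(column)` but included in `COUNT(*)`. Mixing these up produces incorrect totals. In the solution, `COUNT(*) OVER (...)` is safe only because NULL accuracies are filtered out first.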
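
The difference is easy to see on a few hypothetical rows. This is a minimal sketch in Postgres syntax with made-up sample data:

```sql
-- Hypothetical sample rows: one of three records has no accuracy.
WITH ml_models (mdl_name, accuracy) AS (
    VALUES ('model_a', 0.25), ('model_a', NULL), ('model_a', 0.75)
)
SELECT mdl_name,
    COUNT(*)        AS all_rows,     -- 3: counts every row
    COUNT(accuracy) AS scored_rows   -- 2: skips the NULL accuracy
FROM ml_models
GROUP BY mdl_name;
```

If the NULL row were left in and `COUNT(*)` used as the group size, the median positions would be computed for three rows instead of two.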

---

## Common follow-up questions

- What happens to your result if ml_models.accuracy contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on accuracy.)_
- If two rows in ml_models have identical values in the ORDER BY columns, how does your ranking handle the tie? _(Tests understanding of RANK vs DENSE_RANK vs ROW_NUMBER tie-breaking behavior.)_
- If ml_models grows to hundreds of millions of rows, how would you partition or index on train_at to maintain performance? _(Tests partitioning strategy for time-series data in train_at.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/median_model_accuracy)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.