# Median Null Percentage of Float Features

> Nulls in float columns. How widespread?

Canonical URL: <https://datadriven.io/problems/median_null_percentage_of_float_features>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

In the ML feature store, compute the median null percentage across all features that have a float-related data type (any dtype containing 'float').

## Worked solution and explanation

### Why this problem exists in real interviews

This problem tests percentile calculation and pattern matching on `ml_features`, specifically the `dtype` and `null_pct` columns. Interviewers use it in mid-level screens because NULL handling and boundary conditions (an even row count, a single matching row, no matching rows) reveal depth of understanding.

> **Trick to Solving**
>
> Most SQL engines have no built-in `MEDIAN` aggregate. The trick is to use `PERCENTILE_CONT(0.5)` or a row-counting approach with `ROW_NUMBER`.
> 
> 1. Recognize that `AVG` will not give you the median
> 2. Use `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...)` if available
> 3. Alternatively, rank rows and pick the middle value(s)
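The row-counting alternative from the list above can be sketched as follows. This is a minimal sketch, not the reference solution: it assumes the `ml_features`, `dtype`, and `null_pct` names used elsewhere on this page, and it assumes integer division for `/` on integer operands (true in Postgres; other engines may need `FLOOR` or an integer-division operator).

```sql
-- Rank the float features by null_pct, then average the middle row(s).
WITH ranked AS (
    SELECT
        null_pct,
        ROW_NUMBER() OVER (ORDER BY null_pct) AS rn,  -- unique position per row
        COUNT(*) OVER () AS n                         -- total rows after filtering
    FROM ml_features
    WHERE dtype LIKE '%float%'
      AND null_pct IS NOT NULL                        -- mirror PERCENTILE_CONT, which skips NULLs
)
SELECT AVG(null_pct) AS median_null_pct
FROM ranked
-- Odd n: both expressions give the same middle row.
-- Even n: they give the two middle rows, which AVG interpolates.
WHERE rn IN ((n + 1) / 2, (n + 2) / 2);
```

With five rows the filter keeps row 3 only; with four rows it keeps rows 2 and 3, so the `AVG` here averages at most two values rather than the whole column.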

---

### Break down the requirements

#### Step 1: Filter to the target rows

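Apply the pattern match `dtype LIKE '%float%'` in the `WHERE` clause. The `%` wildcards on both sides match any dtype that contains the substring `float`, and the filter narrows the dataset before the median is computed.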
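Before computing the median, it can help to check which dtypes the pattern actually matches. The query below is an illustrative sanity check, assuming the `ml_features` table and `dtype` column from the solution. The `LOWER` call is an optional assumption for engines where `LIKE` is case-sensitive (as in Postgres) and dtypes might be stored in mixed case.

```sql
-- List every dtype the filter keeps, with how many features use it.
SELECT dtype, COUNT(*) AS feature_count
FROM ml_features
WHERE LOWER(dtype) LIKE '%float%'   -- drop LOWER if dtypes are known to be lowercase
GROUP BY dtype
ORDER BY dtype;
```

If this returns dtypes that should not count as float-related, the pattern needs tightening before the median is trustworthy.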

#### Step 2: Compute the median with PERCENTILE_CONT

`PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col)` returns the interpolated median. This is the cleanest approach in engines that support ordered-set aggregates.
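To see what "interpolated" means, here is a small self-contained sketch on four literal values (the `sample` CTE and its column `v` are made up for illustration). It compares `PERCENTILE_CONT` with `PERCENTILE_DISC`, which returns an actual value from the input instead of interpolating.

```sql
-- Four values, so there is no single middle row.
WITH sample(v) AS (
    VALUES (10.0), (20.0), (30.0), (40.0)
)
SELECT
    -- Interpolates between the two middle values (20 and 30): returns 25.
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY CAST(v AS double precision)) AS median_cont,
    -- Picks the first value whose cumulative distribution reaches 0.5: returns 20.
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY CAST(v AS double precision)) AS median_disc
FROM sample;
```

With an odd number of rows the two functions agree. With a single row both return that row's value, and with no rows both return NULL.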

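#### Step 3: Return the single-row result

The query aggregates every filtered row into one value, so the output is a single row and needs no outer `ORDER BY`. The only ordering that matters is the `ORDER BY null_pct` inside `WITHIN GROUP`, which defines the sort the percentile is taken over.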

---

### The solution

**Filtered PERCENTILE_CONT for single-value median**

```sql
-- Median null percentage across features whose dtype contains 'float'.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY null_pct) AS median_null_pct
FROM ml_features
WHERE dtype LIKE '%float%';
```

> **Cost Analysis**
>
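> The query scans 15M rows from `ml_features`; a pattern with a leading wildcard such as `'%float%'` cannot use an ordinary B-tree index. The ordered-set aggregate then sorts the filtered rows, which is O(n log n) in the number of float features. Because the `WHERE` filter runs before the sort, only the matching rows are sorted.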

> **Interviewers Watch For**
>
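> Strong candidates explain why they chose `PERCENTILE_CONT` over `PERCENTILE_DISC` (an interpolated value versus an actual row value when the row count is even). If they fall back to ranking, they explain why `ROW_NUMBER`, which gives every row a unique position, fits median selection better than `RANK` or `DENSE_RANK` when values tie. Explicitly mentioning NULL handling before being asked signals production awareness.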

> **Common Pitfall**
>
> A `WHERE` clause with `= NULL` matches nothing, because the comparison evaluates to NULL rather than true. Use `IS NULL` instead. This is one of the most common SQL mistakes in interviews.
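The pitfall can be illustrated against the same table. This is a sketch that assumes `null_pct` itself may be missing for some features; whether it can be depends on the actual schema, which this page does not show.

```sql
-- Under standard SQL semantics this matches no rows:
-- `null_pct = NULL` evaluates to NULL, never TRUE.
SELECT COUNT(*) AS wrong_count
FROM ml_features
WHERE null_pct = NULL;

-- Correct form: counts float features whose null_pct is missing.
SELECT COUNT(*) AS missing_null_pct
FROM ml_features
WHERE dtype LIKE '%float%'
  AND null_pct IS NULL;
```

`PERCENTILE_CONT` ignores NULL inputs, so any rows counted by the second query do not affect the median; stating that out loud is a simple way to show NULL awareness.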

---

## Common follow-up questions

- What happens to your result if ml_features.avg_val contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on avg_val.)_
- If a group in ml_features has only one row, what does the percentile function return? _(Tests understanding of percentile behavior with minimal data points.)_
- With millions of distinct values in ml_features.feat_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like feat_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/median_null_percentage_of_float_features)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.