# Median Failure Rate by Table

> Half the tables fail more than this.

Canonical URL: <https://datadriven.io/problems/median_failure_rate_by_table>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Compute the median failure percentage for each table being monitored in data quality checks, rounded to the nearest whole number.

## Worked solution and explanation

### Why this problem exists in real interviews

Querying dq_checks for tbl_name data using percentile calculation and grouping tests whether you can translate a business requirement into the right column references and filter sequence. It shows up in senior-level rounds to verify practical fluency.

> **Trick to Solving**
>
> Median in SQL has no built-in aggregate in most engines. The trick is using `PERCENTILE_CONT(0.5)` or a row-counting approach with `ROW_NUMBER`.
> 
> 1. Recognize that `AVG` will not give you the median
> 2. Use `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...)` if available
> 3. Alternatively, rank rows and pick the middle value(s)

---

### Break down the requirements

#### Step 1: Compute the median with PERCENTILE_CONT

`PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col)` returns the interpolated median. This is the cleanest approach in engines that support ordered-set aggregates.

#### Step 2: Aggregate with the aggregate

Group by the output grain and apply `COUNT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 3: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Rounded PERCENTILE_CONT per monitoring target**

```sql
SELECT tbl_name,
    ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY fail_pct)) AS median_fail_pct
FROM dq_checks
GROUP BY tbl_name
ORDER BY median_fail_pct DESC
```

> **Cost Analysis**
>
> The query scans 700K rows from `dq_checks`. The window function requires a sort, which is O(n log n). Pre-aggregating reduces the sort input.

> **Interviewers Watch For**
>
> Strong candidates explain their choice of window function (`ROW_NUMBER` vs `RANK` vs `DENSE_RANK`) and why it matches the tie semantics. Familiarity with statistical functions beyond COUNT/SUM/AVG distinguishes senior candidates.

> **Common Pitfall**
>
> Using `ROWS` vs `RANGE` in the window frame produces different results when ties exist. Default to `ROWS` unless you specifically need tie grouping.

---

## Common follow-up questions

- What happens to your result if dq_checks.severity contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on severity.)_
- If a group in dq_checks has only one row, what does the percentile function return? _(Tests understanding of percentile behavior with minimal data points.)_
- With millions of distinct values in dq_checks.check_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like check_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/median_failure_rate_by_table)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.