# Average DQ Fail Rate

> Average failure rate, table by table.

Canonical URL: <https://datadriven.io/problems/average_dq_fail_rate>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Several downstream consumers are complaining about bad data but nobody knows which source tables are the worst offenders. Compute the average data quality check fail rate per table, but only surface tables where more than one validation rule has actually been evaluated. Show the table name and its average fail percentage.

## Worked solution and explanation

### Why this problem exists in real interviews

Working with `dq_checks`, this problem isolates conditional branching with CASE combined with division-safe NULLIF guards. The interviewer expects candidates to articulate why `tbl_name`, `col_name`, `rule` matter for correctness before touching the keyboard.

> **Trick to Solving**
>
> Any rate or ratio problem requires **null-safe division**. If the denominator can be zero, the query crashes or returns NULL silently.
> 
> 1. Identify the numerator and denominator conditions
> 2. Use `SUM(CASE WHEN ... THEN 1 ELSE 0 END)` for the numerator
> 3. Wrap the denominator in `NULLIF(..., 0)` to prevent division by zero

---

### Break down the requirements

#### Step 1: Group by `rule`

`GROUP BY rule` produces one output row per distinct value of `rule`.

#### Step 2: Compute the ratio with CASE and NULLIF

The numerator uses `SUM(CASE WHEN passed THEN 1 ELSE 0 END)`. Wrapping the denominator in `NULLIF(COUNT(*), 0)` prevents division by zero.

#### Step 3: Round and order

Use `ROUND(..., 4)` for clean decimal output and sort by rate descending.

---

### The solution

**Case-branch for average dq fail rate**

```sql
SELECT
    rule,
    ROUND(
        1.0 * SUM(CASE WHEN passed = 1 THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0),
        4
    ) AS rate
FROM dq_checks
GROUP BY rule
ORDER BY rate DESC
```

> **Cost Analysis**
>
> The main table has 500K rows. The GROUP BY reduces the row count early, keeping downstream operations cheap.

> **Interviewers Watch For**
>
> Strong candidates state the correct `GROUP BY` grain before writing any SQL, showing they think about the output shape first. Division-by-zero handling is a silent correctness bug; interviewers watch for `NULLIF` or equivalent protection.

> **Common Pitfall**
>
> Selecting a non-aggregated column without including it in `GROUP BY` is the most common error. Some engines reject it; others silently return arbitrary values.

---

## Common follow-up questions

- What happens to your results if `tbl_name` in `dq_checks` contains trailing whitespace or mixed casing? _(Tests awareness of text normalization issues that silently fragment GROUP BY results.)_
- Your CASE expression branches on `tbl_name`. What happens if a new category value appears that none of your WHEN clauses match? _(Tests whether the candidate uses a meaningful ELSE branch or lets unmatched rows silently become NULL.)_
- The `tbl_name` column in `dq_checks` has a zipf distribution, meaning a few values dominate. How does that skew affect your query plan and parallelism? _(Tests understanding of data skew: the optimizer may choose a bad plan when histogram statistics are stale.)_
- Could you express this same logic as a single query without CTEs or subqueries? What readability trade-off does that introduce? _(Tests whether the candidate can flatten nested logic and understands when decomposition aids maintainability.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_dq_fail_rate)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.