# Tables With Many DQ Failures

> Some tables have never once passed QA.

Canonical URL: <https://datadriven.io/problems/tables_with_many_dq_failures>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Our data quality checks table logs pass/fail results per table. Find tables with three or more failed checks, count their total failures, and rank sorted from most failures to least.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests whether a candidate can demonstrate filtering groups after aggregation. This pattern appears frequently in mid-level SQL rounds where interviewers want to see structured thinking.

---

### Break down the requirements

#### Step 1: Apply the WHERE filter

Filter rows before any aggregation. This ensures only qualifying data enters the computation, keeping the result correct and the scan minimal.

#### Step 2: Aggregate by `dc.tbl_name`

`GROUP BY dc.tbl_name` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Filter groups with HAVING

HAVING applies after GROUP BY, filtering out groups that do not meet the threshold. This cannot be done in WHERE because the aggregate has not been computed yet.

#### Step 4: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

---

### The solution

**Group dq_checks by table name with HAVING failure count >= 3**

```sql
SELECT dc.tbl_name, COUNT(*) AS fail_count
FROM dq_checks dc
WHERE dc.passed = 0
GROUP BY dc.tbl_name
HAVING COUNT(*) >= 3
ORDER BY fail_count DESC
```

> **Cost Analysis**
>
> With ~600K rows, the GROUP BY reduces the working set before any downstream operations. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you use HAVING (not WHERE) to filter after aggregation.

> **Common Pitfall**
>
> Putting the aggregate condition in WHERE instead of HAVING causes a syntax error. WHERE runs before GROUP BY; HAVING runs after.

---

## Common follow-up questions

- If the passed column is a boolean and some rows have NULL, should NULLs count as failures? _(Tests NULL semantics; NOT passed would not match NULL, so explicit handling is needed.)_
- What is the difference between ranking with ROW_NUMBER and DENSE_RANK when two tables have the same failure count? _(Tests ranking function choice and its effect on tie handling.)_
- How would you also show the most common failing rule per table alongside the count? _(Tests the ability to add a mode-style aggregation, often requiring a subquery or window function.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/tables_with_many_dq_failures)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.