# Noisiest Tables by DQ Failures

> The tables that fail the most checks.

Canonical URL: <https://datadriven.io/problems/noisiest_tables_by_dq_failures>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

We run data quality checks against various tables. For each table, show the count of failed checks (where passed is false) and the highest failure percentage among those checks. Rank sorted from most failures to least, and skip tables with zero failures.

## Worked solution and explanation

### Why this problem exists in real interviews

The interviewer wants to see you apply grouping to dq_checks.tbl_name while accounting for the distribution of col_name. This surfaces in mid-level screens because small logic errors produce results that look correct at a glance.

---

### Break down the requirements

#### Step 1: Filter to failed checks

`WHERE passed = 0` (or `passed = false`) restricts to checks that failed.

#### Step 2: Group by table name

`GROUP BY tbl_name` aggregates failures per table.

#### Step 3: Compute failure count and max failure percentage

`COUNT(*)` for the number of failed checks and `MAX(fail_pct)` for the worst single check.

#### Step 4: Order by failure count descending

`ORDER BY COUNT(*) DESC` surfaces the noisiest tables first.

---

### The solution

**Filter, group, aggregate, sort**

```sql
SELECT
    tbl_name,
    COUNT(*) AS failure_count,
    MAX(fail_pct) AS max_fail_pct
FROM dq_checks
WHERE passed = 0
GROUP BY tbl_name
ORDER BY failure_count DESC
```

> **Cost Analysis**
>
> The query scans `dq_checks` (500,000 rows). A covering index on the filter and group columns would reduce I/O. At this scale, the full scan is acceptable but becomes costly if the table grows 10x.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> The most common mistake is misreading the prompt's filtering or grouping requirements. Double-check which columns to group by, which to aggregate, and whether the output should be filtered with `WHERE` (before grouping) or `HAVING` (after grouping).

---

## Common follow-up questions

- What happens to your result if dq_checks.severity contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on severity.)_
- How would you verify that your aggregation on dq_checks.check_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in dq_checks.check_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like check_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/noisiest_tables_by_dq_failures)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.