# Worst Table Per Year by DQ Failures

> Every year has a worst table.

Canonical URL: <https://datadriven.io/problems/worst_table_per_year_by_dq_failures>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

For each year, surface the table with the most failed data quality checks and its failure count.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests partitioned top-1 with aggregation. Interviewers check whether you can group by two dimensions (year and table), count failures, and then surface the worst table per year.

> **Trick to Solving**
>
> This is a partitioned top-1 problem: for each year, find the table with the most failures. Use ROW_NUMBER or RANK partitioned by year, ordered by failure count.
> 
> 1. GROUP BY year, table to count failures
> 2. ROW_NUMBER() OVER (PARTITION BY year ORDER BY fail_count DESC)
> 3. Filter to rank = 1

---

### Break down the requirements

#### Step 1: Count failures per year per table

Extract year from `run_at`, filter to `passed = 0`, GROUP BY year and `tbl_name` with COUNT(*).

#### Step 2: Rank tables within each year

`ROW_NUMBER() OVER (PARTITION BY year ORDER BY fail_count DESC)` ranks tables by failure count within each year.

#### Step 3: Filter to the worst table per year

`WHERE rn = 1` returns only the top-failing table per year.

---

### The solution

**Partitioned top-1 by year for failure analysis**

```sql
SELECT year, tbl_name, fail_count
FROM (
    SELECT
        STRFTIME('%Y', run_at) AS year,
        tbl_name,
        COUNT(*) AS fail_count,
        ROW_NUMBER() OVER (
            PARTITION BY STRFTIME('%Y', run_at)
            ORDER BY COUNT(*) DESC
        ) AS rn
    FROM dq_checks
    WHERE passed = 0
    GROUP BY STRFTIME('%Y', run_at), tbl_name
) ranked
WHERE rn = 1
ORDER BY year
```

> **Cost Analysis**
>
> The WHERE passed = 0 filter reduces 700K rows to roughly half. The GROUP BY produces year x 250 tables rows. The window function sorts a small result set per year. Very efficient.

> **Interviewers Watch For**
>
> Whether you filter to failed checks (`passed = 0`) before grouping. Counting all checks instead of just failures would give total runs, not failure counts.

> **Common Pitfall**
>
> Using DENSE_RANK when only one table per year is needed. If two tables tie, ROW_NUMBER picks one arbitrarily; DENSE_RANK would return both. Choose based on the requirement.

---

## Common follow-up questions

- What if two tables tie for the most failures in a year? _(Switch from ROW_NUMBER to DENSE_RANK to include both.)_
- How would you show the top 3 worst tables per year? _(Change WHERE rn = 1 to WHERE rn <= 3.)_
- How would you add the failure rate (failures / total checks) alongside the count? _(Remove the WHERE passed = 0 filter and compute SUM(1 - passed) / COUNT(*) per group.)_
- What if run_at includes timestamps and you want monthly granularity? _(Change STRFTIME('%Y', ...) to STRFTIME('%Y-%m', ...) for monthly partitions.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/worst_table_per_year_by_dq_failures)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.