# Data Quality

> Failed checks pile up. Which tables need the most attention?

Canonical URL: <https://datadriven.io/problems/data_quality>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

The data engineering team wants to identify databases with a clean migration record. For each database, compute the total number of migrations and total rollbacks. Only return databases that have never had a rollback.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests the universal quantifier "never had a rollback" expressed as a GROUP BY with HAVING. The challenge is expressing "zero occurrences of a condition" using aggregate functions.

---

### Break down the requirements

#### Step 1: Aggregate per database

`GROUP BY db_name` with `COUNT(*)` for total migrations and conditional count for rollbacks.

#### Step 2: Filter for zero rollbacks

`HAVING SUM(CASE WHEN rollback = true THEN 1 ELSE 0 END) = 0` keeps only clean databases.

---

### The solution

**Zero-occurrence filter with conditional HAVING**

```sql
SELECT
    db_name,
    COUNT(*) AS total_migrations,
    SUM(CASE WHEN rollback = true THEN 1 ELSE 0 END) AS total_rollbacks
FROM migrations
GROUP BY db_name
HAVING SUM(CASE WHEN rollback = true THEN 1 ELSE 0 END) = 0
```

> **Cost Analysis**
>
> Scan of 5K rows. Trivially fast. The conditional sum and HAVING filter add negligible overhead.

> **Interviewers Watch For**
>
> Whether the candidate uses the clean HAVING approach vs a NOT EXISTS correlated subquery. Both are correct, but the aggregate approach is more concise for this pattern.

> **Common Pitfall**
>
> Using `WHERE rollback = false` before GROUP BY would exclude rollback rows from the count entirely, meaning you could not detect whether a database ever had one. The HAVING approach preserves all rows for the check.

---

## Common follow-up questions

- How would you express this with NOT EXISTS? _(WHERE NOT EXISTS (SELECT 1 FROM migrations m2 WHERE m2.db_name = m.db_name AND m2.rollback = true).)_
- What if rollback is NULL for some rows? _(The CASE would treat NULL as the ELSE branch (0), effectively ignoring NULLs. Tests NULL awareness.)_
- How would you also show the date of the last migration per database? _(Add MAX(applied) to the SELECT.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/data_quality)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.