# DQ Score Spread

> The spread in data quality scores.

Canonical URL: <https://datadriven.io/problems/dq_score_spread>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Each table's data quality total is the sum of fail percentages across all its rules. What is the difference between the table with the highest total and the table with the lowest total?

## Worked solution and explanation

### Why this problem exists in real interviews

This probes your ability to **aggregate to a derived grain** and then compute a scalar from that aggregate. The interviewer wants to see whether you understand that the spread requires two nested levels of aggregation.

> **Trick to Solving**
>
> When a prompt asks for "the difference between the highest and lowest" of a computed metric, you need a subquery or CTE that first computes the per-group metric, then an outer query that takes `MAX` minus `MIN`.
> 
> 1. Spot the two-level aggregation: per-table totals, then min/max across tables
> 2. Use a CTE or subquery for the first aggregation
> 3. Apply `MAX(...) - MIN(...)` on the outer query

---

### Break down the requirements

#### Step 1: Compute per-table DQ totals

`GROUP BY tbl_name` with `SUM(fail_pct)` collapses 400K rows in `dq_checks` to 150 rows.

#### Step 2: Calculate the spread

Wrap the aggregated result in a subquery and compute `MAX(total_fail) - MIN(total_fail)`.

---

### The solution

**Two-level aggregation with subquery**

```sql
SELECT MAX(total_fail) - MIN(total_fail) AS dq_spread
FROM (
    SELECT tbl_name, SUM(fail_pct) AS total_fail
    FROM dq_checks
    GROUP BY tbl_name
) t
```

> **Cost Analysis**
>
> The inner `GROUP BY` reduces 400K rows to 150. The outer `MAX`/`MIN` is trivial. A covering index on `(tbl_name, fail_pct)` avoids a full table scan.

> **Interviewers Watch For**
>
> Candidates who try `MAX(SUM(fail_pct))` hit a syntax error because nested aggregates are illegal without a subquery. The interviewer checks whether you instinctively reach for a CTE.

> **Common Pitfall**
>
> Nesting `MAX(SUM(...))` in a single query level is a syntax error in every dialect. Always compute the inner aggregate first, then operate on it in an outer query.

---

## Common follow-up questions

- How would you also return which tables had the highest and lowest totals? _(Tests extending the subquery with additional filtering or window functions.)_
- What if fail_pct could be NULL for some rows? _(Tests NULL-awareness: SUM ignores NULLs, but all-NULL groups get a NULL total.)_
- How would you compute the spread per severity level? _(Tests adding a dimension to both aggregation levels.)_
- What if the table had 50K distinct tbl_name values? _(Tests awareness that subquery output size affects outer query cost.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/dq_score_spread)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.