# Duplicate DQ Check Records

> Passed QA twice. That's the problem.

Canonical URL: <https://datadriven.io/problems/duplicate_dq_check_records>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

A downstream team reported that some entries in the data quality checks table appear more than once for the same table and column combination. Surface any table-and-column pairs that have multiple records, along with the duplicate count.

## Worked solution and explanation

### Why this problem exists in real interviews

Duplicate detection is one of the most common data quality tasks in production pipelines. This tests whether you understand the `GROUP BY` / `HAVING` pattern and whether you correctly identify which columns define a "duplicate."

---

### Break down the requirements

#### Step 1: Group by the duplicate key

`GROUP BY tbl_name, col_name` treats each table-column pair as a distinct entity, matching the prompt's definition of duplicates.

#### Step 2: Filter with HAVING

`HAVING COUNT(*) > 1` keeps only groups with multiple records. Include `COUNT(*)` in the SELECT for the duplicate count.

#### Step 3: Order for readability

Sort by `dup_count DESC` so the worst offenders appear first for triage.

---

### The solution

**GROUP BY with HAVING for duplicate detection**

```sql
SELECT tbl_name, col_name, COUNT(*) AS dup_count
FROM dq_checks
GROUP BY tbl_name, col_name
HAVING COUNT(*) > 1
ORDER BY dup_count DESC
```

> **Cost Analysis**
>
> Single pass over 800K rows with hash aggregation. A composite index on `(tbl_name, col_name)` enables an index-only scan.

> **Interviewers Watch For**
>
> The key signal is whether the candidate identifies `tbl_name, col_name` as the duplicate key from the prompt, rather than including extra columns like `rule`.

> **Common Pitfall**
>
> Including too many columns in GROUP BY (e.g., `rule` or `run_at`) fragments the groups and hides real duplicates. Read the prompt carefully.

---

## Common follow-up questions

- How would you keep one record per duplicate group and delete the rest? _(Tests ROW_NUMBER-based deduplication or DELETE with a CTE.)_
- What if the duplicate definition also included the rule column? _(Tests adjusting the GROUP BY key based on business requirements.)_
- How would you prevent duplicates from appearing in the first place? _(Tests UNIQUE constraints and upsert patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/duplicate_dq_check_records)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.