# Verify Commit ID Uniqueness

> Duplicate commit IDs. Are there any?

Canonical URL: <https://datadriven.io/problems/verify_commit_id_uniqueness>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

You suspect duplicate commit IDs crept into the repo metadata table. Show the total count of commit IDs next to the count of unique commit IDs so you can verify at a glance.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests data quality validation using aggregate comparison. Interviewers check whether you can use `COUNT` vs `COUNT(DISTINCT)` to detect duplicates in a single query.

---

### Break down the requirements

#### Step 1: Count total commit IDs

`COUNT(commit_id)` returns the total number of commit ID values (excluding NULLs).

#### Step 2: Count distinct commit IDs

`COUNT(DISTINCT commit_id)` returns the number of unique values. If both counts match, there are no duplicates.

---

### The solution

**Side-by-side total vs distinct for duplicate detection**

```sql
SELECT
    COUNT(commit_id) AS total_count,
    COUNT(DISTINCT commit_id) AS unique_count
FROM repo_commits
```

> **Cost Analysis**
>
> Single scan of 3M rows. COUNT(DISTINCT) builds a hash set of 3M string values. Memory usage depends on commit_id length (typically 40-char SHA hashes, so ~120MB for the hash set).

> **Interviewers Watch For**
>
> Whether you also consider NULL commit_ids. `COUNT(commit_id)` excludes NULLs, so if NULLs exist, the total count may be less than `COUNT(*)`. In this table, commit_id has 0% null fraction.

> **Common Pitfall**
>
> Using `COUNT(*)` instead of `COUNT(commit_id)`. If commit_id could be NULL, `COUNT(*)` counts the row but does not reveal whether the ID itself is populated.

---

## Common follow-up questions

- If duplicates exist, how would you find them? _(GROUP BY commit_id HAVING COUNT(*) > 1 lists all duplicate IDs.)_
- What if you needed to verify uniqueness per repo? _(Add GROUP BY repo_name to check uniqueness within each repository.)_
- How would you enforce uniqueness going forward? _(Add a UNIQUE constraint or index on commit_id to prevent future duplicates.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/verify_commit_id_uniqueness)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.