SQL DISTINCT and Deduplication Practice
DISTINCT is not a topic interviewers test in isolation. It is a tool you reach for when a JOIN fans out or when you need to deduplicate before aggregating. If your query needs DISTINCT, ask yourself whether the JOIN is wrong first.
Deduplication Techniques
DISTINCT
SELECT DISTINCT city, state FROM customersRemoves duplicate rows from the result set. Operates on all selected columns together, not just the first one. SELECT DISTINCT city, state returns unique (city, state) pairs, not unique cities and unique states separately.
Common use cases
- Get unique values from a column for exploration
- Remove exact duplicate rows from a result
- Count unique categories or types
- Build a reference list from transactional data
Interview tip: DISTINCT applies to the entire row, not individual columns. A common mistake is thinking SELECT DISTINCT a, b deduplicates on column a alone. It deduplicates on the combination of (a, b).
DISTINCT ON (Postgres)
SELECT DISTINCT ON (user_id) user_id, event_type, created_at
FROM events
ORDER BY user_id, created_at DESCPostgres-specific. Returns the first row for each unique value of the specified columns, based on the ORDER BY. This is the fastest way to get 'the most recent row per group' in Postgres. Not available in BigQuery, Snowflake, or MySQL.
Common use cases
- Most recent event per user (faster than ROW_NUMBER in Postgres)
- Latest price per product
- First login per user per day
- Dedup rows keeping the most complete version
Interview tip: DISTINCT ON requires that the ORDER BY starts with the same columns as the DISTINCT ON clause. The remaining ORDER BY columns determine which row is 'first.' If you use this in an interview, mention it is Postgres-only.
ROW_NUMBER for Deduplication
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY updated_at DESC
) AS rn
FROM users
)
SELECT * FROM ranked WHERE rn = 1The most portable deduplication pattern. Works in every SQL dialect. Assign a row number within each group, ordered by your preference (most recent, highest priority, etc.), then filter to rn = 1.
Common use cases
- Keep the most recent record per entity
- Deduplicate based on a priority column
- Pick one row per group with full control over which one
- Clean up tables with multiple versions of the same record
Interview tip: This is the go-to dedup pattern for interviews. It works everywhere, it is explicit about which row you keep, and it handles ties predictably (ROW_NUMBER always picks one). If you only memorize one dedup technique, make it this one.
GROUP BY for Deduplication
SELECT user_id, MAX(updated_at) AS latest_update, MAX(email) AS email
FROM users
GROUP BY user_idUses aggregation to collapse duplicates. Works well when you need a few specific columns and can aggregate the rest meaningfully (MAX for most recent, MIN for earliest, ANY_VALUE for identical values).
Common use cases
- Collapse multiple rows into one per key
- Get the latest timestamp per entity
- Merge duplicate records by taking MAX/MIN of each field
- Simple dedup when you do not need all original columns
Interview tip: GROUP BY dedup is simpler than ROW_NUMBER but less flexible. You cannot easily get 'the email from the most recent row' without a subquery. If the interviewer asks for all columns from the best row, use ROW_NUMBER instead.
COUNT(DISTINCT)
SELECT COUNT(DISTINCT user_id) AS unique_users FROM eventsCounts unique values in a column, ignoring NULLs. This is the standard way to calculate unique counts (unique visitors, unique products sold, etc.). Can be expensive on large datasets because it requires sorting or hashing all values.
Common use cases
- Count unique users, products, or sessions
- Calculate distinct counts within groups
- Compare total events vs unique users
- Measure cardinality of a column
Interview tip: COUNT(DISTINCT x) ignores NULLs. If NULLs represent something meaningful (like anonymous users), you need to handle them separately. Also, COUNT(DISTINCT x, y) is not valid in most dialects. Use COUNT(DISTINCT CONCAT(x, y)) or a subquery.
Deduplication Patterns for Data Pipelines
Know DISTINCT the way the interviewer who asks it knows it.
Practice Problems
1. Dedup Customer Records (Keep Most Recent) (Easy)
A customers table has duplicate rows due to a broken ETL job. Each row has (customer_id, name, email, phone, updated_at). Some customers have 2-5 duplicate rows with different updated_at values. Write a query that returns exactly one row per customer: the one with the latest updated_at.
Show hint
Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn, then filter WHERE rn = 1. This is the standard pattern and works in every dialect.
2. Daily Active Users Without Double-Counting (Medium)
An events table has (user_id, event_type, event_timestamp). A user who triggers 50 events in a day should count as 1 DAU. Write a query that returns the DAU count for each day in the last 30 days. Handle days with zero activity.
Show hint
COUNT(DISTINCT user_id) grouped by DATE_TRUNC(event_timestamp, day). For zero-activity days, generate a date series and LEFT JOIN the activity counts. COALESCE the count to 0.
3. Find and Remove Near-Duplicates (Hard)
A products table has (id, name, price, category). Some products are near-duplicates: same category, same price, and names that differ only in casing or trailing whitespace. Write a query that identifies all near-duplicate groups and keeps only the row with the lowest id.
Show hint
Normalize names with LOWER(TRIM(name)). Group by (LOWER(TRIM(name)), price, category). Use ROW_NUMBER within each group ordered by id ASC. Rows where rn > 1 are the duplicates to remove.
4. Distinct Products Per Customer Segment (Medium)
Given orders (order_id, customer_id, product_id) and customers (customer_id, segment), calculate the number of distinct products purchased by each customer segment. Then find which product is purchased by the most segments.
Show hint
First query: COUNT(DISTINCT product_id) grouped by segment. Second query: COUNT(DISTINCT segment) grouped by product_id, ordered DESC, LIMIT 1. Two straightforward COUNT(DISTINCT) queries.
5. Sessionize Events with Dedup (Hard)
An events table has (user_id, event_type, event_timestamp). Events within 30 minutes of each other belong to the same session. Some events are exact duplicates (same user, type, and timestamp). First deduplicate exact duplicates, then assign session IDs, then return session_id, user_id, session_start, session_end, and event_count per session.
Show hint
Step 1: DISTINCT or GROUP BY to remove exact dupes. Step 2: Use LAG(event_timestamp) to find gaps > 30 minutes. Step 3: SUM a flag (1 when gap > 30 min, 0 otherwise) as a running total to create session IDs. This is the gaps-and-islands technique.
Worked Example: Deduplicate Customer Records With ROW_NUMBER
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY transaction_date DESC
) AS rn
FROM transactions
)
SELECT user_id, product_id, total_amount, transaction_date
FROM ranked
WHERE rn = 1
ORDER BY user_id; customer_id | name | email | phone | updated_at
-------------+--------------+-------------------+------------+---------------------
1 | Alice Chen | alice@example.com | 5551234567 | 2024-03-15 09:22:00
2 | Bob Park | bob@newmail.com | 5559876543 | 2024-03-14 16:45:00
3 | Carol Jones | carol@work.io | 5555551212 | 2024-03-12 11:03:00Frequently asked questions
What is the difference between DISTINCT and GROUP BY for deduplication?+
Is DISTINCT ON available outside of Postgres?+
How does COUNT(DISTINCT) handle NULLs?+
What is the most efficient way to deduplicate a large table?+
Master Every Dedup Technique
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition