SQL DISTINCT and Deduplication Practice

DISTINCT is not a topic interviewers test in isolation. It is a tool you reach for when a JOIN fans out or when you need to deduplicate before aggregating. If your query needs DISTINCT, ask yourself whether the JOIN is wrong first.

Deduplication Techniques

Five approaches to handling duplicates, from simple to advanced. Each has different tradeoffs in portability, performance, and flexibility.

DISTINCT

SELECT DISTINCT city, state FROM customers

Removes duplicate rows from the result set. Operates on all selected columns together, not just the first one. SELECT DISTINCT city, state returns unique (city, state) pairs, not unique cities and unique states separately.

Common use cases

  • Get unique values from a column for exploration
  • Remove exact duplicate rows from a result
  • Count unique categories or types
  • Build a reference list from transactional data

Interview tip: DISTINCT applies to the entire row, not individual columns. A common mistake is thinking SELECT DISTINCT a, b deduplicates on column a alone. It deduplicates on the combination of (a, b).

DISTINCT ON (Postgres)

SELECT DISTINCT ON (user_id) user_id, event_type, created_at
FROM events
ORDER BY user_id, created_at DESC

Postgres-specific. Returns the first row for each unique value of the specified columns, based on the ORDER BY. This is the fastest way to get 'the most recent row per group' in Postgres. Not available in BigQuery, Snowflake, or MySQL.

Common use cases

  • Most recent event per user (faster than ROW_NUMBER in Postgres)
  • Latest price per product
  • First login per user per day
  • Dedup rows keeping the most complete version

Interview tip: DISTINCT ON requires that the ORDER BY starts with the same columns as the DISTINCT ON clause. The remaining ORDER BY columns determine which row is 'first.' If you use this in an interview, mention it is Postgres-only.

ROW_NUMBER for Deduplication

WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id ORDER BY updated_at DESC
  ) AS rn
  FROM users
)
SELECT * FROM ranked WHERE rn = 1

The most portable deduplication pattern. Works in every SQL dialect. Assign a row number within each group, ordered by your preference (most recent, highest priority, etc.), then filter to rn = 1.

Common use cases

  • Keep the most recent record per entity
  • Deduplicate based on a priority column
  • Pick one row per group with full control over which one
  • Clean up tables with multiple versions of the same record

Interview tip: This is the go-to dedup pattern for interviews. It works everywhere, it is explicit about which row you keep, and it handles ties predictably (ROW_NUMBER always picks one). If you only memorize one dedup technique, make it this one.

GROUP BY for Deduplication

SELECT user_id, MAX(updated_at) AS latest_update, MAX(email) AS email
FROM users
GROUP BY user_id

Uses aggregation to collapse duplicates. Works well when you need a few specific columns and can aggregate the rest meaningfully (MAX for most recent, MIN for earliest, ANY_VALUE for identical values).

Common use cases

  • Collapse multiple rows into one per key
  • Get the latest timestamp per entity
  • Merge duplicate records by taking MAX/MIN of each field
  • Simple dedup when you do not need all original columns

Interview tip: GROUP BY dedup is simpler than ROW_NUMBER but less flexible. You cannot easily get 'the email from the most recent row' without a subquery. If the interviewer asks for all columns from the best row, use ROW_NUMBER instead.

COUNT(DISTINCT)

SELECT COUNT(DISTINCT user_id) AS unique_users FROM events

Counts unique values in a column, ignoring NULLs. This is the standard way to calculate unique counts (unique visitors, unique products sold, etc.). Can be expensive on large datasets because it requires sorting or hashing all values.

Common use cases

  • Count unique users, products, or sessions
  • Calculate distinct counts within groups
  • Compare total events vs unique users
  • Measure cardinality of a column

Interview tip: COUNT(DISTINCT x) ignores NULLs. If NULLs represent something meaningful (like anonymous users), you need to handle them separately. Also, COUNT(DISTINCT x, y) is not valid in most dialects. Use COUNT(DISTINCT CONCAT(x, y)) or a subquery.

Deduplication Patterns for Data Pipelines

In production, deduplication is not a one-time fix. It is a recurring operation that runs every time your pipeline ingests data. The right approach depends on where duplicates come from and how often they appear. At-least-once delivery: Most event streaming systems (Kafka, Pub/Sub, Kinesis) guarantee at-least-once delivery, which means duplicates are expected. The standard fix: assign each event a unique ID at the source and dedup on that ID during ingestion. ROW_NUMBER partitioned by event_id, ordered by ingestion_timestamp, keep rn = 1. Late-arriving data: When the same entity arrives with different timestamps (a user profile updated multiple times), you need to keep the most recent version. This is a slowly changing dimension (SCD Type 1) pattern. ROW_NUMBER partitioned by entity_id, ordered by updated_at DESC, keep rn = 1. Run this in your transformation layer (dbt, Spark, or a scheduled SQL job). Near-duplicates: The hardest case. Records that represent the same entity but have slightly different values (name casing, whitespace, abbreviations). No single SQL function solves this. You need a normalization step (LOWER, TRIM, REPLACE) to create a matching key, then dedup on that key. In data engineering interviews, describing this pipeline shows real-world experience.
Prepare for the interview
01 / Open invite
02min.

Know DISTINCT the way the interviewer who asks it knows it.

a DISTINCT query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1SELECT user_id,
2 COUNT(*) AS sessions
3FROM events
4WHERE ts >= NOW() - INTERVAL '7 day'
5
Execute your solution0.4s avg.
MetaInterview question
Solve a DISTINCT problem

Practice Problems

Five problems that test deduplication in realistic scenarios. From basic DISTINCT to multi-step sessionization.

1. Dedup Customer Records (Keep Most Recent) (Easy)

A customers table has duplicate rows due to a broken ETL job. Each row has (customer_id, name, email, phone, updated_at). Some customers have 2-5 duplicate rows with different updated_at values. Write a query that returns exactly one row per customer: the one with the latest updated_at.

Show hint

Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn, then filter WHERE rn = 1. This is the standard pattern and works in every dialect.

2. Daily Active Users Without Double-Counting (Medium)

An events table has (user_id, event_type, event_timestamp). A user who triggers 50 events in a day should count as 1 DAU. Write a query that returns the DAU count for each day in the last 30 days. Handle days with zero activity.

Show hint

COUNT(DISTINCT user_id) grouped by DATE_TRUNC(event_timestamp, day). For zero-activity days, generate a date series and LEFT JOIN the activity counts. COALESCE the count to 0.

3. Find and Remove Near-Duplicates (Hard)

A products table has (id, name, price, category). Some products are near-duplicates: same category, same price, and names that differ only in casing or trailing whitespace. Write a query that identifies all near-duplicate groups and keeps only the row with the lowest id.

Show hint

Normalize names with LOWER(TRIM(name)). Group by (LOWER(TRIM(name)), price, category). Use ROW_NUMBER within each group ordered by id ASC. Rows where rn > 1 are the duplicates to remove.

4. Distinct Products Per Customer Segment (Medium)

Given orders (order_id, customer_id, product_id) and customers (customer_id, segment), calculate the number of distinct products purchased by each customer segment. Then find which product is purchased by the most segments.

Show hint

First query: COUNT(DISTINCT product_id) grouped by segment. Second query: COUNT(DISTINCT segment) grouped by product_id, ordered DESC, LIMIT 1. Two straightforward COUNT(DISTINCT) queries.

5. Sessionize Events with Dedup (Hard)

An events table has (user_id, event_type, event_timestamp). Events within 30 minutes of each other belong to the same session. Some events are exact duplicates (same user, type, and timestamp). First deduplicate exact duplicates, then assign session IDs, then return session_id, user_id, session_start, session_end, and event_count per session.

Show hint

Step 1: DISTINCT or GROUP BY to remove exact dupes. Step 2: Use LAG(event_timestamp) to find gaps > 30 minutes. Step 3: SUM a flag (1 when gap > 30 min, 0 otherwise) as a running total to create session IDs. This is the gaps-and-islands technique.

Worked Example: Deduplicate Customer Records With ROW_NUMBER

A broken ETL job duplicated rows in the customers table. Each customer may have 1-5 rows with different updated_at timestamps. Keep only the most recent row per customer.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date DESC
    ) AS rn
  FROM transactions
)
SELECT user_id, product_id, total_amount, transaction_date
FROM ranked
WHERE rn = 1
ORDER BY user_id;
ROW_NUMBER assigns 1 to the most recent row within each customer_id group (because of ORDER BY updated_at DESC). Filtering WHERE rn = 1 keeps exactly one row per customer. This pattern works in every SQL dialect and gives you full control over which row survives. Unlike DISTINCT, you keep all columns from the winning row without needing to aggregate anything.
 customer_id | name         | email             | phone      | updated_at
-------------+--------------+-------------------+------------+---------------------
           1 | Alice Chen   | alice@example.com | 5551234567 | 2024-03-15 09:22:00
           2 | Bob Park     | bob@newmail.com   | 5559876543 | 2024-03-14 16:45:00
           3 | Carol Jones  | carol@work.io     | 5555551212 | 2024-03-12 11:03:00

Frequently asked questions

What is the difference between DISTINCT and GROUP BY for deduplication?+
DISTINCT removes duplicate rows from the result. GROUP BY collapses rows and lets you apply aggregate functions (MAX, MIN, COUNT) to the grouped rows. For simple dedup where you just want unique rows, both produce the same result. For dedup where you need to pick specific values from the duplicates (like the most recent timestamp), GROUP BY with aggregates or ROW_NUMBER is required.
Is DISTINCT ON available outside of Postgres?+
No. DISTINCT ON is a Postgres extension not part of the SQL standard. BigQuery, Snowflake, MySQL, and SQL Server do not support it. The equivalent in other dialects is the ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) pattern with a CTE or subquery. In interviews, use ROW_NUMBER unless the interviewer specifically says Postgres.
How does COUNT(DISTINCT) handle NULLs?+
COUNT(DISTINCT column) ignores NULL values entirely. If a column has values [1, 2, NULL, 2, NULL], COUNT(DISTINCT column) returns 2 (the values 1 and 2). COUNT(*) would return 5. This is consistent with how all aggregate functions in SQL treat NULLs: they skip them. If you need to count NULLs as a distinct value, use COUNT(DISTINCT COALESCE(column, sentinel_value)).
What is the most efficient way to deduplicate a large table?+
For Postgres, DISTINCT ON is typically fastest because the optimizer can use an index scan with early termination. For other dialects, ROW_NUMBER with a CTE is the standard approach. GROUP BY with MAX/MIN can be faster than ROW_NUMBER when you only need a few aggregate columns. For very large tables, consider materializing the deduped result into a new table rather than deduplicating on every query. Always check the query plan with EXPLAIN ANALYZE.
02 / Why practice

Master Every Dedup Technique

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

More reading