SQL Interview Deep Dive

SQL DISTINCT and Deduplication Practice

DISTINCT is not a topic interviewers test in isolation. It is a tool you reach for when a JOIN fans out or when you need to deduplicate before aggregating. If your query needs DISTINCT, ask yourself whether the JOIN is wrong first.

Duplicate data is one of the most common data quality issues. Know every way to handle it.

Deduplication Techniques

Five approaches to handling duplicates, from simple to advanced. Each has different tradeoffs in portability, performance, and flexibility.

DISTINCT

SELECT DISTINCT city, state FROM customers

Removes duplicate rows from the result set. Operates on all selected columns together, not just the first one. SELECT DISTINCT city, state returns unique (city, state) pairs, not unique cities and unique states separately.

Common use cases

  • Get unique values from a column for exploration
  • Remove exact duplicate rows from a result
  • Count unique categories or types
  • Build a reference list from transactional data

Interview tip: DISTINCT applies to the entire row, not individual columns. A common mistake is thinking SELECT DISTINCT a, b deduplicates on column a alone. It deduplicates on the combination of (a, b).
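A quick illustration, assuming a few hypothetical rows in customers:

```sql
-- Hypothetical rows: ('Austin', 'TX'), ('Austin', 'TX'), ('Portland', 'OR')
SELECT DISTINCT city, state
FROM customers;
-- Returns two rows: ('Austin', 'TX') and ('Portland', 'OR').
-- Deduplication is on the (city, state) pair, not each column separately.
```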

DISTINCT ON (Postgres)

SELECT DISTINCT ON (user_id) user_id, event_type, created_at
FROM events
ORDER BY user_id, created_at DESC

Postgres-specific. Returns the first row for each unique value of the specified columns, based on the ORDER BY. It is often the fastest way to get 'the most recent row per group' in Postgres, especially with a matching index. Not available in BigQuery, Snowflake, or MySQL.

Common use cases

  • Most recent event per user (faster than ROW_NUMBER in Postgres)
  • Latest price per product
  • First login per user per day
  • Dedup rows keeping the most complete version

Interview tip: DISTINCT ON requires that the ORDER BY starts with the same columns as the DISTINCT ON clause. The remaining ORDER BY columns determine which row is 'first.' If you use this in an interview, mention it is Postgres-only.
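For dialects without DISTINCT ON, the same result can be written with ROW_NUMBER. A portable sketch of the events query above:

```sql
WITH ranked AS (
  SELECT
    user_id,
    event_type,
    created_at,
    ROW_NUMBER() OVER (
      PARTITION BY user_id       -- same columns as DISTINCT ON (user_id)
      ORDER BY created_at DESC   -- same tiebreak as the ORDER BY tail
    ) AS rn
  FROM events
)
SELECT user_id, event_type, created_at
FROM ranked
WHERE rn = 1;
```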

ROW_NUMBER for Deduplication

WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id ORDER BY updated_at DESC
  ) AS rn
  FROM users
)
SELECT * FROM ranked WHERE rn = 1

The most portable deduplication pattern. Works in every SQL dialect. Assign a row number within each group, ordered by your preference (most recent, highest priority, etc.), then filter to rn = 1.

Common use cases

  • Keep the most recent record per entity
  • Deduplicate based on a priority column
  • Pick one row per group with full control over which one
  • Clean up tables with multiple versions of the same record

Interview tip: This is the go-to dedup pattern for interviews. It works everywhere, it is explicit about which row you keep, and it handles ties predictably (ROW_NUMBER always picks one). If you only memorize one dedup technique, make it this one.

GROUP BY for Deduplication

SELECT user_id, MAX(updated_at) AS latest_update, MAX(email) AS email
FROM users
GROUP BY user_id

Uses aggregation to collapse duplicates. Works well when you need a few specific columns and can aggregate the rest meaningfully (MAX for the most recent, MIN for the earliest, ANY_VALUE, where supported, for columns that are identical across duplicates).

Common use cases

  • Collapse multiple rows into one per key
  • Get the latest timestamp per entity
  • Merge duplicate records by taking MAX/MIN of each field
  • Simple dedup when you do not need all original columns

Interview tip: GROUP BY dedup is simpler than ROW_NUMBER but less flexible. You cannot easily get 'the email from the most recent row' without a subquery. If the interviewer asks for all columns from the best row, use ROW_NUMBER instead.
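One way to get 'the email from the most recent row' with GROUP BY is to join the aggregate back to the base table. A sketch using the users columns above:

```sql
SELECT u.user_id, u.email, u.updated_at
FROM users u
JOIN (
  SELECT user_id, MAX(updated_at) AS latest_update
  FROM users
  GROUP BY user_id
) latest
  ON u.user_id = latest.user_id
 AND u.updated_at = latest.latest_update;
-- Caveat: ties on updated_at return multiple rows per user.
-- ROW_NUMBER avoids that by always picking exactly one.
```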

COUNT(DISTINCT)

SELECT COUNT(DISTINCT user_id) AS unique_users FROM events

Counts unique values in a column, ignoring NULLs. This is the standard way to calculate unique counts (unique visitors, unique products sold, etc.). Can be expensive on large datasets because it requires sorting or hashing all values.

Common use cases

  • Count unique users, products, or sessions
  • Calculate distinct counts within groups
  • Compare total events vs unique users
  • Measure cardinality of a column

Interview tip: COUNT(DISTINCT x) ignores NULLs. If NULLs represent something meaningful (like anonymous users), you need to handle them separately. Also, COUNT(DISTINCT x, y) is not valid in most dialects. Use COUNT(DISTINCT CONCAT(x, '|', y)) with a separator (otherwise ('ab', 'c') and ('a', 'bc') collide) or a subquery.
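Both multi-column workarounds can be sketched like this (t, x, and y are placeholder names, not tables defined above):

```sql
-- Concatenation needs a separator, or ('ab', 'c') and ('a', 'bc') collide:
SELECT COUNT(DISTINCT CONCAT(x, '|', y)) AS pair_count
FROM t;

-- The subquery form avoids collisions entirely:
SELECT COUNT(*) AS pair_count
FROM (SELECT DISTINCT x, y FROM t) AS pairs;
```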

Deduplication Patterns for Data Pipelines

In production, deduplication is not a one-time fix. It is a recurring operation that runs every time your pipeline ingests data. The right approach depends on where duplicates come from and how often they appear.

At-least-once delivery. Most event streaming systems (Kafka, Pub/Sub, Kinesis) guarantee at-least-once delivery, which means duplicates are expected. The standard fix: assign each event a unique ID at the source and dedup on that ID during ingestion. ROW_NUMBER partitioned by event_id, ordered by ingestion_timestamp, keep rn = 1.
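A sketch of that ingestion-time dedup, assuming a hypothetical events_raw table with (event_id, payload, ingestion_timestamp):

```sql
-- Dedup at-least-once deliveries on a source-assigned event_id.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY ingestion_timestamp ASC  -- keep the first copy ingested
    ) AS rn
  FROM events_raw
)
SELECT event_id, payload, ingestion_timestamp
FROM ranked
WHERE rn = 1;
```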

Late-arriving data. When the same entity arrives with different timestamps (a user profile updated multiple times), you need to keep the most recent version. This is a slowly changing dimension (SCD Type 1) pattern. ROW_NUMBER partitioned by entity_id, ordered by updated_at DESC, keep rn = 1. Run this in your transformation layer (dbt, Spark, or a scheduled SQL job).

Near-duplicates. The hardest case. Records that represent the same entity but have slightly different values (name casing, whitespace, abbreviations). No single SQL function solves this. You need a normalization step (LOWER, TRIM, REPLACE) to create a matching key, then dedup on that key. In data engineering interviews, describing this pipeline shows real-world experience.
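A sketch of the normalize-then-dedup pipeline, assuming a hypothetical products_raw staging table with (id, name, price, category):

```sql
-- Build a normalized matching key, then keep the lowest id per key.
WITH keyed AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY LOWER(TRIM(name)), price, category  -- normalized match key
      ORDER BY id ASC                                   -- survivor: lowest id
    ) AS rn
  FROM products_raw
)
SELECT id, name, price, category
FROM keyed
WHERE rn = 1;
```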

Practice Problems

Five problems that test deduplication in realistic scenarios. From basic DISTINCT to multi-step sessionization.

Problem 1 (Easy): Dedup Customer Records (Keep Most Recent)

A customers table has duplicate rows due to a broken ETL job. Each row has (customer_id, name, email, phone, updated_at). Some customers have 2-5 duplicate rows with different updated_at values. Write a query that returns exactly one row per customer: the one with the latest updated_at.

Hint

Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn, then filter WHERE rn = 1. This is the standard pattern and works in every dialect.

Problem 2 (Medium): Daily Active Users Without Double-Counting

An events table has (user_id, event_type, event_timestamp). A user who triggers 50 events in a day should count as 1 DAU. Write a query that returns the DAU count for each day in the last 30 days. Handle days with zero activity.

Hint

COUNT(DISTINCT user_id) grouped by the event date; note that DATE_TRUNC argument order varies by dialect (DATE_TRUNC('day', event_timestamp) in Postgres, DATE_TRUNC(event_timestamp, DAY) in BigQuery). For zero-activity days, generate a date series and LEFT JOIN the activity counts. COALESCE the count to 0.
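A sketch of this approach in Postgres syntax (generate_series is Postgres-specific; other warehouses have their own date-array functions):

```sql
WITH days AS (
  SELECT d::date AS day
  FROM generate_series(
    CURRENT_DATE - INTERVAL '29 days', CURRENT_DATE, INTERVAL '1 day'
  ) AS d
),
dau AS (
  SELECT DATE_TRUNC('day', event_timestamp)::date AS day,
         COUNT(DISTINCT user_id) AS active_users
  FROM events
  GROUP BY 1
)
SELECT days.day,
       COALESCE(dau.active_users, 0) AS dau  -- zero-activity days show as 0
FROM days
LEFT JOIN dau ON dau.day = days.day
ORDER BY days.day;
```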

Problem 3 (Hard): Find and Remove Near-Duplicates

A products table has (id, name, price, category). Some products are near-duplicates: same category, same price, and names that differ only in casing or trailing whitespace. Write a query that identifies all near-duplicate groups and keeps only the row with the lowest id.

Hint

Normalize names with LOWER(TRIM(name)). Group by (LOWER(TRIM(name)), price, category). Use ROW_NUMBER within each group ordered by id ASC. Rows where rn > 1 are the duplicates to remove.

Problem 4 (Medium): Distinct Products Per Customer Segment

Given orders (order_id, customer_id, product_id) and customers (customer_id, segment), calculate the number of distinct products purchased by each customer segment. Then find which product is purchased by the most segments.

Hint

First query: COUNT(DISTINCT product_id) grouped by segment. Second query: COUNT(DISTINCT segment) grouped by product_id, ordered DESC, LIMIT 1. Two straightforward COUNT(DISTINCT) queries.

Problem 5 (Hard): Sessionize Events with Dedup

An events table has (user_id, event_type, event_timestamp). Events within 30 minutes of each other belong to the same session. Some events are exact duplicates (same user, type, and timestamp). First deduplicate exact duplicates, then assign session IDs, then return session_id, user_id, session_start, session_end, and event_count per session.

Hint

Step 1: DISTINCT or GROUP BY to remove exact dupes. Step 2: Use LAG(event_timestamp) to find gaps > 30 minutes. Step 3: SUM a flag (1 when gap > 30 min, 0 otherwise) as a running total to create session IDs. This is the gaps-and-islands technique.
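The three steps can be sketched as one query (Postgres-style interval arithmetic; adjust for your dialect):

```sql
WITH deduped AS (
  SELECT DISTINCT user_id, event_type, event_timestamp
  FROM events                                   -- step 1: remove exact dupes
),
flagged AS (
  SELECT
    *,
    CASE WHEN event_timestamp
           - LAG(event_timestamp) OVER (
               PARTITION BY user_id ORDER BY event_timestamp
             ) > INTERVAL '30 minutes'
         THEN 1 ELSE 0 END AS new_session       -- step 2: flag gaps > 30 min
  FROM deduped
),
sessions AS (
  SELECT
    *,
    SUM(new_session) OVER (
      PARTITION BY user_id ORDER BY event_timestamp
    ) AS session_id                             -- step 3: running total = session id
  FROM flagged
)
SELECT session_id, user_id,
       MIN(event_timestamp) AS session_start,
       MAX(event_timestamp) AS session_end,
       COUNT(*) AS event_count
FROM sessions
GROUP BY session_id, user_id;
```

The first event per user has a NULL LAG, so the CASE yields 0 and the user's first session gets id 0; session_id is only unique within a user here, so prefix it with user_id if a globally unique id is needed.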

Worked Example: Deduplicate Customer Records With ROW_NUMBER

A broken ETL job duplicated rows in the customers table. Each customer may have 1-5 rows with different updated_at timestamps. Keep only the most recent row per customer.

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM customers
)
SELECT customer_id, name, email, phone, updated_at
FROM ranked
WHERE rn = 1
ORDER BY customer_id;

ROW_NUMBER assigns 1 to the most recent row within each customer_id group (because of ORDER BY updated_at DESC). Filtering WHERE rn = 1 keeps exactly one row per customer. This pattern works in every SQL dialect and gives you full control over which row survives. Unlike DISTINCT, you keep all columns from the winning row without needing to aggregate anything.

Expected output

 customer_id | name         | email             | phone      | updated_at
-------------+--------------+-------------------+------------+---------------------
           1 | Alice Chen   | alice@example.com | 5551234567 | 2024-03-15 09:22:00
           2 | Bob Park     | bob@newmail.com   | 5559876543 | 2024-03-14 16:45:00
           3 | Carol Jones  | carol@work.io     | 5555551212 | 2024-03-12 11:03:00

Deduplication FAQ

What is the difference between DISTINCT and GROUP BY for deduplication?
DISTINCT removes duplicate rows from the result. GROUP BY collapses rows and lets you apply aggregate functions (MAX, MIN, COUNT) to the grouped rows. For simple dedup where you just want unique rows, both produce the same result. For dedup where you need to pick specific values from the duplicates (like the most recent timestamp), GROUP BY with aggregates or ROW_NUMBER is required.
Is DISTINCT ON available outside of Postgres?
No. DISTINCT ON is a Postgres extension not part of the SQL standard. BigQuery, Snowflake, MySQL, and SQL Server do not support it. The equivalent in other dialects is the ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) pattern with a CTE or subquery. In interviews, use ROW_NUMBER unless the interviewer specifically says Postgres.
How does COUNT(DISTINCT) handle NULLs?
COUNT(DISTINCT column) ignores NULL values entirely. If a column has values [1, 2, NULL, 2, NULL], COUNT(DISTINCT column) returns 2 (the values 1 and 2). COUNT(*) would return 5. This is consistent with how column aggregates in SQL treat NULLs: they skip them (COUNT(*) is the exception because it counts rows, not values). If you need to count NULLs as a distinct value, use COUNT(DISTINCT COALESCE(column, sentinel_value)).
What is the most efficient way to deduplicate a large table?
For Postgres, DISTINCT ON is typically fastest because the optimizer can use an index scan with early termination. For other dialects, ROW_NUMBER with a CTE is the standard approach. GROUP BY with MAX/MIN can be faster than ROW_NUMBER when you only need a few aggregate columns. For very large tables, consider materializing the deduped result into a new table rather than deduplicating on every query. Always check the query plan with EXPLAIN ANALYZE.

Master Every Dedup Technique

Duplicate data shows up in every pipeline and every interview. Practice DISTINCT, ROW_NUMBER dedup, and COUNT(DISTINCT) with real SQL execution.