SQL Interview Deep Dive
DISTINCT is not a topic interviewers test in isolation. It is a tool you reach for when a JOIN fans out or when you need to deduplicate before aggregating. If your query needs DISTINCT, ask yourself whether the JOIN is wrong first.
Duplicate data is the most common data quality issue. Know every way to handle it.
Five approaches to handling duplicates, from simple to advanced. Each has different tradeoffs in portability, performance, and flexibility.
SELECT DISTINCT city, state FROM customers

Removes duplicate rows from the result set. Operates on all selected columns together, not just the first one. SELECT DISTINCT city, state returns unique (city, state) pairs, not unique cities and unique states separately.
Interview tip: DISTINCT applies to the entire row, not individual columns. A common mistake is thinking SELECT DISTINCT a, b deduplicates on column a alone. It deduplicates on the combination of (a, b).
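A runnable sketch of this pitfall, using SQLite through Python's sqlite3 module; the table and values are hypothetical:

```python
import sqlite3

# Hypothetical customers data: two cities, three unique (city, state) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, state TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("Springfield", "IL"),
    ("Springfield", "IL"),   # exact duplicate pair: collapsed by DISTINCT
    ("Springfield", "MA"),   # same city, different state: kept
    ("Portland", "OR"),
])
pairs = conn.execute(
    "SELECT DISTINCT city, state FROM customers ORDER BY city, state"
).fetchall()
# Three pairs survive even though only two distinct cities appear,
# because DISTINCT deduplicates on the (city, state) combination.
print(pairs)
```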
SELECT DISTINCT ON (user_id) user_id, event_type, created_at
FROM events
ORDER BY user_id, created_at DESC

Postgres-specific. Returns the first row for each unique value of the specified columns, based on the ORDER BY. This is often the fastest and most concise way to get 'the most recent row per group' in Postgres. Not available in BigQuery, Snowflake, or MySQL.
Interview tip: DISTINCT ON requires that the ORDER BY starts with the same columns as the DISTINCT ON clause. The remaining ORDER BY columns determine which row is 'first.' If you use this in an interview, mention it is Postgres-only.
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY updated_at DESC
) AS rn
FROM users
)
SELECT * FROM ranked WHERE rn = 1

The most portable deduplication pattern. Works in every SQL dialect. Assign a row number within each group, ordered by your preference (most recent, highest priority, etc.), then filter to rn = 1.
Interview tip: This is the go-to dedup pattern for interviews. It works everywhere, it is explicit about which row you keep, and it handles ties predictably (ROW_NUMBER always picks one). If you only memorize one dedup technique, make it this one.
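The pattern can be exercised end to end against SQLite (3.25+ for window functions) via Python; the users rows here are made up for illustration:

```python
import sqlite3

# Hypothetical users table with a duplicated user_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "old@example.com", "2024-01-01"),
    (1, "new@example.com", "2024-03-01"),  # most recent row for user 1
    (2, "bob@example.com", "2024-02-15"),
])
rows = conn.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY updated_at DESC
        ) AS rn
        FROM users
    )
    SELECT user_id, email FROM ranked WHERE rn = 1 ORDER BY user_id
""").fetchall()
print(rows)  # one row per user; the most recent email survives
```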
SELECT user_id, MAX(updated_at) AS latest_update, MAX(email) AS email
FROM users
GROUP BY user_id

Uses aggregation to collapse duplicates. Works well when you need a few specific columns and can aggregate the rest meaningfully (MAX for most recent, MIN for earliest, ANY_VALUE for identical values).
Interview tip: GROUP BY dedup is simpler than ROW_NUMBER but less flexible. You cannot easily get 'the email from the most recent row' without a subquery. If the interviewer asks for all columns from the best row, use ROW_NUMBER instead.
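The caveat is easy to demonstrate: MAX is applied per column, so the returned email need not come from the most recent row. A SQLite sketch with hypothetical data:

```python
import sqlite3

# Hypothetical users rows where the alphabetically-largest email
# is NOT on the most recently updated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "zed@example.com", "2024-01-01"),  # older row, but 'z' sorts last
    (1, "amy@example.com", "2024-03-01"),  # most recent row
])
row = conn.execute("""
    SELECT user_id, MAX(updated_at) AS latest_update, MAX(email) AS email
    FROM users
    GROUP BY user_id
""").fetchone()
# latest_update comes from the March row, but the email comes from the
# January row: the two MAX values are computed independently.
print(row)
```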
SELECT COUNT(DISTINCT user_id) AS unique_users FROM events

Counts unique values in a column, ignoring NULLs. This is the standard way to calculate unique counts (unique visitors, unique products sold, etc.). Can be expensive on large datasets because it requires sorting or hashing all values.
Interview tip: COUNT(DISTINCT x) ignores NULLs. If NULLs represent something meaningful (like anonymous users), you need to handle them separately. Also, COUNT(DISTINCT x, y) is not valid in most dialects (MySQL is a notable exception). Use COUNT(DISTINCT CONCAT(x, '|', y)) with a separator so pairs like ('ab', 'c') and ('a', 'bc') do not collide, or count over a subquery.
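The NULL behavior and the separate-handling idea can both be shown in one query; this SQLite sketch treats a NULL user_id as a hypothetical anonymous visitor:

```python
import sqlite3

# Hypothetical events: user 1 appears twice, and two events are anonymous (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [(1,), (1,), (2,), (None,), (None,)])
unique_users, anonymous_events = conn.execute("""
    SELECT COUNT(DISTINCT user_id),                        -- NULLs are skipped
           SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) -- counted explicitly
    FROM events
""").fetchone()
print(unique_users, anonymous_events)  # 2 known users, 2 anonymous events
```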
In production, deduplication is not a one-time fix. It is a recurring operation that runs every time your pipeline ingests data. The right approach depends on where duplicates come from and how often they appear.
At-least-once delivery. Most event streaming systems (Kafka, Pub/Sub, Kinesis) guarantee at-least-once delivery, which means duplicates are expected. The standard fix: assign each event a unique ID at the source and dedup on that ID during ingestion. ROW_NUMBER partitioned by event_id, ordered by ingestion_timestamp, keep rn = 1.
Late-arriving data. When the same entity arrives with different timestamps (a user profile updated multiple times), you need to keep the most recent version. This is a slowly changing dimension (SCD Type 1) pattern. ROW_NUMBER partitioned by entity_id, ordered by updated_at DESC, keep rn = 1. Run this in your transformation layer (dbt, Spark, or a scheduled SQL job).
Near-duplicates. The hardest case. Records that represent the same entity but have slightly different values (name casing, whitespace, abbreviations). No single SQL function solves this. You need a normalization step (LOWER, TRIM, REPLACE) to create a matching key, then dedup on that key. In data engineering interviews, describing this pipeline shows real-world experience.
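A minimal sketch of that pipeline in SQLite via Python, with a made-up products table: normalize to a matching key, then apply the usual ROW_NUMBER dedup on the key.

```python
import sqlite3

# Hypothetical near-duplicates: id 2 differs from id 1 only in casing
# and trailing whitespace.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [
    (1, "Widget"),
    (2, "widget "),
    (3, "Gadget"),
])
survivors = conn.execute("""
    WITH keyed AS (
        SELECT *, ROW_NUMBER() OVER (
            -- LOWER(TRIM(...)) builds the normalized matching key
            PARTITION BY LOWER(TRIM(name)) ORDER BY id
        ) AS rn
        FROM products
    )
    SELECT id, name FROM keyed WHERE rn = 1 ORDER BY id
""").fetchall()
print(survivors)  # ids 1 and 3 survive; id 2 is flagged as a near-duplicate
```

Real pipelines usually need more aggressive normalization (REPLACE for punctuation, abbreviation maps, sometimes fuzzy matching), but the key-then-dedup shape stays the same.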
Five problems that test deduplication in realistic scenarios. From basic DISTINCT to multi-step sessionization.
A customers table has duplicate rows due to a broken ETL job. Each row has (customer_id, name, email, phone, updated_at). Some customers have 2-5 duplicate rows with different updated_at values. Write a query that returns exactly one row per customer: the one with the latest updated_at.
Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn, then filter WHERE rn = 1. This is the standard pattern and works in every dialect.
An events table has (user_id, event_type, event_timestamp). A user who triggers 50 events in a day should count as 1 DAU. Write a query that returns the DAU count for each day in the last 30 days. Handle days with zero activity.
COUNT(DISTINCT user_id) grouped by the event date (DATE_TRUNC('day', event_timestamp) in Postgres; DATE_TRUNC(event_timestamp, DAY) in BigQuery). For zero-activity days, generate a date series and LEFT JOIN the events onto it: COUNT(DISTINCT user_id) over an all-NULL group already yields 0, or COALESCE to 0 if you join pre-aggregated counts.
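A compressed, runnable version of this answer in SQLite (recursive CTE standing in for a date-series function; three hypothetical days instead of 30):

```python
import sqlite3

# Hypothetical events: user 1 fires twice on 03-01 (still 1 DAU),
# and nothing at all happens on 03-02.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_timestamp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2024-03-01 08:00:00"),
    (1, "2024-03-01 09:00:00"),
    (2, "2024-03-01 10:00:00"),
    (1, "2024-03-03 12:00:00"),
])
dau = conn.execute("""
    WITH RECURSIVE days(day) AS (        -- generate the date series
        SELECT '2024-03-01'
        UNION ALL
        SELECT DATE(day, '+1 day') FROM days WHERE day < '2024-03-03'
    )
    SELECT days.day, COUNT(DISTINCT e.user_id) AS dau
    FROM days
    LEFT JOIN events e ON DATE(e.event_timestamp) = days.day
    GROUP BY days.day
    ORDER BY days.day
""").fetchall()
print(dau)  # the zero-activity day shows up with 0
```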
A products table has (id, name, price, category). Some products are near-duplicates: same category, same price, and names that differ only in casing or trailing whitespace. Write a query that identifies all near-duplicate groups and keeps only the row with the lowest id.
Normalize names with LOWER(TRIM(name)). Group by (LOWER(TRIM(name)), price, category). Use ROW_NUMBER within each group ordered by id ASC. Rows where rn > 1 are the duplicates to remove.
Given orders (order_id, customer_id, product_id) and customers (customer_id, segment), calculate the number of distinct products purchased by each customer segment. Then find which product is purchased by the most segments.
First query: COUNT(DISTINCT product_id) grouped by segment. Second query: COUNT(DISTINCT segment) grouped by product_id, ordered DESC, LIMIT 1. Two straightforward COUNT(DISTINCT) queries.
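Both queries, run against tiny hypothetical tables in SQLite via Python:

```python
import sqlite3

# Hypothetical data: product 100 is bought by both segments, product 200 by one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, product_id INTEGER)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, segment TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "retail"), (2, "retail"), (3, "enterprise")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (10, 1, 100), (11, 1, 100),  # same product twice: counted once
    (12, 2, 200),
    (13, 3, 100),
])
per_segment = conn.execute("""
    SELECT c.segment, COUNT(DISTINCT o.product_id)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()
top_product = conn.execute("""
    SELECT o.product_id, COUNT(DISTINCT c.segment) AS seg_count
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY o.product_id
    ORDER BY seg_count DESC
    LIMIT 1
""").fetchone()
print(per_segment, top_product)
```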
An events table has (user_id, event_type, event_timestamp). Events within 30 minutes of each other belong to the same session. Some events are exact duplicates (same user, type, and timestamp). First deduplicate exact duplicates, then assign session IDs, then return session_id, user_id, session_start, session_end, and event_count per session.
Step 1: DISTINCT or GROUP BY to remove exact dupes. Step 2: Use LAG(event_timestamp) to find gaps > 30 minutes. Step 3: SUM a flag (1 when gap > 30 min, 0 otherwise) as a running total to create session IDs. This is the gaps-and-islands technique.
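The three steps above can be sketched as one pipeline of CTEs. This SQLite version (hypothetical events, julianday arithmetic standing in for dialect-specific interval math) dedupes, flags gaps over 30 minutes with LAG, and turns the running SUM of flags into session IDs:

```python
import sqlite3

# Hypothetical events for one user: an exact duplicate at 09:00,
# then a 50-minute gap before 10:00 that starts a new session.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, event_timestamp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "click", "2024-03-01 09:00:00"),
    (1, "click", "2024-03-01 09:00:00"),  # exact duplicate
    (1, "view",  "2024-03-01 09:10:00"),
    (1, "click", "2024-03-01 10:00:00"),  # gap > 30 min: new session
])
sessions = conn.execute("""
    WITH deduped AS (        -- Step 1: remove exact duplicates
        SELECT DISTINCT user_id, event_type, event_timestamp FROM events
    ),
    flagged AS (             -- Step 2: flag gaps > 30 minutes (30/1440 of a day)
        SELECT *,
            CASE WHEN julianday(event_timestamp)
                    - julianday(LAG(event_timestamp) OVER (
                        PARTITION BY user_id ORDER BY event_timestamp
                      )) > 30.0 / 1440 THEN 1 ELSE 0 END AS new_session
        FROM deduped
    ),
    numbered AS (            -- Step 3: running SUM of flags = session id
        SELECT *, SUM(new_session) OVER (
            PARTITION BY user_id ORDER BY event_timestamp
        ) AS session_id
        FROM flagged
    )
    SELECT session_id, user_id,
           MIN(event_timestamp) AS session_start,
           MAX(event_timestamp) AS session_end,
           COUNT(*) AS event_count
    FROM numbered
    GROUP BY user_id, session_id
    ORDER BY user_id, session_id
""").fetchall()
print(sessions)  # two sessions: one with 2 events, one with 1
```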
A broken ETL job duplicated rows in the customers table. Each customer may have 1-5 rows with different updated_at timestamps. Keep only the most recent row per customer.
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC
) AS rn
FROM customers
)
SELECT customer_id, name, email, phone, updated_at
FROM ranked
WHERE rn = 1
ORDER BY customer_id;

ROW_NUMBER assigns 1 to the most recent row within each customer_id group (because of ORDER BY updated_at DESC). Filtering WHERE rn = 1 keeps exactly one row per customer. This pattern works in every SQL dialect and gives you full control over which row survives. Unlike DISTINCT, you keep all columns from the winning row without needing to aggregate anything.
Expected output
customer_id | name | email | phone | updated_at
-------------+--------------+-------------------+------------+---------------------
1 | Alice Chen | alice@example.com | 5551234567 | 2024-03-15 09:22:00
2 | Bob Park | bob@newmail.com | 5559876543 | 2024-03-14 16:45:00
3 | Carol Jones | carol@work.io | 5555551212 | 2024-03-12 11:03:00

Duplicate data shows up in every pipeline and every interview. Practice DISTINCT, ROW_NUMBER dedup, and COUNT(DISTINCT) with real SQL execution.