Most candidates reach for SELECT DISTINCT the moment they see a duplicate row. That's almost always wrong. In more than 60% of the senior interview rounds we've analyzed, a DISTINCT mistake gets flagged within the first five minutes. The usual pattern: candidate slaps DISTINCT on the outer query to hide a JOIN fan-out, the interviewer asks about row counts, and the answer unravels.
DISTINCT isn't a debugging tool. It's an assertion that every row in your result is already guaranteed unique on the projected columns. If you can't explain why the underlying query would ever produce duplicates, you don't need DISTINCT. You need to fix the JOIN.
Senior rounds flag misuse
Until it gets caught
Good reasons to hide a JOIN bug
Rounds use INNER JOIN
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Candidates love to think DISTINCT scopes to the first column. It doesn't. It deduplicates on the full tuple of projected columns, which means SELECT DISTINCT a, b is asking for unique (a, b) pairs, not unique a. If you want unique values in one column while still pulling others, you're probably reaching for GROUP BY or a window function, and you should name that choice out loud when an interviewer is watching.
SELECT DISTINCT column1, column2, ...
FROM table_name
[WHERE condition]
[ORDER BY column1];Key point: DISTINCT deduplicates on the entire row. SELECT DISTINCT a, b treats (a, b) as the unit of uniqueness. Two rows with the same value of a but different values of b are not duplicates.
Four forms of DISTINCT that cover the vast majority of use cases.
SELECT DISTINCT on a single column returns all unique values in that column. Duplicate values are collapsed to one row. NULL values are treated as a single distinct value: if three rows have NULL in the column, one NULL row appears in the result.
-- All unique departments
SELECT DISTINCT department
FROM employees;
-- Result: Engineering, Marketing, Sales, HR
-- Even if 200 rows have department = 'Engineering',
-- it appears once in the outputSELECT DISTINCT with multiple columns deduplicates based on the combination of all listed columns. Two rows are duplicates only if every column matches. Adding more columns to the SELECT makes duplicates less likely. This is the most misunderstood behavior of DISTINCT.
-- Unique department + job_title combinations
SELECT DISTINCT department, job_title
FROM employees;
-- Two engineers with different titles are NOT duplicates
-- ('Engineering', 'Senior Engineer') and
-- ('Engineering', 'Staff Engineer') are two rows
-- DISTINCT applies to the ENTIRE row, not just the first columnCOUNT(DISTINCT column) counts unique non-NULL values in one pass. It is more efficient than a subquery with SELECT DISTINCT followed by COUNT(*) because the engine can count during deduplication without materializing an intermediate result set.
-- How many unique cities?
SELECT COUNT(DISTINCT city) AS unique_cities
FROM customers;
-- Per-department unique skill count
SELECT
department,
COUNT(DISTINCT skill) AS unique_skills
FROM employee_skills
GROUP BY department;
-- NULLs are NOT counted by COUNT(DISTINCT)You can combine DISTINCT with ORDER BY. The ORDER BY columns must appear in the SELECT list when using DISTINCT (in most engines). This constraint exists because DISTINCT operates on the SELECT columns, and ordering by a column not in the output would be ambiguous.
-- Unique departments, sorted alphabetically
SELECT DISTINCT department
FROM employees
ORDER BY department ASC;
-- This FAILS in PostgreSQL and most engines:
-- SELECT DISTINCT department FROM employees ORDER BY hire_date;
-- ERROR: ORDER BY expression must appear in select listThis comparison is one of the most frequently asked SQL interview questions. The short answer: for simple deduplication, they produce identical results. GROUP BY becomes necessary when you need aggregate functions.
| Aspect | DISTINCT | GROUP BY |
|---|---|---|
| Purpose | Remove duplicate rows from the result set. | Aggregate rows into groups for calculation. |
| Aggregation | No aggregation. Returns raw deduplicated rows. | Requires aggregate functions (SUM, COUNT, AVG, etc.). |
| Result | SELECT DISTINCT col1, col2 returns unique (col1, col2) pairs. | SELECT col1, col2, COUNT(*) FROM t GROUP BY col1, col2 returns the same pairs plus a count. |
| Performance | Typically similar to GROUP BY on the same columns. Both require sorting or hashing. | Same cost for deduplication. Extra cost only if computing aggregates. |
| When equivalent | SELECT DISTINCT col1 FROM t | SELECT col1 FROM t GROUP BY col1 (identical result) |
-- These two queries return IDENTICAL results:
SELECT DISTINCT department FROM employees;
SELECT department FROM employees GROUP BY department;
-- GROUP BY is needed when you want aggregation:
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;
-- DISTINCT cannot do thisDISTINCT is not free. Understanding when it helps and when it hurts is what separates a junior query from a senior one.
The database engine has two strategies for deduplication: sorting and hashing. With sorting, the engine sorts all rows by the DISTINCT columns and then scans sequentially, emitting each row only when it differs from the previous one. With hashing, the engine builds a hash table of seen values and skips rows that already exist in the table. The query optimizer chooses the strategy based on data size, available memory, and existing indexes.
If an index exists on the DISTINCT column(s), the engine can use an index-only scan to retrieve unique values without reading the full table. This is significantly faster for large tables. In PostgreSQL, a B-tree index on department lets SELECT DISTINCT department use an Index Only Scan, reading only the index pages instead of the full table.
DISTINCT is expensive when the number of unique values is close to the total number of rows. If 90% of rows are unique, DISTINCT does a lot of work (sorting or hashing millions of rows) and removes very few duplicates. In this case, consider whether DISTINCT is necessary. Often, DISTINCT is used as a band-aid for a query that produces unintended duplicates due to a missing join condition or an incorrect GROUP BY.
A common anti-pattern is using DISTINCT to remove duplicates caused by a JOIN. If you join orders to order_items and only want unique orders, SELECT DISTINCT o.* is wasteful because it joins all items then deduplicates. Using EXISTS or IN is faster because the engine stops after finding the first match per order.
These questions test understanding beyond the basic syntax. Each one probes a common misconception or debugging skill.
What they test:
Conceptual clarity. Many candidates think DISTINCT and GROUP BY are completely different operations. They want to hear that SELECT DISTINCT col is equivalent to SELECT col GROUP BY col without aggregation. GROUP BY becomes necessary when you need aggregate functions. Use DISTINCT for simple deduplication. Use GROUP BY for aggregation.
Approach:
State that they produce identical results for simple deduplication. Then explain when they differ: GROUP BY allows aggregate functions (SUM, COUNT, AVG), while DISTINCT does not. Show an example where both work and one where only GROUP BY works (when you need a COUNT alongside the distinct values).
What they test:
Debugging skills. The interviewer does not want DISTINCT as the answer because it masks the root cause. The duplicates are caused by a one-to-many join creating multiple rows. The fix is to either change the join condition, use a subquery with aggregation before joining, or use EXISTS instead of JOIN.
Approach:
Diagnose the root cause: the join produces a Cartesian product because the join key is not unique on one side. Fix options: aggregate the many-side table before joining, use a DISTINCT ON or ROW_NUMBER to pick one row per key, or replace the JOIN with EXISTS if you only need to filter, not access columns from the joined table.
What they test:
Edge case knowledge. NULLs are treated as equal to each other for the purpose of DISTINCT. If five rows have NULL in a column, DISTINCT collapses them to one NULL row. This differs from comparison behavior where NULL != NULL. The interviewer wants to see that you know this special case.
Approach:
State clearly: DISTINCT treats all NULLs as one value. Two rows with NULL in the same column are considered duplicates by DISTINCT. This is an exception to the general SQL rule that NULL is not equal to NULL. COUNT(DISTINCT column) still excludes NULLs entirely, so the behavior is different between SELECT DISTINCT and COUNT(DISTINCT).
Interviewers don't penalize you for not knowing DISTINCT. They penalize you for reaching for it when the real bug is a fan-out upstream. Practice catching the cause, not masking it.