SELECT DISTINCT in SQL

Most candidates reach for SELECT DISTINCT the moment they see a duplicate row. That's almost always wrong. In more than 60% of the senior interview rounds we've analyzed, a DISTINCT mistake gets flagged within the first five minutes. The usual pattern: candidate slaps DISTINCT on the outer query to hide a JOIN fan-out, the interviewer asks about row counts, and the answer unravels.

60%+

Senior rounds flag misuse

5 min

Until it gets caught

Good reasons to hide a JOIN bug

29%

Rounds use INNER JOIN

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Prepare for the interview

01 / Open invite

02min.

Know SELECT DISTINCT the way the interviewer who asks it knows it.

a SELECT DISTINCT query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1SELECT user_id,

2 COUNT(*) AS sessions

3FROM events

4WHERE ts >= NOW() - INTERVAL '7 day'

Execute your solution0.4s avg.

ShopifyInterview question

Solve a SELECT DISTINCT problem

SELECT DISTINCT Syntax

Candidates love to think DISTINCT scopes to the first column. It doesn't. It deduplicates on the full tuple of projected columns, which means SELECT DISTINCT a, b is asking for unique (a, b) pairs, not unique a. If you want unique values in one column while still pulling others, you're probably reaching for GROUP BY or a window function, and you should name that choice out loud when an interviewer is watching.

SELECT DISTINCT column1, column2, ...
FROM table_name
[WHERE condition]
[ORDER BY column1];

Key point: DISTINCT deduplicates on the entire row. SELECT DISTINCT a, b treats (a, b) as the unit of uniqueness. Two rows with the same value of a but different values of b are not duplicates.

Content Type Distribution

> The editorial team wants a quick census of the content library. How many items exist for each content type? List them alphabetically.

SELECT DISTINCT Examples

Four forms of DISTINCT that cover the vast majority of use cases.

Single Column

SELECT DISTINCT on a single column returns all unique values in that column. Duplicate values are collapsed to one row. NULL values are treated as a single distinct value: if three rows have NULL in the column, one NULL row appears in the result.

-- All unique departments
SELECT DISTINCT department
FROM employees;

-- Result: Engineering, Marketing, Sales, HR
-- Even if 200 rows have department = 'Engineering',
-- it appears once in the output

Multiple Columns

SELECT DISTINCT with multiple columns deduplicates based on the combination of all listed columns. Two rows are duplicates only if every column matches. Adding more columns to the SELECT makes duplicates less likely. This is the most misunderstood behavior of DISTINCT.

-- Unique department + job_title combinations
SELECT DISTINCT department, job_title
FROM employees;

-- Two engineers with different titles are NOT duplicates
-- ('Engineering', 'Senior Engineer') and
-- ('Engineering', 'Staff Engineer') are two rows

-- DISTINCT applies to the ENTIRE row, not just the first column

COUNT(DISTINCT)

COUNT(DISTINCT column) counts unique non-NULL values in one pass. It is more efficient than a subquery with SELECT DISTINCT followed by COUNT(*) because the engine can count during deduplication without materializing an intermediate result set.

-- How many unique cities?
SELECT COUNT(DISTINCT city) AS unique_cities
FROM customers;

-- Per-department unique skill count
SELECT
  department,
  COUNT(DISTINCT skill) AS unique_skills
FROM employee_skills
GROUP BY department;

-- NULLs are NOT counted by COUNT(DISTINCT)

DISTINCT with ORDER BY

You can combine DISTINCT with ORDER BY. The ORDER BY columns must appear in the SELECT list when using DISTINCT (in most engines). This constraint exists because DISTINCT operates on the SELECT columns, and ordering by a column not in the output would be ambiguous.

-- Unique departments, sorted alphabetically
SELECT DISTINCT department
FROM employees
ORDER BY department ASC;

-- This FAILS in PostgreSQL and most engines:
-- SELECT DISTINCT department FROM employees ORDER BY hire_date;
-- ERROR: ORDER BY expression must appear in select list

DISTINCT vs GROUP BY

This comparison is one of the most frequently asked SQL interview questions. The short answer: for simple deduplication, they produce identical results. GROUP BY becomes necessary when you need aggregate functions.

Aspect	DISTINCT	GROUP BY
Purpose	Remove duplicate rows from the result set.	Aggregate rows into groups for calculation.
Aggregation	No aggregation. Returns raw deduplicated rows.	Requires aggregate functions (SUM, COUNT, AVG, etc.).
Result	SELECT DISTINCT col1, col2 returns unique (col1, col2) pairs.	SELECT col1, col2, COUNT(*) FROM t GROUP BY col1, col2 returns the same pairs plus a count.
Performance	Typically similar to GROUP BY on the same columns. Both require sorting or hashing.	Same cost for deduplication. Extra cost only if computing aggregates.
When equivalent	SELECT DISTINCT col1 FROM t	SELECT col1 FROM t GROUP BY col1 (identical result)

-- These two queries return IDENTICAL results:
SELECT DISTINCT department FROM employees;
SELECT department FROM employees GROUP BY department;

-- GROUP BY is needed when you want aggregation:
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;
-- DISTINCT cannot do this

Performance Considerations

DISTINCT is not free. Understanding when it helps and when it hurts is what separates a junior query from a senior one.

How DISTINCT Works Under the Hood

The database engine has two strategies for deduplication: sorting and hashing. With sorting, the engine sorts all rows by the DISTINCT columns and then scans sequentially, emitting each row only when it differs from the previous one. With hashing, the engine builds a hash table of seen values and skips rows that already exist in the table. The query optimizer chooses the strategy based on data size, available memory, and existing indexes.

Indexes and DISTINCT

If an index exists on the DISTINCT column(s), the engine can use an index-only scan to retrieve unique values without reading the full table. This is significantly faster for large tables. In PostgreSQL, a B-tree index on department lets SELECT DISTINCT department use an Index Only Scan, reading only the index pages instead of the full table.

DISTINCT on Large Result Sets

DISTINCT is expensive when the number of unique values is close to the total number of rows. If 90% of rows are unique, DISTINCT does a lot of work (sorting or hashing millions of rows) and removes very few duplicates. In this case, consider whether DISTINCT is necessary. Often, DISTINCT is used as a band-aid for a query that produces unintended duplicates due to a missing join condition or an incorrect GROUP BY.

DISTINCT vs EXISTS for Semi-Joins

A common anti-pattern is using DISTINCT to remove duplicates caused by a JOIN. If you join orders to order_items and only want unique orders, SELECT DISTINCT o.* is wasteful because it joins all items then deduplicates. Using EXISTS or IN is faster because the engine stops after finding the first match per order.

3 SELECT DISTINCT Interview Questions

Q1: What is the difference between SELECT DISTINCT and GROUP BY? When would you use each?

What they test: Conceptual clarity. Many candidates think DISTINCT and GROUP BY are completely different operations. They want to hear that SELECT DISTINCT col is equivalent to SELECT col GROUP BY col without aggregation. GROUP BY becomes necessary when you need aggregate functions. Use DISTINCT for simple deduplication. Use GROUP BY for aggregation. Approach: State that they produce identical results for simple deduplication. Then explain when they differ: GROUP BY allows aggregate functions (SUM, COUNT, AVG), while DISTINCT does not. Show an example where both work and one where only GROUP BY works (when you need a COUNT alongside the distinct values).

Q2: A query returns duplicate rows after adding a JOIN. How do you fix it without using DISTINCT?

What they test: Debugging skills. The interviewer does not want DISTINCT as the answer because it masks the root cause. The duplicates are caused by a one-to-many join creating multiple rows. The fix is to either change the join condition, use a subquery with aggregation before joining, or use EXISTS instead of JOIN. Approach: Diagnose the root cause: the join produces a Cartesian product because the join key is not unique on one side. Fix options: aggregate the many-side table before joining, use a DISTINCT ON or ROW_NUMBER to pick one row per key, or replace the JOIN with EXISTS if you only need to filter, not access columns from the joined table.

Q3: How does SELECT DISTINCT handle NULL values?

What they test: Edge case knowledge. NULLs are treated as equal to each other for the purpose of DISTINCT. If five rows have NULL in a column, DISTINCT collapses them to one NULL row. This differs from comparison behavior where NULL != NULL. The interviewer wants to see that you know this special case. Approach: State clearly: DISTINCT treats all NULLs as one value. Two rows with NULL in the same column are considered duplicates by DISTINCT. This is an exception to the general SQL rule that NULL is not equal to NULL. COUNT(DISTINCT column) still excludes NULLs entirely, so the behavior is different between SELECT DISTINCT and COUNT(DISTINCT).

Frequently asked questions

What does SELECT DISTINCT do in SQL?+

SELECT DISTINCT removes duplicate rows from the query result set. It evaluates all columns in the SELECT list and returns only rows where the combination of values is unique. If you SELECT DISTINCT department, job_title, two rows are duplicates only if both department AND job_title match. DISTINCT applies to the entire row, not just the first column. It works in every SQL database: PostgreSQL, MySQL, SQL Server, Snowflake, BigQuery, and others.

Does SELECT DISTINCT affect performance?+

Yes. DISTINCT requires the engine to sort or hash the result set to identify duplicates. For small result sets, the overhead is negligible. For large result sets (millions of rows), it can be significant. If most rows are already unique, DISTINCT does expensive work for minimal benefit. Always check whether DISTINCT is actually needed or if it is masking an underlying query issue like a bad join. An index on the DISTINCT columns can dramatically improve performance.

How does DISTINCT work with multiple columns?+

DISTINCT evaluates all columns in the SELECT list together. SELECT DISTINCT col1, col2 returns unique (col1, col2) pairs. Two rows are duplicates only if both col1 and col2 match. Adding a third column (col1, col2, col3) means all three must match for a row to be considered a duplicate. This means adding more columns to SELECT DISTINCT typically returns more rows, not fewer.

02 / Why practice

Stop patching. Start diagnosing.

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Try a SQL Problem