SQL DISTINCT: SELECT and COUNT Patterns
Most candidates reach for SELECT DISTINCT the moment they see duplicates. The interviewer is checking whether you understand that DISTINCT is usually a symptom, not a cure. Duplicates mean your join fanned out, your grain is wrong, or your GROUP BY is missing a column. Slapping DISTINCT on top hides the underlying bug and masks it through every downstream aggregation.
Three Forms of DISTINCT
Know DISTINCT the way the interviewer who asks it knows it.
SELECT DISTINCT
SELECT DISTINCT removes duplicate rows from the result set. Uniqueness is determined by all columns in the SELECT list. If you select three columns, two rows are duplicates only if all three column values match. Adding more columns to SELECT makes duplicates less likely. This is the most basic form of deduplication in SQL.
-- Unique departments
SELECT DISTINCT department
FROM employees;
-- Unique department + title combinations
SELECT DISTINCT department, job_title
FROM employees;
-- Two employees in the same department with different titles
-- are NOT duplicates because the title column differsInterview note: DISTINCT applies to the entire row, not just the first column. SELECT DISTINCT a, b deduplicates on the (a, b) pair, not just on a. This misconception appears in interviews regularly.
COUNT(DISTINCT)
COUNT(DISTINCT column) counts the number of unique non-NULL values in a column. It combines deduplication and counting in one step. This is more efficient than a subquery with SELECT DISTINCT followed by COUNT(*), because the engine can count during the dedup pass without materializing the intermediate result.
-- How many unique cities do our customers live in?
SELECT COUNT(DISTINCT city) AS unique_cities
FROM customers;
-- Per-department unique skill count
SELECT
department,
COUNT(DISTINCT skill) AS unique_skills,
COUNT(*) AS total_employees
FROM employee_skills
GROUP BY department;
-- Multiple COUNT(DISTINCT) in one query
SELECT
COUNT(DISTINCT customer_id) AS unique_customers,
COUNT(DISTINCT product_id) AS unique_products,
COUNT(*) AS total_orders
FROM orders;Interview note: COUNT(DISTINCT column) skips NULLs. If a column has 100 rows with 10 NULLs and 5 distinct non-NULL values, COUNT(DISTINCT column) returns 5. Some engines allow COUNT(DISTINCT col1, col2) for multi-column uniqueness; PostgreSQL does not (use a subquery with SELECT DISTINCT instead).
DISTINCT ON (PostgreSQL)
DISTINCT ON is a PostgreSQL extension that keeps one row per unique value of the specified columns. Unlike SELECT DISTINCT, which deduplicates on all columns, DISTINCT ON deduplicates on a subset and lets you control which row survives via ORDER BY. This is PostgreSQL's shorthand for the ROW_NUMBER dedup pattern.
-- Keep the most recent order per customer (PostgreSQL only)
SELECT DISTINCT ON (customer_id)
customer_id,
order_id,
order_date,
amount
FROM orders
ORDER BY customer_id, order_date DESC;
-- Equivalent using ROW_NUMBER (works everywhere)
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM orders
) sub
WHERE rn = 1;Interview note: DISTINCT ON is PostgreSQL-specific. If the interview is engine-agnostic, use the ROW_NUMBER pattern instead. But if they mention PostgreSQL specifically, DISTINCT ON is more concise and shows engine awareness. The ORDER BY must start with the DISTINCT ON columns.
DISTINCT vs GROUP BY
| Scenario | DISTINCT | GROUP BY |
|---|---|---|
| Get unique values | SELECT DISTINCT department FROM employees | SELECT department FROM employees GROUP BY department |
| Count per group | Cannot add COUNT(*) to SELECT DISTINCT without GROUP BY | SELECT department, COUNT(*) FROM employees GROUP BY department |
| Dedup with aggregation | Subquery: SELECT DISTINCT ... then outer GROUP BY | Single-pass GROUP BY with aggregates |
DISTINCT and NULL Values
Hand-Holding at Scale
The catalog is vast. Find what fraction of it is built for beginners.
Pulled from debriefs where SQL was the gate.
-- These two queries produce identical results
SELECT DISTINCT department FROM employees;
SELECT department FROM employees GROUP BY department;
-- But only GROUP BY supports aggregation
SELECT department, COUNT(*), AVG(salary)
FROM employees
GROUP BY department;
-- DISTINCT cannot do this without a subquery-- Table: departments (name column)
-- 'Engineering', 'Marketing', NULL, NULL, 'Sales'
SELECT DISTINCT name FROM departments;
-- Result: 'Engineering', 'Marketing', NULL, 'Sales'
-- Two NULLs collapsed to one
-- COUNT(DISTINCT) skips NULLs entirely
SELECT COUNT(DISTINCT name) FROM departments;
-- Result: 3 (Engineering, Marketing, Sales)
-- NULL is not countedDISTINCT Performance
- Avoid unnecessary DISTINCT: If the query already produces unique rows (e.g., selecting a primary key column), DISTINCT adds overhead for no benefit. Check whether your joins and filters already guarantee uniqueness before adding DISTINCT.
- Fewer columns means faster dedup: SELECT DISTINCT on 2 narrow columns is much faster than SELECT DISTINCT on 10 wide columns. The sort key size directly affects sort performance and memory usage. Only select the columns you actually need.
- DISTINCT is often a code smell: If you find yourself adding DISTINCT to fix duplicate rows from a JOIN, the root cause is likely a fan-out in the join (joining on a non-unique key). Fix the join logic instead of masking the duplicates with DISTINCT. Interviewers notice this pattern and will ask about it.
4 DISTINCT Interview Questions
Q1: What is the difference between DISTINCT and GROUP BY? When would you use each?
What they test: Conceptual understanding. When used for deduplication alone (no aggregates), SELECT DISTINCT and GROUP BY produce identical results. The difference is intent and capability. DISTINCT says 'I want unique rows.' GROUP BY says 'I want to aggregate per group.' Interviewers check whether you know that GROUP BY can do everything DISTINCT can, plus aggregation. Approach: For unique values only: prefer DISTINCT for readability. For unique values with aggregation: GROUP BY is required. Explain that under the hood, many query optimizers treat SELECT DISTINCT and GROUP BY with no aggregates identically. The execution plan is often the same.
Q2: How does SELECT DISTINCT handle NULLs?
What they test: NULL equality rules. In most SQL contexts, NULL != NULL. But for DISTINCT, NULLs are treated as equal. If a column has multiple NULL values, SELECT DISTINCT keeps only one NULL row. This is a special case of the SQL standard that contradicts the general NULL inequality rule. The interviewer checks if you know this exception. Approach: State that DISTINCT treats NULLs as equal for deduplication purposes. Multiple NULLs collapse into one. This applies to SELECT DISTINCT, UNION (which uses DISTINCT logic), and GROUP BY. The standard defines this behavior explicitly.
Q3: A query with SELECT DISTINCT on 5 columns runs slowly. How would you investigate and optimize?
What they test: Performance troubleshooting. DISTINCT requires sorting or hashing the entire result set across all selected columns. Five wide columns means a large sort key. The interviewer wants to hear about checking the execution plan, asking whether all 5 columns are actually needed, considering indexes, and whether the dedup can be pushed closer to the source (like deduping on a key column first). Approach: First: check the execution plan for Sort or Hash Aggregate nodes. Second: reduce columns if possible (DISTINCT on 2 columns is cheaper than 5). Third: add an index if the DISTINCT columns are a common access pattern. Fourth: consider whether the data source already guarantees uniqueness (making DISTINCT unnecessary). Fifth: for very large tables, materialized views or pre-aggregated tables may be better.
Q4: Write a query to find customers who have ordered from at least 3 different product categories.
What they test: Combining COUNT(DISTINCT) with HAVING. The query needs to count distinct categories per customer and filter to those with 3+. This tests whether the candidate knows to use COUNT(DISTINCT category) inside a GROUP BY with a HAVING clause. Approach: SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(DISTINCT category) >= 3. The DISTINCT inside COUNT is essential. Without it, a customer who ordered 3 items from the same category would falsely qualify. Mention that COUNT(DISTINCT) skips NULLs, so orders without a category would not be counted.
Frequently asked questions
What does DISTINCT do in SQL?+
What is the difference between DISTINCT and GROUP BY?+
Does DISTINCT treat NULLs as equal?+
What is DISTINCT ON in PostgreSQL?+
Stop Reaching for DISTINCT First
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition