SQL Practice

SQL DISTINCT Explained

Most candidates reach for SELECT DISTINCT the moment they see duplicates. The interviewer is checking whether you understand that DISTINCT is usually a symptom, not a cure. Duplicates mean your join fanned out, your grain is wrong, or your GROUP BY is missing a column. Slapping DISTINCT on top hides the underlying bug and masks it through every downstream aggregation.

GROUP BY appears in 32% of verified DE SQL rounds, and a meaningful share of DISTINCT questions in our corpus are really GROUP BY questions in disguise. This page covers all three DISTINCT forms, the GROUP BY comparison, and the phone-screen moments where picking DISTINCT first is the tell that sinks the candidate.

32%

GROUP BY is the real answer

#1

Most misused SQL keyword

3

DISTINCT forms to know

4s

Interviewers spot the misuse

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Three Forms of DISTINCT

Three forms, three intents, and interviewers watch which one you reach for. SELECT DISTINCT deduplicates whole rows and usually signals a join mistake. COUNT(DISTINCT) counts unique values and is the legitimate use of the keyword. DISTINCT ON picks one row per group and is the Postgres-only shorthand that rewards candidates who know their engine.

SELECT DISTINCT

SELECT DISTINCT removes duplicate rows from the result set. Uniqueness is determined by all columns in the SELECT list. If you select three columns, two rows are duplicates only if all three column values match. Adding more columns to SELECT makes duplicates less likely. This is the most basic form of deduplication in SQL.

-- Unique departments
SELECT DISTINCT department
FROM employees;

-- Unique department + title combinations
SELECT DISTINCT department, job_title
FROM employees;

-- Two employees in the same department with different titles
-- are NOT duplicates because the title column differs

Interview note: DISTINCT applies to the entire row, not just the first column. SELECT DISTINCT a, b deduplicates on the (a, b) pair, not just on a. This misconception appears in interviews regularly.

COUNT(DISTINCT)

COUNT(DISTINCT column) counts the number of unique non-NULL values in a column. It combines deduplication and counting in one step. This is more efficient than a subquery with SELECT DISTINCT followed by COUNT(*), because the engine can count during the dedup pass without materializing the intermediate result.

-- How many unique cities do our customers live in?
SELECT COUNT(DISTINCT city) AS unique_cities
FROM customers;

-- Per-department unique skill count
SELECT
  department,
  COUNT(DISTINCT skill) AS unique_skills,
  COUNT(*) AS total_employees
FROM employee_skills
GROUP BY department;

-- Multiple COUNT(DISTINCT) in one query
SELECT
  COUNT(DISTINCT customer_id) AS unique_customers,
  COUNT(DISTINCT product_id) AS unique_products,
  COUNT(*) AS total_orders
FROM orders;

Interview note: COUNT(DISTINCT column) skips NULLs. If a column has 100 rows with 10 NULLs and 5 distinct non-NULL values, COUNT(DISTINCT column) returns 5. Some engines allow COUNT(DISTINCT col1, col2) for multi-column uniqueness; PostgreSQL does not (use a subquery with SELECT DISTINCT instead).

DISTINCT ON (PostgreSQL)

DISTINCT ON is a PostgreSQL extension that keeps one row per unique value of the specified columns. Unlike SELECT DISTINCT, which deduplicates on all columns, DISTINCT ON deduplicates on a subset and lets you control which row survives via ORDER BY. This is PostgreSQL's shorthand for the ROW_NUMBER dedup pattern.

-- Keep the most recent order per customer (PostgreSQL only)
SELECT DISTINCT ON (customer_id)
  customer_id,
  order_id,
  order_date,
  amount
FROM orders
ORDER BY customer_id, order_date DESC;

-- Equivalent using ROW_NUMBER (works everywhere)
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY order_date DESC
    ) AS rn
  FROM orders
) sub
WHERE rn = 1;

Interview note: DISTINCT ON is PostgreSQL-specific. If the interview is engine-agnostic, use the ROW_NUMBER pattern instead. But if they mention PostgreSQL specifically, DISTINCT ON is more concise and shows engine awareness. The ORDER BY must start with the DISTINCT ON columns.

DISTINCT vs GROUP BY

This comparison comes up in nearly every SQL interview that covers fundamentals. The short version: GROUP BY can do everything DISTINCT does, plus aggregation. DISTINCT is syntactic sugar for the dedup-only case.

ScenarioDISTINCTGROUP BY
Get unique valuesSELECT DISTINCT department FROM employeesSELECT department FROM employees GROUP BY department
Count per groupCannot add COUNT(*) to SELECT DISTINCT without GROUP BYSELECT department, COUNT(*) FROM employees GROUP BY department
Dedup with aggregationSubquery: SELECT DISTINCT ... then outer GROUP BYSingle-pass GROUP BY with aggregates
-- These two queries produce identical results
SELECT DISTINCT department FROM employees;
SELECT department FROM employees GROUP BY department;

-- But only GROUP BY supports aggregation
SELECT department, COUNT(*), AVG(salary)
FROM employees
GROUP BY department;

-- DISTINCT cannot do this without a subquery

DISTINCT and NULL Values

Here is the rule that surprises people: for DISTINCT, NULLs are treated as equal. In standard SQL, NULL = NULL evaluates to UNKNOWN (not TRUE). But DISTINCT makes an exception and collapses multiple NULLs into one. This same exception applies to UNION, GROUP BY, and INTERSECT.

-- Table: departments (name column)
-- 'Engineering', 'Marketing', NULL, NULL, 'Sales'

SELECT DISTINCT name FROM departments;
-- Result: 'Engineering', 'Marketing', NULL, 'Sales'
-- Two NULLs collapsed to one

-- COUNT(DISTINCT) skips NULLs entirely
SELECT COUNT(DISTINCT name) FROM departments;
-- Result: 3 (Engineering, Marketing, Sales)
-- NULL is not counted

Key distinction: SELECT DISTINCT keeps one NULL row in the output. COUNT(DISTINCT) removes NULLs entirely from the count. These two behaviors are consistent with their respective purposes: DISTINCT shows unique values (including NULL as a value), while COUNT counts non-NULL values.

DISTINCT Performance

DISTINCT forces the engine to sort or hash the result set to identify duplicates. For small result sets, this is negligible. For large ones, it can dominate query time.

Avoid unnecessary DISTINCT

If the query already produces unique rows (e.g., selecting a primary key column), DISTINCT adds overhead for no benefit. Check whether your joins and filters already guarantee uniqueness before adding DISTINCT.

Fewer columns means faster dedup

SELECT DISTINCT on 2 narrow columns is much faster than SELECT DISTINCT on 10 wide columns. The sort key size directly affects sort performance and memory usage. Only select the columns you actually need.

DISTINCT is often a code smell

If you find yourself adding DISTINCT to fix duplicate rows from a JOIN, the root cause is likely a fan-out in the join (joining on a non-unique key). Fix the join logic instead of masking the duplicates with DISTINCT. Interviewers notice this pattern and will ask about it.

4 DISTINCT Interview Questions

These questions test deduplication mechanics, NULL behavior, and the DISTINCT vs GROUP BY comparison.

Q1: What is the difference between DISTINCT and GROUP BY? When would you use each?

What they test:

Conceptual understanding. When used for deduplication alone (no aggregates), SELECT DISTINCT and GROUP BY produce identical results. The difference is intent and capability. DISTINCT says 'I want unique rows.' GROUP BY says 'I want to aggregate per group.' Interviewers check whether you know that GROUP BY can do everything DISTINCT can, plus aggregation.

Approach:

For unique values only: prefer DISTINCT for readability. For unique values with aggregation: GROUP BY is required. Explain that under the hood, many query optimizers treat SELECT DISTINCT and GROUP BY with no aggregates identically. The execution plan is often the same.

Q2: How does SELECT DISTINCT handle NULLs?

What they test:

NULL equality rules. In most SQL contexts, NULL != NULL. But for DISTINCT, NULLs are treated as equal. If a column has multiple NULL values, SELECT DISTINCT keeps only one NULL row. This is a special case of the SQL standard that contradicts the general NULL inequality rule. The interviewer checks if you know this exception.

Approach:

State that DISTINCT treats NULLs as equal for deduplication purposes. Multiple NULLs collapse into one. This applies to SELECT DISTINCT, UNION (which uses DISTINCT logic), and GROUP BY. The standard defines this behavior explicitly.

Q3: A query with SELECT DISTINCT on 5 columns runs slowly. How would you investigate and optimize?

What they test:

Performance troubleshooting. DISTINCT requires sorting or hashing the entire result set across all selected columns. Five wide columns means a large sort key. The interviewer wants to hear about checking the execution plan, asking whether all 5 columns are actually needed, considering indexes, and whether the dedup can be pushed closer to the source (like deduping on a key column first).

Approach:

First: check the execution plan for Sort or Hash Aggregate nodes. Second: reduce columns if possible (DISTINCT on 2 columns is cheaper than 5). Third: add an index if the DISTINCT columns are a common access pattern. Fourth: consider whether the data source already guarantees uniqueness (making DISTINCT unnecessary). Fifth: for very large tables, materialized views or pre-aggregated tables may be better.

Q4: Write a query to find customers who have ordered from at least 3 different product categories.

What they test:

Combining COUNT(DISTINCT) with HAVING. The query needs to count distinct categories per customer and filter to those with 3+. This tests whether the candidate knows to use COUNT(DISTINCT category) inside a GROUP BY with a HAVING clause.

Approach:

SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(DISTINCT category) >= 3. The DISTINCT inside COUNT is essential. Without it, a customer who ordered 3 items from the same category would falsely qualify. Mention that COUNT(DISTINCT) skips NULLs, so orders without a category would not be counted.

DISTINCT FAQ

What does DISTINCT do in SQL?+
DISTINCT removes duplicate rows from a result set. SELECT DISTINCT evaluates uniqueness across all columns in the SELECT list. Two rows are considered duplicates only if every selected column value matches. DISTINCT is applied after the query processes FROM, WHERE, GROUP BY, and HAVING, but before ORDER BY and LIMIT.
What is the difference between DISTINCT and GROUP BY?+
When used without aggregate functions, DISTINCT and GROUP BY produce identical results: one row per unique combination of values. The difference is that GROUP BY supports aggregate functions (COUNT, SUM, AVG, etc.) while DISTINCT alone does not. Use DISTINCT when you only need unique rows. Use GROUP BY when you need unique rows plus calculations per group. Most optimizers generate the same execution plan for both when no aggregates are involved.
Does DISTINCT treat NULLs as equal?+
Yes. For deduplication purposes, DISTINCT treats all NULLs as equal. Multiple rows with NULL in a column collapse to one row. This is an exception to the general SQL rule that NULL != NULL. The same behavior applies to UNION, GROUP BY, and INTERSECT. COUNT(DISTINCT column) still skips NULLs entirely.
What is DISTINCT ON in PostgreSQL?+
DISTINCT ON (columns) keeps one row per unique combination of the specified columns. Unlike SELECT DISTINCT, which requires all selected columns to match for dedup, DISTINCT ON deduplicates on a subset. The ORDER BY clause determines which row survives for each group (the first row in the sort order). It is PostgreSQL-specific and equivalent to the ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) pattern.

Stop Reaching for DISTINCT First

The next time you type SELECT DISTINCT, ask why the duplicates exist. Run the query without DISTINCT, find the real grain, and write the JOIN that doesn't fan out. The interviewer notices every time.