SQL UNION vs UNION ALL Explained

An Airflow DAG at a streaming startup crossed its 15-minute SLA every morning for six weeks before anyone looked at the query. The culprit was a plain UNION across 14 regional shards that was sort-deduping 280M rows the pipeline had already guaranteed were unique. Swapping UNION for UNION ALL dropped the stage from 11 minutes to 40 seconds. One word. Six weeks of pager pain.

90%

Pipeline UNIONs should be ALL

280M

Rows sort-deduped needlessly

11m

Stage runtime before fix

40s

After the one-word fix

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Prepare for the interview

01 / Open invite

02min.

Know UNION vs UNION ALL the way the interviewer who asks it knows it.

a UNION vs UNION ALL query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1SELECT user_id,

2 COUNT(*) AS sessions

3FROM events

4WHERE ts >= NOW() - INTERVAL '7 day'

Execute your solution0.4s avg.

DoorDashInterview question

Solve a UNION vs UNION ALL problem

Syntax

Both operators slot between SELECTs with a strict contract: same column count, compatible types, same positional order. The first SELECT's column names win. Skip that rule once and you'll get an error loud enough to spot. Skip the UNION vs UNION ALL choice and your pipeline just runs slower every night until somebody notices.

-- UNION: removes duplicates
SELECT region, status FROM orders
UNION
SELECT region, status FROM orders;

-- UNION ALL: keeps all rows
SELECT region, status FROM orders
UNION ALL
SELECT region, status FROM orders;

-- Multiple UNIONs (can mix UNION and UNION ALL)
SELECT user_id, username FROM users WHERE account_status = 'active'
UNION ALL
SELECT user_id, username FROM users WHERE account_status = 'inactive'
UNION ALL
SELECT user_id, username FROM users WHERE account_status = 'suspended';

Interview note: ORDER BY can only appear once, at the very end, after all UNION/UNION ALL operations. It sorts the final combined result. Putting ORDER BY inside an individual SELECT (without wrapping it in a subquery) produces a syntax error in most engines.

UNION vs UNION ALL Comparison

Every difference flows from the deduplication step. UNION pays the cost; UNION ALL does not.

Feature	UNION	UNION ALL
Duplicate handling	Removes duplicate rows across both result sets	Keeps every row from both result sets, duplicates included
Sorting	Implicitly sorts to detect duplicates (or uses hash-based dedup)	No sorting or hashing needed
Performance	Slower due to dedup overhead (sort or hash)	Faster because it just concatenates results
Row count	Less than or equal to the sum of both queries	Exactly the sum of both queries
Use case	When duplicates between sets must be eliminated	When you want all rows or know there are no duplicates

Concrete Example

-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')

-- UNION result: 3 rows
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')
-- Bob appears once because UNION deduplicates

-- UNION ALL result: 4 rows
-- (1, 'Alice'), (2, 'Bob'), (2, 'Bob'), (3, 'Carol')
-- Bob appears twice, once from each table

UNION compares entire rows to detect duplicates. If even one column differs (like a timestamp), the rows are considered distinct and both are kept. This is why UNION sometimes fails to deduplicate the way people expect: it does exact row matching, not key-based matching.

Chat Activity

> The community team is evaluating channel health and wants to know which chat channels are active and which are dead weight. For each channel, show the total number of messages, the number of unique senders, and the percentage of messages that have been edited. Round the edited percentage to one decimal place. Only include channels with at least five messages, sorted from most to fewest.

Performance Implications

The performance gap between UNION and UNION ALL grows linearly with data volume. For small result sets (hundreds of rows), the difference is negligible. For millions of rows, it is substantial.

UNION overhead: sort or hash. The engine must build a data structure (sorted list or hash table) containing every row from the combined output. For a result set of N rows, this costs O(N log N) for sort-based dedup or O(N) for hash-based dedup, plus the memory to hold the structure. If the result does not fit in memory, it spills to disk.
UNION ALL overhead: zero. UNION ALL streams rows from each SELECT sequentially. No buffering, no sorting, no hashing. The cost is simply the sum of executing each individual SELECT. For pipeline queries that combine partitioned tables, this is the correct choice 90%+ of the time.
Pipeline rule of thumb: Use UNION ALL by default. Only switch to UNION when you have confirmed that cross-source duplicates exist and must be eliminated. In many pipelines, each source has unique primary keys, making UNION unnecessary and wasteful.

Common Patterns

Four patterns that cover the most common UNION and UNION ALL scenarios in both interviews and production pipelines.

Combining Tables with Identical Schema

The most common UNION ALL use case in data engineering: stacking tables that share the same schema. Regional tables, monthly partitions, or source-specific staging tables often need to be combined into one unified dataset. Since each source table has unique rows, UNION ALL is correct and faster than UNION.

-- Combine regional order subsets
SELECT order_id, status, profit, 'US' AS region
FROM orders WHERE region = 'US'
UNION ALL
SELECT order_id, status, profit, 'EU' AS region
FROM orders WHERE region = 'EU'
UNION ALL
SELECT order_id, status, profit, 'APAC' AS region
FROM orders WHERE region = 'APAC';

Deduplicating Overlapping Sources

When two data sources might contain the same rows (like overlapping date ranges from incremental loads), UNION removes the duplicates automatically. This is a quick dedup approach, though it compares every column. If you only need to dedup on specific columns, a CTE with ROW_NUMBER is more precise.

-- Two overlapping exports, remove exact duplicates
SELECT user_id, email, signup_date
FROM users
WHERE user_id <= 294
UNION
SELECT user_id, email, signup_date
FROM users
WHERE user_id BETWEEN 197 AND 488;
-- Rows appearing in both exports are kept only once

Building Reference Sets for IN Clauses

Sometimes you need a set of values from multiple sources for filtering. UNION deduplicates the combined set so you do not get inflated results when using it as a subquery. This pattern appears in interview questions where you need to find all customers from multiple criteria.

-- All users who made a transaction OR triggered an event
SELECT user_id FROM transactions
UNION
SELECT user_id FROM event_data;

-- Use as a filter
SELECT * FROM users
WHERE user_id IN (
  SELECT user_id FROM transactions
  UNION
  SELECT user_id FROM event_data
);

Adding Computed Rows to a Result Set

UNION ALL can append summary rows to detail rows. This pattern is common in reporting queries where you want a totals row at the bottom. Each SELECT produces a different type of row, and UNION ALL stacks them together. Column count and types must match across all SELECTs.

-- Detail rows plus a summary row
SELECT
  department,
  emp_name,
  salary
FROM employees
UNION ALL
SELECT
  'TOTAL' AS department,
  NULL AS emp_name,
  SUM(salary) AS salary
FROM employees;

Related Set Operators: INTERSECT and EXCEPT

SQL has four set operators. UNION and UNION ALL combine sets. INTERSECT finds rows present in both. EXCEPT (MINUS in Oracle) finds rows present in the first but not the second. All four require matching column counts and types.

-- INTERSECT: users who transacted AND triggered an event
SELECT user_id FROM transactions
INTERSECT
SELECT user_id FROM event_data;

-- EXCEPT: users who transacted but never triggered an event
SELECT user_id FROM transactions
EXCEPT
SELECT user_id FROM event_data;

-- Both INTERSECT and EXCEPT deduplicate by default
-- INTERSECT ALL and EXCEPT ALL preserve duplicates (PostgreSQL)

Interviewers sometimes ask about all four set operators in one question. Knowing INTERSECT and EXCEPT alongside UNION shows breadth. The main gotcha: EXCEPT is not commutative. A EXCEPT B is not the same as B EXCEPT A.

4 UNION Interview Questions

Q1: What is the difference between UNION and UNION ALL? When would you use each?

What they test: Fundamental understanding. UNION deduplicates by comparing all columns across both result sets. UNION ALL concatenates without dedup. The interviewer wants to hear about the performance difference and wants you to default to UNION ALL unless dedup is explicitly needed. Using UNION when UNION ALL is sufficient wastes compute and signals that the candidate does not think about query cost. Approach: State the difference clearly: UNION = dedup, UNION ALL = no dedup. Then give the performance reason: UNION requires sorting or hashing to find duplicates. Default to UNION ALL in pipelines where sources have unique keys. Use UNION only when overlapping data might produce genuine duplicates that need removal.

Q2: You have two staging tables from different data sources that might have overlapping records. How would you combine them and handle duplicates?

What they test: Practical dedup in a pipeline context. The interviewer checks whether you understand that UNION deduplicates across all columns, which may not be what you want. If two records share a customer_id but have different updated_at timestamps, UNION treats them as different rows. The interviewer wants to see awareness of this limitation. Approach: If exact-row dedup is sufficient, UNION works. If dedup should be based on a key (like customer_id keeping the latest version), use UNION ALL followed by ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) and filter to rn = 1. Explain why the second approach is more reliable.

Q3: A query uses UNION to combine 10 monthly tables. It runs slowly. How would you improve it?

What they test: Performance awareness. UNION on 10 tables means the engine must deduplicate across all 10 result sets combined. If the monthly tables have non-overlapping data (each row appears in exactly one month), UNION ALL is correct and dramatically faster. The interviewer checks whether the candidate asks 'do duplicates actually exist across months?' before optimizing. Approach: First question: can rows appear in multiple monthly tables? If no, switch to UNION ALL. If yes, consider UNION ALL with a downstream dedup step using ROW_NUMBER. Also check if the 10 tables can be replaced with a single partitioned table query, eliminating the UNION entirely.

Q4: Write a query that finds customer IDs present in the orders table but not in the returns table.

What they test: This is a set difference problem, not a UNION problem. The interviewer may frame it in the context of UNION/INTERSECT/EXCEPT to see if you know all three set operators. The answer is EXCEPT (or MINUS in Oracle). A candidate who tries to solve this with UNION alone demonstrates limited set operation knowledge. Approach: SELECT customer_id FROM orders EXCEPT SELECT customer_id FROM returns. Alternative: LEFT JOIN with IS NULL check, or NOT EXISTS subquery. Mention that EXCEPT deduplicates by default (like UNION). If you need all occurrences, some engines support EXCEPT ALL.

Frequently asked questions

What is the difference between UNION and UNION ALL in SQL?+

UNION combines two result sets and removes duplicate rows. UNION ALL combines two result sets and keeps all rows, including duplicates. UNION is slower because it must sort or hash the combined result to find and eliminate duplicates. UNION ALL simply concatenates the rows. Use UNION ALL when you know duplicates do not exist or when you want to keep them.

Is UNION ALL faster than UNION?+

Yes. UNION ALL is faster because it skips the deduplication step. UNION must compare every row across both result sets to find duplicates, which requires sorting or building a hash table. For large datasets, this overhead is significant. In data engineering pipelines, UNION ALL is the default choice unless deduplication is explicitly required.

Do UNION and UNION ALL require the same number of columns?+

Yes. Both require that all SELECT statements have the same number of columns and that the corresponding columns have compatible data types. The column names in the output come from the first SELECT statement. If the column counts do not match, the query will fail with a syntax error.

Does UNION remove duplicates within a single SELECT or only between SELECTs?+

UNION removes duplicates across the entire combined result. If the first SELECT returns duplicate rows on its own, UNION removes those too. It operates on the full merged set, not just the overlap between queries. UNION ALL preserves all rows from all SELECTs, including duplicates within a single SELECT.

02 / Why practice

Default to ALL. Your SLA Will Thank You.

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Try a SQL Problem