SQL UNION vs UNION ALL Explained
An Airflow DAG at a streaming startup crossed its 15-minute SLA every morning for six weeks before anyone looked at the query. The culprit was a plain UNION across 14 regional shards that was sort-deduping 280M rows the pipeline had already guaranteed were unique. Swapping UNION for UNION ALL dropped the stage from 11 minutes to 40 seconds. One word. Six weeks of pager pain.
Syntax
Know UNION vs UNION ALL the way the interviewer who asks it knows it.
| Feature | UNION | UNION ALL |
|---|---|---|
| Duplicate handling | Removes duplicate rows across both result sets | Keeps every row from both result sets, duplicates included |
| Sorting | Implicitly sorts to detect duplicates (or uses hash-based dedup) | No sorting or hashing needed |
| Performance | Slower due to dedup overhead (sort or hash) | Faster because it just concatenates results |
| Row count | Less than or equal to the sum of both queries | Exactly the sum of both queries |
| Use case | When duplicates between sets must be eliminated | When you want all rows or know there are no duplicates |
UNION vs UNION ALL Comparison
-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')
-- UNION result: 3 rows
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')
-- Bob appears once because UNION deduplicates
-- UNION ALL result: 4 rows
-- (1, 'Alice'), (2, 'Bob'), (2, 'Bob'), (3, 'Carol')
-- Bob appears twice, once from each tableConcrete Example
-- UNION: removes duplicates
SELECT column1, column2 FROM table_a
UNION
SELECT column1, column2 FROM table_b;
-- UNION ALL: keeps all rows
SELECT column1, column2 FROM table_a
UNION ALL
SELECT column1, column2 FROM table_b;
-- Multiple UNIONs (can mix UNION and UNION ALL)
SELECT id, name FROM customers_2023
UNION ALL
SELECT id, name FROM customers_2024
UNION ALL
SELECT id, name FROM customers_2025;Top Product Categories by Sales
The highest-grossing categories.
Pulled from debriefs where SQL was the gate.
Performance Implications
Combining Tables with Identical Schema
The most common UNION ALL use case in data engineering: stacking tables that share the same schema. Regional tables, monthly partitions, or source-specific staging tables often need to be combined into one unified dataset. Since each source table has unique rows, UNION ALL is correct and faster than UNION.
-- Combine regional order tables
SELECT order_id, customer_id, amount, 'US' AS region
FROM orders_us
UNION ALL
SELECT order_id, customer_id, amount, 'EU' AS region
FROM orders_eu
UNION ALL
SELECT order_id, customer_id, amount, 'APAC' AS region
FROM orders_apac;Deduplicating Overlapping Sources
When two data sources might contain the same rows (like overlapping date ranges from incremental loads), UNION removes the duplicates automatically. This is a quick dedup approach, though it compares every column. If you only need to dedup on specific columns, a CTE with ROW_NUMBER is more precise.
-- Two overlapping exports, remove exact duplicates
SELECT customer_id, email, signup_date
FROM export_jan
UNION
SELECT customer_id, email, signup_date
FROM export_feb;
-- Rows appearing in both months are kept only onceBuilding Reference Sets for IN Clauses
Sometimes you need a set of values from multiple sources for filtering. UNION deduplicates the combined set so you do not get inflated results when using it as a subquery. This pattern appears in interview questions where you need to find all customers from multiple criteria.
-- All customers who placed an order OR submitted a support ticket
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM support_tickets;
-- Use as a filter
SELECT * FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM support_tickets
);Adding Computed Rows to a Result Set
UNION ALL can append summary rows to detail rows. This pattern is common in reporting queries where you want a totals row at the bottom. Each SELECT produces a different type of row, and UNION ALL stacks them together. Column count and types must match across all SELECTs.
-- Detail rows plus a summary row
SELECT
department,
employee_name,
salary
FROM employees
UNION ALL
SELECT
'TOTAL' AS department,
NULL AS employee_name,
SUM(salary) AS salary
FROM employees;Common Patterns
-- INTERSECT: customers who ordered AND returned something
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM returns;
-- EXCEPT: customers who ordered but never returned
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM returns;
-- Both INTERSECT and EXCEPT deduplicate by default
-- INTERSECT ALL and EXCEPT ALL preserve duplicates (PostgreSQL)Related Set Operators: INTERSECT and EXCEPT
4 UNION Interview Questions
Q1: What is the difference between UNION and UNION ALL? When would you use each?
What they test: Fundamental understanding. UNION deduplicates by comparing all columns across both result sets. UNION ALL concatenates without dedup. The interviewer wants to hear about the performance difference and wants you to default to UNION ALL unless dedup is explicitly needed. Using UNION when UNION ALL is sufficient wastes compute and signals that the candidate does not think about query cost. Approach: State the difference clearly: UNION = dedup, UNION ALL = no dedup. Then give the performance reason: UNION requires sorting or hashing to find duplicates. Default to UNION ALL in pipelines where sources have unique keys. Use UNION only when overlapping data might produce genuine duplicates that need removal.
Q2: You have two staging tables from different data sources that might have overlapping records. How would you combine them and handle duplicates?
What they test: Practical dedup in a pipeline context. The interviewer checks whether you understand that UNION deduplicates across all columns, which may not be what you want. If two records share a customer_id but have different updated_at timestamps, UNION treats them as different rows. The interviewer wants to see awareness of this limitation. Approach: If exact-row dedup is sufficient, UNION works. If dedup should be based on a key (like customer_id keeping the latest version), use UNION ALL followed by ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) and filter to rn = 1. Explain why the second approach is more reliable.
Q3: A query uses UNION to combine 10 monthly tables. It runs slowly. How would you improve it?
What they test: Performance awareness. UNION on 10 tables means the engine must deduplicate across all 10 result sets combined. If the monthly tables have non-overlapping data (each row appears in exactly one month), UNION ALL is correct and dramatically faster. The interviewer checks whether the candidate asks 'do duplicates actually exist across months?' before optimizing. Approach: First question: can rows appear in multiple monthly tables? If no, switch to UNION ALL. If yes, consider UNION ALL with a downstream dedup step using ROW_NUMBER. Also check if the 10 tables can be replaced with a single partitioned table query, eliminating the UNION entirely.
Q4: Write a query that finds customer IDs present in the orders table but not in the returns table.
What they test: This is a set difference problem, not a UNION problem. The interviewer may frame it in the context of UNION/INTERSECT/EXCEPT to see if you know all three set operators. The answer is EXCEPT (or MINUS in Oracle). A candidate who tries to solve this with UNION alone demonstrates limited set operation knowledge. Approach: SELECT customer_id FROM orders EXCEPT SELECT customer_id FROM returns. Alternative: LEFT JOIN with IS NULL check, or NOT EXISTS subquery. Mention that EXCEPT deduplicates by default (like UNION). If you need all occurrences, some engines support EXCEPT ALL.
- UNION overhead: sort or hash. The engine must build a data structure (sorted list or hash table) containing every row from the combined output. For a result set of N rows, this costs O(N log N) for sort-based dedup or O(N) for hash-based dedup, plus the memory to hold the structure. If the result does not fit in memory, it spills to disk.
- UNION ALL overhead: zero. UNION ALL streams rows from each SELECT sequentially. No buffering, no sorting, no hashing. The cost is simply the sum of executing each individual SELECT. For pipeline queries that combine partitioned tables, this is the correct choice 90%+ of the time.
- Pipeline rule of thumb: Use UNION ALL by default. Only switch to UNION when you have confirmed that cross-source duplicates exist and must be eliminated. In many pipelines, each source has unique primary keys, making UNION unnecessary and wasteful.
Frequently asked questions
What is the difference between UNION and UNION ALL in SQL?+
Is UNION ALL faster than UNION?+
Do UNION and UNION ALL require the same number of columns?+
Does UNION remove duplicates within a single SELECT or only between SELECTs?+
Default to ALL. Your SLA Will Thank You.
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition