An Airflow DAG at a streaming startup crossed its 15-minute SLA every morning for six weeks before anyone looked at the query. The culprit was a plain UNION across 14 regional shards that was sort-deduping 280M rows the pipeline had already guaranteed were unique. Swapping UNION for UNION ALL dropped the stage from 11 minutes to 40 seconds. One word. Six weeks of pager pain.
Based on production patterns, roughly 90% of pipeline UNION usage should be UNION ALL. The dedup path is expensive and almost always unnecessary when the upstream grain is already unique. Interviewers know this, and they watch whether you default to ALL or reach for the slower variant out of habit.
Pipeline UNIONs should be ALL
Rows sort-deduped needlessly
Stage runtime before fix
After the one-word fix
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Both operators slot between SELECTs with a strict contract: same column count, compatible types, same positional order. The first SELECT's column names win. Skip that rule once and you'll get an error loud enough to spot. Skip the UNION vs UNION ALL choice and your pipeline just runs slower every night until somebody notices.
-- UNION: removes duplicates
SELECT column1, column2 FROM table_a
UNION
SELECT column1, column2 FROM table_b;
-- UNION ALL: keeps all rows
SELECT column1, column2 FROM table_a
UNION ALL
SELECT column1, column2 FROM table_b;
-- Multiple UNIONs (can mix UNION and UNION ALL)
SELECT id, name FROM customers_2023
UNION ALL
SELECT id, name FROM customers_2024
UNION ALL
SELECT id, name FROM customers_2025;Interview note: ORDER BY can only appear once, at the very end, after all UNION/UNION ALL operations. It sorts the final combined result. Putting ORDER BY inside an individual SELECT (without wrapping it in a subquery) produces a syntax error in most engines.
Every difference flows from the deduplication step. UNION pays the cost; UNION ALL does not.
| Feature | UNION | UNION ALL |
|---|---|---|
| Duplicate handling | Removes duplicate rows across both result sets | Keeps every row from both result sets, duplicates included |
| Sorting | Implicitly sorts to detect duplicates (or uses hash-based dedup) | No sorting or hashing needed |
| Performance | Slower due to dedup overhead (sort or hash) | Faster because it just concatenates results |
| Row count | Less than or equal to the sum of both queries | Exactly the sum of both queries |
| Use case | When duplicates between sets must be eliminated | When you want all rows or know there are no duplicates |
-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')
-- UNION result: 3 rows
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')
-- Bob appears once because UNION deduplicates
-- UNION ALL result: 4 rows
-- (1, 'Alice'), (2, 'Bob'), (2, 'Bob'), (3, 'Carol')
-- Bob appears twice, once from each tableUNION compares entire rows to detect duplicates. If even one column differs (like a timestamp), the rows are considered distinct and both are kept. This is why UNION sometimes fails to deduplicate the way people expect: it does exact row matching, not key-based matching.
The performance gap between UNION and UNION ALL grows linearly with data volume. For small result sets (hundreds of rows), the difference is negligible. For millions of rows, it is substantial.
UNION overhead: sort or hash
The engine must build a data structure (sorted list or hash table) containing every row from the combined output. For a result set of N rows, this costs O(N log N) for sort-based dedup or O(N) for hash-based dedup, plus the memory to hold the structure. If the result does not fit in memory, it spills to disk.
UNION ALL overhead: zero
UNION ALL streams rows from each SELECT sequentially. No buffering, no sorting, no hashing. The cost is simply the sum of executing each individual SELECT. For pipeline queries that combine partitioned tables, this is the correct choice 90%+ of the time.
Pipeline rule of thumb: Use UNION ALL by default. Only switch to UNION when you have confirmed that cross-source duplicates exist and must be eliminated. In many pipelines, each source has unique primary keys, making UNION unnecessary and wasteful.
Four patterns that cover the most common UNION and UNION ALL scenarios in both interviews and production pipelines.
The most common UNION ALL use case in data engineering: stacking tables that share the same schema. Regional tables, monthly partitions, or source-specific staging tables often need to be combined into one unified dataset. Since each source table has unique rows, UNION ALL is correct and faster than UNION.
-- Combine regional order tables
SELECT order_id, customer_id, amount, 'US' AS region
FROM orders_us
UNION ALL
SELECT order_id, customer_id, amount, 'EU' AS region
FROM orders_eu
UNION ALL
SELECT order_id, customer_id, amount, 'APAC' AS region
FROM orders_apac;When two data sources might contain the same rows (like overlapping date ranges from incremental loads), UNION removes the duplicates automatically. This is a quick dedup approach, though it compares every column. If you only need to dedup on specific columns, a CTE with ROW_NUMBER is more precise.
-- Two overlapping exports, remove exact duplicates
SELECT customer_id, email, signup_date
FROM export_jan
UNION
SELECT customer_id, email, signup_date
FROM export_feb;
-- Rows appearing in both months are kept only onceSometimes you need a set of values from multiple sources for filtering. UNION deduplicates the combined set so you do not get inflated results when using it as a subquery. This pattern appears in interview questions where you need to find all customers from multiple criteria.
-- All customers who placed an order OR submitted a support ticket
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM support_tickets;
-- Use as a filter
SELECT * FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM support_tickets
);UNION ALL can append summary rows to detail rows. This pattern is common in reporting queries where you want a totals row at the bottom. Each SELECT produces a different type of row, and UNION ALL stacks them together. Column count and types must match across all SELECTs.
-- Detail rows plus a summary row
SELECT
department,
employee_name,
salary
FROM employees
UNION ALL
SELECT
'TOTAL' AS department,
NULL AS employee_name,
SUM(salary) AS salary
FROM employees;SQL has four set operators. UNION and UNION ALL combine sets. INTERSECT finds rows present in both. EXCEPT (MINUS in Oracle) finds rows present in the first but not the second. All four require matching column counts and types.
-- INTERSECT: customers who ordered AND returned something
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM returns;
-- EXCEPT: customers who ordered but never returned
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM returns;
-- Both INTERSECT and EXCEPT deduplicate by default
-- INTERSECT ALL and EXCEPT ALL preserve duplicates (PostgreSQL)Interviewers sometimes ask about all four set operators in one question. Knowing INTERSECT and EXCEPT alongside UNION shows breadth. The main gotcha: EXCEPT is not commutative. A EXCEPT B is not the same as B EXCEPT A.
These questions test set operation knowledge and performance awareness. Each includes the interviewer's intent and a recommended approach.
What they test:
Fundamental understanding. UNION deduplicates by comparing all columns across both result sets. UNION ALL concatenates without dedup. The interviewer wants to hear about the performance difference and wants you to default to UNION ALL unless dedup is explicitly needed. Using UNION when UNION ALL is sufficient wastes compute and signals that the candidate does not think about query cost.
Approach:
State the difference clearly: UNION = dedup, UNION ALL = no dedup. Then give the performance reason: UNION requires sorting or hashing to find duplicates. Default to UNION ALL in pipelines where sources have unique keys. Use UNION only when overlapping data might produce genuine duplicates that need removal.
What they test:
Practical dedup in a pipeline context. The interviewer checks whether you understand that UNION deduplicates across all columns, which may not be what you want. If two records share a customer_id but have different updated_at timestamps, UNION treats them as different rows. The interviewer wants to see awareness of this limitation.
Approach:
If exact-row dedup is sufficient, UNION works. If dedup should be based on a key (like customer_id keeping the latest version), use UNION ALL followed by ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) and filter to rn = 1. Explain why the second approach is more reliable.
What they test:
Performance awareness. UNION on 10 tables means the engine must deduplicate across all 10 result sets combined. If the monthly tables have non-overlapping data (each row appears in exactly one month), UNION ALL is correct and dramatically faster. The interviewer checks whether the candidate asks 'do duplicates actually exist across months?' before optimizing.
Approach:
First question: can rows appear in multiple monthly tables? If no, switch to UNION ALL. If yes, consider UNION ALL with a downstream dedup step using ROW_NUMBER. Also check if the 10 tables can be replaced with a single partitioned table query, eliminating the UNION entirely.
What they test:
This is a set difference problem, not a UNION problem. The interviewer may frame it in the context of UNION/INTERSECT/EXCEPT to see if you know all three set operators. The answer is EXCEPT (or MINUS in Oracle). A candidate who tries to solve this with UNION alone demonstrates limited set operation knowledge.
Approach:
SELECT customer_id FROM orders EXCEPT SELECT customer_id FROM returns. Alternative: LEFT JOIN with IS NULL check, or NOT EXISTS subquery. Mention that EXCEPT deduplicates by default (like UNION). If you need all occurrences, some engines support EXCEPT ALL.
Run the same query with UNION and UNION ALL, read both plans, and watch the sort node vanish. Once you've seen it, you won't reach for plain UNION on muscle memory again.