SQL Practice

SQL UNION vs UNION ALL

An Airflow DAG at a streaming startup crossed its 15-minute SLA every morning for six weeks before anyone looked at the query. The culprit was a plain UNION across 14 regional shards that was sort-deduping 280M rows the pipeline had already guaranteed were unique. Swapping UNION for UNION ALL dropped the stage from 11 minutes to 40 seconds. One word. Six weeks of pager pain.

Based on production patterns, roughly 90% of pipeline UNION usage should be UNION ALL. The dedup path is expensive and almost always unnecessary when the upstream grain is already unique. Interviewers know this, and they watch whether you default to ALL or reach for the slower variant out of habit.

90%

Pipeline UNIONs should be ALL

280M

Rows sort-deduped needlessly

11m

Stage runtime before fix

40s

After the one-word fix

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Syntax

Both operators slot between SELECTs with a strict contract: same column count, compatible types, same positional order. The first SELECT's column names win. Skip that rule once and you'll get an error loud enough to spot. Skip the UNION vs UNION ALL choice and your pipeline just runs slower every night until somebody notices.

-- UNION: removes duplicates
SELECT column1, column2 FROM table_a
UNION
SELECT column1, column2 FROM table_b;

-- UNION ALL: keeps all rows
SELECT column1, column2 FROM table_a
UNION ALL
SELECT column1, column2 FROM table_b;

-- Multiple UNIONs (can mix UNION and UNION ALL)
SELECT id, name FROM customers_2023
UNION ALL
SELECT id, name FROM customers_2024
UNION ALL
SELECT id, name FROM customers_2025;

Interview note: ORDER BY can only appear once, at the very end, after all UNION/UNION ALL operations. It sorts the final combined result. Putting ORDER BY inside an individual SELECT (without wrapping it in a subquery) produces a syntax error in most engines.

UNION vs UNION ALL Comparison

Every difference flows from the deduplication step. UNION pays the cost; UNION ALL does not.

FeatureUNIONUNION ALL
Duplicate handlingRemoves duplicate rows across both result setsKeeps every row from both result sets, duplicates included
SortingImplicitly sorts to detect duplicates (or uses hash-based dedup)No sorting or hashing needed
PerformanceSlower due to dedup overhead (sort or hash)Faster because it just concatenates results
Row countLess than or equal to the sum of both queriesExactly the sum of both queries
Use caseWhen duplicates between sets must be eliminatedWhen you want all rows or know there are no duplicates

Concrete Example

-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')

-- UNION result: 3 rows
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')
-- Bob appears once because UNION deduplicates

-- UNION ALL result: 4 rows
-- (1, 'Alice'), (2, 'Bob'), (2, 'Bob'), (3, 'Carol')
-- Bob appears twice, once from each table

UNION compares entire rows to detect duplicates. If even one column differs (like a timestamp), the rows are considered distinct and both are kept. This is why UNION sometimes fails to deduplicate the way people expect: it does exact row matching, not key-based matching.

Performance Implications

The performance gap between UNION and UNION ALL grows linearly with data volume. For small result sets (hundreds of rows), the difference is negligible. For millions of rows, it is substantial.

UNION overhead: sort or hash

The engine must build a data structure (sorted list or hash table) containing every row from the combined output. For a result set of N rows, this costs O(N log N) for sort-based dedup or O(N) for hash-based dedup, plus the memory to hold the structure. If the result does not fit in memory, it spills to disk.

UNION ALL overhead: zero

UNION ALL streams rows from each SELECT sequentially. No buffering, no sorting, no hashing. The cost is simply the sum of executing each individual SELECT. For pipeline queries that combine partitioned tables, this is the correct choice 90%+ of the time.

Pipeline rule of thumb: Use UNION ALL by default. Only switch to UNION when you have confirmed that cross-source duplicates exist and must be eliminated. In many pipelines, each source has unique primary keys, making UNION unnecessary and wasteful.

Common Patterns

Four patterns that cover the most common UNION and UNION ALL scenarios in both interviews and production pipelines.

Combining Tables with Identical Schema

The most common UNION ALL use case in data engineering: stacking tables that share the same schema. Regional tables, monthly partitions, or source-specific staging tables often need to be combined into one unified dataset. Since each source table has unique rows, UNION ALL is correct and faster than UNION.

-- Combine regional order tables
SELECT order_id, customer_id, amount, 'US' AS region
FROM orders_us
UNION ALL
SELECT order_id, customer_id, amount, 'EU' AS region
FROM orders_eu
UNION ALL
SELECT order_id, customer_id, amount, 'APAC' AS region
FROM orders_apac;

Deduplicating Overlapping Sources

When two data sources might contain the same rows (like overlapping date ranges from incremental loads), UNION removes the duplicates automatically. This is a quick dedup approach, though it compares every column. If you only need to dedup on specific columns, a CTE with ROW_NUMBER is more precise.

-- Two overlapping exports, remove exact duplicates
SELECT customer_id, email, signup_date
FROM export_jan
UNION
SELECT customer_id, email, signup_date
FROM export_feb;
-- Rows appearing in both months are kept only once

Building Reference Sets for IN Clauses

Sometimes you need a set of values from multiple sources for filtering. UNION deduplicates the combined set so you do not get inflated results when using it as a subquery. This pattern appears in interview questions where you need to find all customers from multiple criteria.

-- All customers who placed an order OR submitted a support ticket
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM support_tickets;

-- Use as a filter
SELECT * FROM customers
WHERE customer_id IN (
  SELECT customer_id FROM orders
  UNION
  SELECT customer_id FROM support_tickets
);

Adding Computed Rows to a Result Set

UNION ALL can append summary rows to detail rows. This pattern is common in reporting queries where you want a totals row at the bottom. Each SELECT produces a different type of row, and UNION ALL stacks them together. Column count and types must match across all SELECTs.

-- Detail rows plus a summary row
SELECT
  department,
  employee_name,
  salary
FROM employees
UNION ALL
SELECT
  'TOTAL' AS department,
  NULL AS employee_name,
  SUM(salary) AS salary
FROM employees;

Related Set Operators: INTERSECT and EXCEPT

SQL has four set operators. UNION and UNION ALL combine sets. INTERSECT finds rows present in both. EXCEPT (MINUS in Oracle) finds rows present in the first but not the second. All four require matching column counts and types.

-- INTERSECT: customers who ordered AND returned something
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM returns;

-- EXCEPT: customers who ordered but never returned
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM returns;

-- Both INTERSECT and EXCEPT deduplicate by default
-- INTERSECT ALL and EXCEPT ALL preserve duplicates (PostgreSQL)

Interviewers sometimes ask about all four set operators in one question. Knowing INTERSECT and EXCEPT alongside UNION shows breadth. The main gotcha: EXCEPT is not commutative. A EXCEPT B is not the same as B EXCEPT A.

4 UNION Interview Questions

These questions test set operation knowledge and performance awareness. Each includes the interviewer's intent and a recommended approach.

Q1: What is the difference between UNION and UNION ALL? When would you use each?

What they test:

Fundamental understanding. UNION deduplicates by comparing all columns across both result sets. UNION ALL concatenates without dedup. The interviewer wants to hear about the performance difference and wants you to default to UNION ALL unless dedup is explicitly needed. Using UNION when UNION ALL is sufficient wastes compute and signals that the candidate does not think about query cost.

Approach:

State the difference clearly: UNION = dedup, UNION ALL = no dedup. Then give the performance reason: UNION requires sorting or hashing to find duplicates. Default to UNION ALL in pipelines where sources have unique keys. Use UNION only when overlapping data might produce genuine duplicates that need removal.

Q2: You have two staging tables from different data sources that might have overlapping records. How would you combine them and handle duplicates?

What they test:

Practical dedup in a pipeline context. The interviewer checks whether you understand that UNION deduplicates across all columns, which may not be what you want. If two records share a customer_id but have different updated_at timestamps, UNION treats them as different rows. The interviewer wants to see awareness of this limitation.

Approach:

If exact-row dedup is sufficient, UNION works. If dedup should be based on a key (like customer_id keeping the latest version), use UNION ALL followed by ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) and filter to rn = 1. Explain why the second approach is more reliable.

Q3: A query uses UNION to combine 10 monthly tables. It runs slowly. How would you improve it?

What they test:

Performance awareness. UNION on 10 tables means the engine must deduplicate across all 10 result sets combined. If the monthly tables have non-overlapping data (each row appears in exactly one month), UNION ALL is correct and dramatically faster. The interviewer checks whether the candidate asks 'do duplicates actually exist across months?' before optimizing.

Approach:

First question: can rows appear in multiple monthly tables? If no, switch to UNION ALL. If yes, consider UNION ALL with a downstream dedup step using ROW_NUMBER. Also check if the 10 tables can be replaced with a single partitioned table query, eliminating the UNION entirely.

Q4: Write a query that finds customer IDs present in the orders table but not in the returns table.

What they test:

This is a set difference problem, not a UNION problem. The interviewer may frame it in the context of UNION/INTERSECT/EXCEPT to see if you know all three set operators. The answer is EXCEPT (or MINUS in Oracle). A candidate who tries to solve this with UNION alone demonstrates limited set operation knowledge.

Approach:

SELECT customer_id FROM orders EXCEPT SELECT customer_id FROM returns. Alternative: LEFT JOIN with IS NULL check, or NOT EXISTS subquery. Mention that EXCEPT deduplicates by default (like UNION). If you need all occurrences, some engines support EXCEPT ALL.

UNION vs UNION ALL FAQ

What is the difference between UNION and UNION ALL in SQL?+
UNION combines two result sets and removes duplicate rows. UNION ALL combines two result sets and keeps all rows, including duplicates. UNION is slower because it must sort or hash the combined result to find and eliminate duplicates. UNION ALL simply concatenates the rows. Use UNION ALL when you know duplicates do not exist or when you want to keep them.
Is UNION ALL faster than UNION?+
Yes. UNION ALL is faster because it skips the deduplication step. UNION must compare every row across both result sets to find duplicates, which requires sorting or building a hash table. For large datasets, this overhead is significant. In data engineering pipelines, UNION ALL is the default choice unless deduplication is explicitly required.
Do UNION and UNION ALL require the same number of columns?+
Yes. Both require that all SELECT statements have the same number of columns and that the corresponding columns have compatible data types. The column names in the output come from the first SELECT statement. If the column counts do not match, the query will fail with a syntax error.
Does UNION remove duplicates within a single SELECT or only between SELECTs?+
UNION removes duplicates across the entire combined result. If the first SELECT returns duplicate rows on its own, UNION removes those too. It operates on the full merged set, not just the overlap between queries. UNION ALL preserves all rows from all SELECTs, including duplicates within a single SELECT.

Default to ALL. Your SLA Will Thank You.

Run the same query with UNION and UNION ALL, read both plans, and watch the sort node vanish. Once you've seen it, you won't reach for plain UNION on muscle memory again.