SQL UNION: UNION vs UNION ALL vs JOIN

UNION is the vertical glue in every pipeline. Where JOIN matches rows side by side, UNION stacks them end to end. Architecturally it's the primitive behind every 'combine these regional shards' job, every 'merge archive table with hot table' query, and every historical rollup that spans partitioned storage.

~8% of pipeline rounds include set operations. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

UNION Syntax

The contract is strict by design. Each SELECT contributes the same column count in the same positional order, and the engine reconciles compatible types via implicit casts. The first SELECT wins the column names, which is why every pipeline author learns to put the canonical schema at the top. It's a tiny convention with an outsized effect on downstream views.

-- Basic UNION (removes duplicates)
SELECT region, status FROM orders
UNION
SELECT region, status FROM orders;

-- UNION ALL (keeps duplicates, faster)
SELECT region, status FROM orders
UNION ALL
SELECT region, status FROM orders;

-- Multiple UNIONs with ORDER BY
SELECT region, profit, 'US' AS bucket FROM orders WHERE region = 'US'
UNION ALL
SELECT region, profit, 'EU' AS bucket FROM orders WHERE region = 'EU'
UNION ALL
SELECT region, profit, 'APAC' AS bucket FROM orders WHERE region = 'APAC'
ORDER BY profit DESC;

Column matching rule: Columns are matched by position, not by name. The first column of each SELECT is combined, then the second, and so on. If you accidentally swap two columns in one SELECT, the query still runs but produces incorrect results.
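
To make the positional rule concrete, here is a minimal sketch. The table contacts_demo and its rows are hypothetical and exist only for illustration. Both columns are text, so the swapped SELECT raises no error.

```sql
-- Hypothetical demo table; names and rows are illustrative only.
CREATE TABLE contacts_demo (name VARCHAR(50), email VARCHAR(100));

INSERT INTO contacts_demo (name, email) VALUES
  ('Ada', 'ada@example.com'),
  ('Lin', 'lin@example.com');

-- The second SELECT lists the two columns in the wrong order.
-- Both are text, so the engine accepts the query.
SELECT name, email FROM contacts_demo WHERE name = 'Ada'
UNION ALL
SELECT email, name FROM contacts_demo WHERE name = 'Lin';

-- The output columns are named (name, email) after the first SELECT.
-- The 'Lin' row comes back with name = 'lin@example.com' and email = 'Lin'.
```

A mismatch like this only surfaces as an error when the swapped columns have incompatible types, which is why reviewing the column order by hand still matters.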

UNION vs UNION ALL

The only difference is deduplication. UNION removes duplicate rows. UNION ALL does not. This single difference has major performance implications on large datasets because deduplication requires sorting or hashing the entire combined result.

Behavior	UNION	UNION ALL
Duplicate handling	Removes duplicate rows from the combined result	Keeps all rows, including duplicates
Performance	Slower: requires sort or hash to deduplicate	Faster: no deduplication step
Row count	May be fewer rows than the sum of both queries	Always equals the sum of both queries' row counts
Use when	You need distinct results from overlapping datasets	Datasets do not overlap, or you want to preserve duplicates
Sort operation	Implicit sort/hash for dedup (can be expensive on large sets)	No implicit sort; output order is still not guaranteed without ORDER BY

-- Subset A: users with id <= 294
-- Subset B: users with id between 197 and 488
-- The two subsets overlap on ids 197 through 294

-- UNION: overlapping rows appear once each
SELECT user_id, username FROM users
WHERE user_id <= 294
UNION
SELECT user_id, username FROM users
WHERE user_id BETWEEN 197 AND 488;

-- UNION ALL: overlapping rows appear twice
SELECT user_id, username FROM users
WHERE user_id <= 294
UNION ALL
SELECT user_id, username FROM users
WHERE user_id BETWEEN 197 AND 488;
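
The row-count difference is easier to see on a small dataset. The sketch below assumes a hypothetical five-row table, users_demo, and counts the rows each operator returns for two overlapping subsets.

```sql
-- Hypothetical demo table with five users.
CREATE TABLE users_demo (user_id INTEGER, username VARCHAR(50));

INSERT INTO users_demo (user_id, username) VALUES
  (1, 'ana'), (2, 'ben'), (3, 'cho'), (4, 'dev'), (5, 'eli');

-- Subset A: ids 1 to 3. Subset B: ids 2 to 5. They overlap on ids 2 and 3.

-- UNION: the two overlapping users are counted once.
SELECT COUNT(*) AS union_rows
FROM (
  SELECT user_id, username FROM users_demo WHERE user_id <= 3
  UNION
  SELECT user_id, username FROM users_demo WHERE user_id BETWEEN 2 AND 5
) AS deduped;
-- union_rows = 5

-- UNION ALL: every row from both subsets is kept.
SELECT COUNT(*) AS union_all_rows
FROM (
  SELECT user_id, username FROM users_demo WHERE user_id <= 3
  UNION ALL
  SELECT user_id, username FROM users_demo WHERE user_id BETWEEN 2 AND 5
) AS stacked;
-- union_all_rows = 7 (3 rows from A plus 4 rows from B)
```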

UNION vs JOIN

UNION and JOIN both combine data from multiple tables, but they work in completely different directions. UNION stacks rows vertically. JOIN matches rows horizontally and produces wider result sets. Confusing these two is a sign of weak SQL fundamentals, so interviewers test this distinction regularly.

Aspect	UNION	JOIN
Direction	Vertical: stacks rows on top of each other	Horizontal: combines columns side by side
Column requirement	Both queries must have the same number of columns with compatible types	No column count restriction; result has columns from both tables
Row relationship	No row-level relationship; rows are independent	Rows are matched based on a condition (ON clause)
Typical use	Combining similar data from different sources or time periods	Enriching one entity with related data from another table

-- UNION: stacks rows (vertical)
-- Result: 2 columns, rows from both subsets
SELECT emp_name, department FROM employees WHERE department = 'Engineering'
UNION ALL
SELECT emp_name, department FROM employees WHERE department = 'Sales';

-- JOIN: combines columns (horizontal)
-- Result: columns from both tables, matched by key
SELECT e.emp_name, u.username
FROM employees e
JOIN users u ON e.user_id = u.user_id;

5 UNION Patterns

From basic deduplication to cross-region consolidation, schema normalization, and wrapping UNIONs with aggregation.

Basic UNION (Deduplicated)

UNION combines the results of two or more SELECT statements into a single result set and removes duplicate rows. Both queries must return the same number of columns, and the column types must be compatible. Column names come from the first SELECT statement.

-- Combine active and inactive user accounts, remove duplicates
SELECT user_id, username, email
FROM users
WHERE account_status = 'active'

UNION

SELECT user_id, username, email
FROM users
WHERE account_status = 'inactive';

-- If a user appears in both subsets with identical
-- values in all three columns, only one row is returned

UNION ALL (Keep Duplicates)

UNION ALL combines result sets without removing duplicates. It is faster than UNION because it skips the deduplication step. Use UNION ALL when you know the datasets do not overlap, or when duplicates carry meaning (like transaction logs from multiple systems).

-- Combine transactions from multiple yearly batches
SELECT transaction_id, total_amount, transaction_date, '2023' AS batch
FROM transactions
WHERE substr(transaction_date, 1, 4) = '2023'

UNION ALL

SELECT transaction_id, total_amount, transaction_date, '2024' AS batch
FROM transactions
WHERE substr(transaction_date, 1, 4) = '2024'

UNION ALL

SELECT transaction_id, total_amount, transaction_date, '2025' AS batch
FROM transactions
WHERE substr(transaction_date, 1, 4) = '2025';

-- All rows preserved; UNION ALL is the correct choice
-- because each transaction falls into exactly one yearly batch

UNION with Different Source Structures

When combining tables with different column structures, you can use NULL or constant values to fill missing columns. Each SELECT must still return the same number of columns with compatible types. This pattern is common when consolidating data from systems that store similar information in different schemas.

-- Combine contact info from two differently structured tables
SELECT
  user_id AS person_id,
  username AS name,
  email,
  'user' AS source
FROM users

UNION ALL

SELECT
  employee_id AS person_id,
  emp_name AS name,
  NULL AS email,        -- employees table has no email column
  'employee' AS source
FROM employees;

UNION for Time-Based Partitioned Tables

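Some warehouses and pipelines split a large dataset into separate tables by time period, for example one table per year or month. Querying across those tables requires UNION ALL to reassemble the full dataset. You will encounter this pattern in Snowflake, BigQuery, and Redshift environments where historical data lives in separate tables. A table that uses the engine's native partitioning is queried directly and does not need this.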

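-- Query across yearly partitions (simulated here with year filters on one table;
-- with physically separate tables, each SELECT would read its own table)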
SELECT transaction_id, user_id, total_amount, transaction_date
FROM transactions
WHERE substr(transaction_date, 1, 4) = '2023'

UNION ALL

SELECT transaction_id, user_id, total_amount, transaction_date
FROM transactions
WHERE substr(transaction_date, 1, 4) = '2024'

UNION ALL

SELECT transaction_id, user_id, total_amount, transaction_date
FROM transactions
WHERE substr(transaction_date, 1, 4) = '2025'
ORDER BY transaction_date DESC
LIMIT 1000;

-- ORDER BY and LIMIT apply to the entire UNION result
-- Wrap in parentheses if your DB requires it

UNION with Aggregation

You can wrap a UNION in a subquery and aggregate the combined result. This is useful for producing summary statistics across multiple sources. The UNION happens first (inside the subquery), then the outer query aggregates the combined rows.

-- Total profit across all regional subsets
SELECT
  region,
  SUM(profit) AS total_profit,
  COUNT(*) AS order_count
FROM (
  SELECT profit, 'US' AS region FROM orders WHERE region = 'US'
  UNION ALL
  SELECT profit, 'EU' AS region FROM orders WHERE region = 'EU'
  UNION ALL
  SELECT profit, 'APAC' AS region FROM orders WHERE region = 'APAC'
) combined
GROUP BY region
ORDER BY total_profit DESC;

Common UNION Pitfalls

These mistakes produce queries that run slower than necessary, return incorrect results without raising an error, or fail outright.

Using UNION when UNION ALL is correct

UNION deduplicates by comparing all columns. On large datasets (millions of rows), this sort/hash operation is expensive and unnecessary when the sources do not overlap.
Fix: Default to UNION ALL. Only switch to UNION when you have confirmed that duplicates exist and must be removed.
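
Beyond the performance cost, UNION can also change the answer, because it removes duplicates that already exist inside a single source, not only duplicates across sources. The sketch below uses two hypothetical tables, pos_sales_demo and web_sales_demo, in which two genuine sales happen to have identical values.

```sql
-- Hypothetical demo tables; two real in-store sales share the same values.
CREATE TABLE pos_sales_demo (store_id INTEGER, amount INTEGER);
CREATE TABLE web_sales_demo (store_id INTEGER, amount INTEGER);

INSERT INTO pos_sales_demo (store_id, amount) VALUES (1, 20), (1, 20);
INSERT INTO web_sales_demo (store_id, amount) VALUES (2, 35);

-- UNION collapses the two identical in-store rows into one.
SELECT SUM(amount) AS total_with_union
FROM (
  SELECT store_id, amount FROM pos_sales_demo
  UNION
  SELECT store_id, amount FROM web_sales_demo
) AS deduped;
-- total_with_union = 55 (one of the 20 sales is lost)

-- UNION ALL keeps every sale.
SELECT SUM(amount) AS total_with_union_all
FROM (
  SELECT store_id, amount FROM pos_sales_demo
  UNION ALL
  SELECT store_id, amount FROM web_sales_demo
) AS stacked;
-- total_with_union_all = 75
```

If the selected columns include a unique key such as a transaction id, this collapse cannot happen, but the deduplication work is still paid for.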

Swapped column order in one SELECT

UNION matches columns by position. If one SELECT returns (name, email) and another returns (email, name), the query runs without error but mixes names into the email column and vice versa.
Fix: Always list columns explicitly (never SELECT *) and verify the order matches across all SELECTs. Use column aliases in the first SELECT to name the output.

ORDER BY on individual SELECTs instead of the final result

In most databases, ORDER BY on an individual SELECT within a UNION is either ignored or causes a syntax error. The UNION operation does not guarantee row order from individual queries.
Fix: Place ORDER BY after the last SELECT in the UNION chain. If you need to limit rows from individual queries, wrap each in a subquery with its own ORDER BY and LIMIT.
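
The subquery approach described in the fix can be sketched as follows. The table orders_demo and its rows are hypothetical. The syntax shown works in PostgreSQL and SQLite; other engines may use TOP or FETCH FIRST instead of LIMIT.

```sql
-- Hypothetical demo table.
CREATE TABLE orders_demo (region VARCHAR(10), profit INTEGER);

INSERT INTO orders_demo (region, profit) VALUES
  ('US', 50), ('US', 90), ('US', 10),
  ('EU', 70), ('EU', 30), ('EU', 80);

-- Top 2 orders per region: each branch sorts and limits inside its own
-- subquery, and the final ORDER BY sorts the combined result.
SELECT region, profit
FROM (
  SELECT region, profit FROM orders_demo
  WHERE region = 'US'
  ORDER BY profit DESC
  LIMIT 2
) AS top_us
UNION ALL
SELECT region, profit
FROM (
  SELECT region, profit FROM orders_demo
  WHERE region = 'EU'
  ORDER BY profit DESC
  LIMIT 2
) AS top_eu
ORDER BY region, profit DESC;

-- Result: (EU, 80), (EU, 70), (US, 90), (US, 50)
```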

4 UNION Interview Questions

Q1: What is the difference between UNION and UNION ALL? When would you use each?

What they test: Basic set operation knowledge and performance awareness. UNION deduplicates. UNION ALL does not. The interviewer wants you to explain the performance cost of UNION and identify when UNION ALL is both correct and preferred.
Approach: UNION removes duplicate rows from the combined result, which requires a sort or hash operation. UNION ALL keeps all rows and skips that step. Use UNION ALL when the sources do not overlap (different regions, different time periods), when duplicates are meaningful (transaction logs), or when the downstream consumer handles deduplication. Default to UNION ALL in data pipelines and switch to UNION only when deduplication is explicitly required.

Q2: Can you UNION two queries that have different column names? Different column types?

What they test: Understanding of UNION column matching rules. The interviewer checks whether you know that column names come from the first SELECT and that types must be compatible but not identical.
Approach: Different column names are fine. The result set uses column names from the first SELECT. Different column types depend on the database: most databases perform implicit type coercion (INT and BIGINT, VARCHAR(50) and VARCHAR(100)). If types are incompatible (VARCHAR and DATE with no implicit cast), the query fails. Best practice: use explicit CAST to make types match and use column aliases in the first SELECT to name the output clearly.
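
As an illustration of the explicit CAST practice, the sketch below combines an integer key with a text key. The tables app_users_demo and legacy_users_demo, and the output names person_key and person_name, are hypothetical.

```sql
-- Hypothetical demo tables: one keys users by integer, the other by text code.
CREATE TABLE app_users_demo (user_id INTEGER, username VARCHAR(50));
CREATE TABLE legacy_users_demo (legacy_code VARCHAR(20), full_name VARCHAR(50));

INSERT INTO app_users_demo (user_id, username) VALUES (101, 'ana');
INSERT INTO legacy_users_demo (legacy_code, full_name) VALUES ('L-7', 'Ben Ortiz');

-- The first SELECT casts the integer key to text and names both output
-- columns, so the result is (person_key, person_name) with text keys.
SELECT CAST(user_id AS VARCHAR(20)) AS person_key,
       username                     AS person_name
FROM app_users_demo
UNION ALL
SELECT legacy_code, full_name
FROM legacy_users_demo;
```

Casting in the query, instead of relying on implicit coercion, keeps the output type the same across engines whose coercion rules differ.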

Q3: What is the difference between UNION and JOIN? When would you use each?

What they test: Whether you understand the fundamental distinction between vertical (UNION) and horizontal (JOIN) combination. Some candidates confuse these because both combine data from multiple tables. The interviewer wants a clear mental model.
Approach: UNION stacks rows vertically: it appends one result set below another. JOIN combines columns horizontally: it matches rows from different tables based on a condition and produces wider rows with columns from both tables. UNION requires matching column counts and types. JOIN requires a matching condition but has no column restrictions.

Q4: You need to combine data from 12 monthly tables into a single query. How do you approach this, and what performance considerations matter?

What they test: Real-world data engineering judgment. Monthly partitioned tables are common. The interviewer wants to hear about UNION ALL (not UNION), predicate pushdown, and whether the query optimizer can prune partitions.
Approach: Use UNION ALL because monthly tables have non-overlapping data. Add WHERE clauses to each SELECT to push filters down to individual tables (the optimizer may not do this automatically for all databases). Consider whether a view or table function can abstract the UNION ALL so consumers do not need to know about the partitioning. For very large datasets, check if the query plan applies predicates before the UNION or after. Materializing intermediate results may help if the combined dataset is too large to sort.
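
A minimal sketch of the view approach, shortened to two months for space. The tables events_2024_01_demo and events_2024_02_demo, the view events_all_demo, and the source_month column are all hypothetical names.

```sql
-- Hypothetical monthly tables (two shown; a real setup would list all twelve).
CREATE TABLE events_2024_01_demo (event_id INTEGER, user_id INTEGER, event_date VARCHAR(10));
CREATE TABLE events_2024_02_demo (event_id INTEGER, user_id INTEGER, event_date VARCHAR(10));

INSERT INTO events_2024_01_demo (event_id, user_id, event_date) VALUES
  (1, 42, '2024-01-05'), (2, 7, '2024-01-09');
INSERT INTO events_2024_02_demo (event_id, user_id, event_date) VALUES
  (3, 42, '2024-02-11'), (4, 42, '2024-02-20');

-- The view hides the fan-in; source_month records which table a row came from.
CREATE VIEW events_all_demo AS
SELECT event_id, user_id, event_date, '2024-01' AS source_month
FROM events_2024_01_demo
UNION ALL
SELECT event_id, user_id, event_date, '2024-02' AS source_month
FROM events_2024_02_demo;

-- Consumers filter the view as if it were a single table.
SELECT source_month, COUNT(*) AS events
FROM events_all_demo
WHERE user_id = 42
GROUP BY source_month
ORDER BY source_month;

-- Result: ('2024-01', 1), ('2024-02', 2)
```

Whether the user_id filter is applied inside each branch or only after the rows are combined depends on the engine, so it is worth inspecting the plan (for example with EXPLAIN, where the database supports it) before relying on the view at scale.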

Frequently asked questions

What does SQL UNION do?

UNION combines the result sets of two or more SELECT statements into a single result set. It stacks rows vertically: the rows from the first query and the rows from the second query end up in one result, in no guaranteed order unless you add ORDER BY. UNION removes duplicate rows from the combined result. Both SELECT statements must return the same number of columns, and the column data types must be compatible.

What is the difference between UNION and UNION ALL?

UNION removes duplicate rows from the combined result, which requires a sort or hash deduplication step. UNION ALL keeps all rows and does not perform deduplication. UNION ALL is faster because it skips that step. Use UNION when you need distinct rows from overlapping sources. Use UNION ALL when the sources do not overlap or when duplicates are meaningful.

Can I use ORDER BY with UNION?

Yes. Place the ORDER BY clause after the last SELECT statement in the UNION. It applies to the entire combined result set, not just the last query. Some databases require wrapping each SELECT in parentheses when using ORDER BY with UNION. You can ORDER BY column position (ORDER BY 1) or by column name from the first SELECT.

How is UNION different from JOIN?

UNION combines rows vertically: it stacks one result set on top of another. JOIN combines columns horizontally: it matches rows from two tables based on a condition and produces wider rows with columns from both tables. UNION requires the same number of columns with compatible types. JOIN has no column count restriction but requires a matching condition. Use UNION to combine similar data from different sources. Use JOIN to enrich entities with related data.
