SQL UNION: UNION vs UNION ALL vs JOIN
UNION is the vertical glue in every pipeline. Where JOIN matches rows side by side, UNION stacks them end to end. Architecturally it's the primitive behind every 'combine these regional shards' job, every 'merge archive table with hot table' query, and every historical rollup that spans partitioned storage.
UNION Syntax
Know UNION the way the interviewer who asks it knows it.
-- Basic UNION (removes duplicates)
SELECT column1, column2 FROM table_a
UNION
SELECT column1, column2 FROM table_b;
-- UNION ALL (keeps duplicates, faster)
SELECT column1, column2 FROM table_a
UNION ALL
SELECT column1, column2 FROM table_b;
-- Multiple UNIONs with ORDER BY
SELECT name, revenue, 'Q1' AS quarter FROM q1_sales
UNION ALL
SELECT name, revenue, 'Q2' AS quarter FROM q2_sales
UNION ALL
SELECT name, revenue, 'Q3' AS quarter FROM q3_sales
ORDER BY revenue DESC;| Behavior | UNION | UNION ALL |
|---|---|---|
| Duplicate handling | Removes duplicate rows from the combined result | Keeps all rows, including duplicates |
| Performance | Slower: requires sort or hash to deduplicate | Faster: no deduplication step |
| Row count | May be fewer rows than the sum of both queries | Always equals the sum of both queries' row counts |
| Use when | You need distinct results from overlapping datasets | Datasets do not overlap, or you want to preserve duplicates |
| Sort operation | Implicit sort/hash for dedup (can be expensive on large sets) | No implicit sort; rows appear in query order |
UNION vs UNION ALL
Monthly Active Users per Endpoint
One endpoint, many users. Which ones showed up?
Pulled from debriefs where SQL was the gate.
-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')
-- UNION: 3 rows (Bob appears once)
SELECT id, name FROM table_a
UNION
SELECT id, name FROM table_b;
-- 1, Alice
-- 2, Bob
-- 3, Carol
-- UNION ALL: 4 rows (Bob appears twice)
SELECT id, name FROM table_a
UNION ALL
SELECT id, name FROM table_b;
-- 1, Alice
-- 2, Bob
-- 2, Bob
-- 3, CarolUNION vs JOIN
| Aspect | UNION | JOIN |
|---|---|---|
| Direction | Vertical: stacks rows on top of each other | Horizontal: combines columns side by side |
| Column requirement | Both queries must have the same number of columns with compatible types | No column count restriction; result has columns from both tables |
| Row relationship | No row-level relationship; rows are independent | Rows are matched based on a condition (ON clause) |
| Typical use | Combining similar data from different sources or time periods | Enriching one entity with related data from another table |
5 UNION Patterns
-- UNION: stacks rows (vertical)
-- Result: 2 columns, rows from both tables
SELECT name, email FROM employees
UNION ALL
SELECT name, email FROM contractors;
-- JOIN: combines columns (horizontal)
-- Result: columns from both tables, matched by key
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;Common UNION Pitfalls
Basic UNION (Deduplicated)
UNION combines the results of two or more SELECT statements into a single result set and removes duplicate rows. Both queries must return the same number of columns, and the column types must be compatible. Column names come from the first SELECT statement.
-- Combine active and archived customers, remove duplicates
SELECT customer_id, name, email
FROM active_customers
UNION
SELECT customer_id, name, email
FROM archived_customers;
-- If a customer appears in both tables with identical
-- values in all three columns, only one row is returnedUNION ALL (Keep Duplicates)
UNION ALL combines result sets without removing duplicates. It is faster than UNION because it skips the deduplication step. Use UNION ALL when you know the datasets do not overlap, or when duplicates carry meaning (like transaction logs from multiple systems).
-- Combine transactions from multiple regional databases
SELECT txn_id, amount, txn_date, 'us_east' AS region
FROM us_east.transactions
UNION ALL
SELECT txn_id, amount, txn_date, 'us_west' AS region
FROM us_west.transactions
UNION ALL
SELECT txn_id, amount, txn_date, 'eu' AS region
FROM eu.transactions;
-- All rows preserved; UNION ALL is the correct choice
-- because txn_id is unique within each regionUNION with Different Source Structures
When combining tables with different column structures, you can use NULL or constant values to fill missing columns. Each SELECT must still return the same number of columns with compatible types. This pattern is common when consolidating data from systems that store similar information in different schemas.
-- Combine contact info from two differently structured tables
SELECT
employee_id AS person_id,
full_name AS name,
work_email AS email,
'employee' AS source
FROM employees
UNION ALL
SELECT
contractor_id AS person_id,
contractor_name AS name,
NULL AS email, -- contractors table has no email column
'contractor' AS source
FROM contractors;UNION for Time-Based Partitioned Tables
Many data warehouses partition large tables by time period. Querying across partitions requires UNION ALL to reassemble the full dataset. This is a real-world pattern you will encounter in Snowflake, BigQuery, and Redshift environments where historical data lives in separate tables.
-- Query across yearly partitions
SELECT order_id, customer_id, amount, order_date
FROM orders_2023
UNION ALL
SELECT order_id, customer_id, amount, order_date
FROM orders_2024
UNION ALL
SELECT order_id, customer_id, amount, order_date
FROM orders_2025
ORDER BY order_date DESC
LIMIT 1000;
-- ORDER BY and LIMIT apply to the entire UNION result
-- Wrap in parentheses if your DB requires itUNION with Aggregation
You can wrap a UNION in a subquery and aggregate the combined result. This is useful for producing summary statistics across multiple sources. The UNION happens first (inside the subquery), then the outer query aggregates the combined rows.
-- Total revenue across all regional databases
SELECT
region,
SUM(amount) AS total_revenue,
COUNT(*) AS transaction_count
FROM (
SELECT amount, 'North' AS region FROM north_sales
UNION ALL
SELECT amount, 'South' AS region FROM south_sales
UNION ALL
SELECT amount, 'West' AS region FROM west_sales
) combined
GROUP BY region
ORDER BY total_revenue DESC;Using UNION when UNION ALL is correct
- UNION deduplicates by comparing all columns. On large datasets (millions of rows), this sort/hash operation is expensive and unnecessary when the sources do not overlap.
- Fix: Default to UNION ALL. Only switch to UNION when you have confirmed that duplicates exist and must be removed.
Swapped column order in one SELECT
- UNION matches columns by position. If one SELECT returns (name, email) and another returns (email, name), the query runs without error but mixes names into the email column and vice versa.
- Fix: Always list columns explicitly (never SELECT *) and verify the order matches across all SELECTs. Use column aliases in the first SELECT to name the output.
ORDER BY on individual SELECTs instead of the final result
- In most databases, ORDER BY on an individual SELECT within a UNION is either ignored or causes a syntax error. The UNION operation does not guarantee row order from individual queries.
- Fix: Place ORDER BY after the last SELECT in the UNION chain. If you need to limit rows from individual queries, wrap each in a subquery with its own ORDER BY and LIMIT.
4 UNION Interview Questions
Q1: What is the difference between UNION and UNION ALL? When would you use each?
What they test: Basic set operation knowledge and performance awareness. UNION deduplicates. UNION ALL does not. The interviewer wants you to explain the performance cost of UNION and identify when UNION ALL is both correct and preferred. Approach: UNION removes duplicate rows from the combined result, which requires a sort or hash operation. UNION ALL keeps all rows and skips that step. Use UNION ALL when the sources do not overlap (different regions, different time periods), when duplicates are meaningful (transaction logs), or when the downstream consumer handles deduplication. Default to UNION ALL in data pipelines and switch to UNION only when deduplication is explicitly required.
Q2: Can you UNION two queries that have different column names? Different column types?
What they test: Understanding of UNION column matching rules. The interviewer checks whether you know that column names come from the first SELECT and that types must be compatible but not identical. Approach: Different column names are fine. The result set uses column names from the first SELECT. Different column types depend on the database: most databases perform implicit type coercion (INT and BIGINT, VARCHAR(50) and VARCHAR(100)). If types are incompatible (VARCHAR and DATE with no implicit cast), the query fails. Best practice: use explicit CAST to make types match and use column aliases in the first SELECT to name the output clearly.
Q3: What is the difference between UNION and JOIN? When would you use each?
What they test: Whether you understand the fundamental distinction between vertical (UNION) and horizontal (JOIN) combination. Some candidates confuse these because both combine data from multiple tables. The interviewer wants a clear mental model. Approach: UNION stacks rows vertically: it appends one result set below another. JOIN combines columns horizontally: it matches rows from different tables based on a condition and produces wider rows with columns from both tables. UNION requires matching column counts and types. JOIN requires a matching condition but has no column restrictions.
Q4: You need to combine data from 12 monthly tables into a single query. How do you approach this, and what performance considerations matter?
What they test: Real-world data engineering judgment. Monthly partitioned tables are common. The interviewer wants to hear about UNION ALL (not UNION), predicate pushdown, and whether the query optimizer can prune partitions. Approach: Use UNION ALL because monthly tables have non-overlapping data. Add WHERE clauses to each SELECT to push filters down to individual tables (the optimizer may not do this automatically for all databases). Consider whether a view or table function can abstract the UNION ALL so consumers do not need to know about the partitioning. For very large datasets, check if the query plan applies predicates before the UNION or after. Materializing intermediate results may help if the combined dataset is too large to sort.
Frequently asked questions
What does SQL UNION do?+
What is the difference between UNION and UNION ALL?+
Can I use ORDER BY with UNION?+
How is UNION different from JOIN?+
The Fan-In Primitive Every Pipeline Eventually Needs
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition