SQL Reference

SQL UNION Explained

UNION is the vertical glue in every pipeline. Where JOIN matches rows side by side, UNION stacks them end to end. Architecturally it's the primitive behind every "combine these regional shards" job, every "merge archive table with hot table" query, and every historical rollup that spans partitioned storage.

Set operations show up in roughly 8% of pipeline-focused interview rounds in our corpus. Lower frequency than the join family, but the questions lean heavier on system-design framing. Where this fits in a warehouse: the unified view over partition-by-month tables, the CDC-plus-backfill reconciliation query, the multi-tenant fan-in. This page walks syntax, the UNION versus UNION ALL trade-off, and the architectural patterns interviewers probe.

~8%

Pipeline rounds with set ops

2

Flavors (UNION vs UNION ALL)

5

Architecture patterns

N

Shards you can fan in

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

UNION Syntax

The contract is strict by design. Each SELECT contributes the same column count in the same positional order, and the engine reconciles compatible types via implicit casts. The first SELECT wins the column names, which is why every pipeline author learns to put the canonical schema at the top. It's a tiny convention with an outsized effect on downstream views.

-- Basic UNION (removes duplicates)
SELECT column1, column2 FROM table_a
UNION
SELECT column1, column2 FROM table_b;

-- UNION ALL (keeps duplicates, faster)
SELECT column1, column2 FROM table_a
UNION ALL
SELECT column1, column2 FROM table_b;

-- Multiple UNIONs with ORDER BY
SELECT name, revenue, 'Q1' AS quarter FROM q1_sales
UNION ALL
SELECT name, revenue, 'Q2' AS quarter FROM q2_sales
UNION ALL
SELECT name, revenue, 'Q3' AS quarter FROM q3_sales
ORDER BY revenue DESC;

Column matching rule: Columns are matched by position, not by name. The first column of each SELECT is combined, then the second, and so on. If you accidentally swap two columns in one SELECT, the query still runs but produces incorrect results.

UNION vs UNION ALL

The only difference is deduplication. UNION removes duplicate rows. UNION ALL does not. This single difference has major performance implications on large datasets because deduplication requires sorting or hashing the entire combined result.

BehaviorUNIONUNION ALL
Duplicate handlingRemoves duplicate rows from the combined resultKeeps all rows, including duplicates
PerformanceSlower: requires sort or hash to deduplicateFaster: no deduplication step
Row countMay be fewer rows than the sum of both queriesAlways equals the sum of both queries' row counts
Use whenYou need distinct results from overlapping datasetsDatasets do not overlap, or you want to preserve duplicates
Sort operationImplicit sort/hash for dedup (can be expensive on large sets)No implicit sort; rows appear in query order
-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')

-- UNION: 3 rows (Bob appears once)
SELECT id, name FROM table_a
UNION
SELECT id, name FROM table_b;
-- 1, Alice
-- 2, Bob
-- 3, Carol

-- UNION ALL: 4 rows (Bob appears twice)
SELECT id, name FROM table_a
UNION ALL
SELECT id, name FROM table_b;
-- 1, Alice
-- 2, Bob
-- 2, Bob
-- 3, Carol

UNION vs JOIN

UNION and JOIN both combine data from multiple tables, but they work in completely different directions. UNION stacks rows vertically. JOIN matches rows horizontally and produces wider result sets. Confusing these two is a sign of weak SQL fundamentals, so interviewers test this distinction regularly.

AspectUNIONJOIN
DirectionVertical: stacks rows on top of each otherHorizontal: combines columns side by side
Column requirementBoth queries must have the same number of columns with compatible typesNo column count restriction; result has columns from both tables
Row relationshipNo row-level relationship; rows are independentRows are matched based on a condition (ON clause)
Typical useCombining similar data from different sources or time periodsEnriching one entity with related data from another table
-- UNION: stacks rows (vertical)
-- Result: 2 columns, rows from both tables
SELECT name, email FROM employees
UNION ALL
SELECT name, email FROM contractors;

-- JOIN: combines columns (horizontal)
-- Result: columns from both tables, matched by key
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

5 UNION Patterns

From basic deduplication to cross-region consolidation, schema normalization, and wrapping UNIONs with aggregation.

Basic UNION (Deduplicated)

UNION combines the results of two or more SELECT statements into a single result set and removes duplicate rows. Both queries must return the same number of columns, and the column types must be compatible. Column names come from the first SELECT statement.

-- Combine active and archived customers, remove duplicates
SELECT customer_id, name, email
FROM active_customers

UNION

SELECT customer_id, name, email
FROM archived_customers;

-- If a customer appears in both tables with identical
-- values in all three columns, only one row is returned

UNION ALL (Keep Duplicates)

UNION ALL combines result sets without removing duplicates. It is faster than UNION because it skips the deduplication step. Use UNION ALL when you know the datasets do not overlap, or when duplicates carry meaning (like transaction logs from multiple systems).

-- Combine transactions from multiple regional databases
SELECT txn_id, amount, txn_date, 'us_east' AS region
FROM us_east.transactions

UNION ALL

SELECT txn_id, amount, txn_date, 'us_west' AS region
FROM us_west.transactions

UNION ALL

SELECT txn_id, amount, txn_date, 'eu' AS region
FROM eu.transactions;

-- All rows preserved; UNION ALL is the correct choice
-- because txn_id is unique within each region

UNION with Different Source Structures

When combining tables with different column structures, you can use NULL or constant values to fill missing columns. Each SELECT must still return the same number of columns with compatible types. This pattern is common when consolidating data from systems that store similar information in different schemas.

-- Combine contact info from two differently structured tables
SELECT
  employee_id AS person_id,
  full_name AS name,
  work_email AS email,
  'employee' AS source
FROM employees

UNION ALL

SELECT
  contractor_id AS person_id,
  contractor_name AS name,
  NULL AS email,        -- contractors table has no email column
  'contractor' AS source
FROM contractors;

UNION for Time-Based Partitioned Tables

Many data warehouses partition large tables by time period. Querying across partitions requires UNION ALL to reassemble the full dataset. This is a real-world pattern you will encounter in Snowflake, BigQuery, and Redshift environments where historical data lives in separate tables.

-- Query across yearly partitions
SELECT order_id, customer_id, amount, order_date
FROM orders_2023

UNION ALL

SELECT order_id, customer_id, amount, order_date
FROM orders_2024

UNION ALL

SELECT order_id, customer_id, amount, order_date
FROM orders_2025
ORDER BY order_date DESC
LIMIT 1000;

-- ORDER BY and LIMIT apply to the entire UNION result
-- Wrap in parentheses if your DB requires it

UNION with Aggregation

You can wrap a UNION in a subquery and aggregate the combined result. This is useful for producing summary statistics across multiple sources. The UNION happens first (inside the subquery), then the outer query aggregates the combined rows.

-- Total revenue across all regional databases
SELECT
  region,
  SUM(amount) AS total_revenue,
  COUNT(*) AS transaction_count
FROM (
  SELECT amount, 'North' AS region FROM north_sales
  UNION ALL
  SELECT amount, 'South' AS region FROM south_sales
  UNION ALL
  SELECT amount, 'West' AS region FROM west_sales
) combined
GROUP BY region
ORDER BY total_revenue DESC;

Common UNION Pitfalls

These mistakes produce queries that either fail silently or return incorrect results.

Using UNION when UNION ALL is correct

UNION deduplicates by comparing all columns. On large datasets (millions of rows), this sort/hash operation is expensive and unnecessary when the sources do not overlap. A pipeline that combines daily partitions should always use UNION ALL because the same row cannot exist in two different daily tables.

Fix: Default to UNION ALL. Only switch to UNION when you have confirmed that duplicates exist and must be removed.

Swapped column order in one SELECT

UNION matches columns by position. If one SELECT returns (name, email) and another returns (email, name), the query runs without error but mixes names into the email column and vice versa. This bug is invisible without manual inspection of the results.

Fix: Always list columns explicitly (never SELECT *) and verify the order matches across all SELECTs. Use column aliases in the first SELECT to name the output.

ORDER BY on individual SELECTs instead of the final result

In most databases, ORDER BY on an individual SELECT within a UNION is either ignored or causes a syntax error. The UNION operation does not guarantee row order from individual queries. Only an ORDER BY after the final SELECT applies to the combined result.

Fix: Place ORDER BY after the last SELECT in the UNION chain. If you need to limit rows from individual queries, wrap each in a subquery with its own ORDER BY and LIMIT.

4 UNION Interview Questions

These questions test set operation fundamentals, performance awareness, and the ability to choose between UNION and JOIN.

Q1: What is the difference between UNION and UNION ALL? When would you use each?

What they test:

Basic set operation knowledge and performance awareness. UNION deduplicates. UNION ALL does not. The interviewer wants you to explain the performance cost of UNION and identify when UNION ALL is both correct and preferred.

Approach:

UNION removes duplicate rows from the combined result, which requires a sort or hash operation. UNION ALL keeps all rows and skips that step. Use UNION ALL when the sources do not overlap (different regions, different time periods), when duplicates are meaningful (transaction logs), or when the downstream consumer handles deduplication. Default to UNION ALL in data pipelines and switch to UNION only when deduplication is explicitly required.

Q2: Can you UNION two queries that have different column names? Different column types?

What they test:

Understanding of UNION column matching rules. The interviewer checks whether you know that column names come from the first SELECT and that types must be compatible but not identical.

Approach:

Different column names are fine. The result set uses column names from the first SELECT. Different column types depend on the database: most databases perform implicit type coercion (INT and BIGINT, VARCHAR(50) and VARCHAR(100)). If types are incompatible (VARCHAR and DATE with no implicit cast), the query fails. Best practice: use explicit CAST to make types match and use column aliases in the first SELECT to name the output clearly.

Q3: What is the difference between UNION and JOIN? When would you use each?

What they test:

Whether you understand the fundamental distinction between vertical (UNION) and horizontal (JOIN) combination. Some candidates confuse these because both combine data from multiple tables. The interviewer wants a clear mental model.

Approach:

UNION stacks rows vertically: it appends one result set below another. JOIN combines columns horizontally: it matches rows from different tables based on a condition. Use UNION to combine similar records from different sources (transactions from multiple regions). Use JOIN to enrich one entity with related data (customers with their orders). UNION requires matching column counts and types. JOIN requires a matching condition but has no column restrictions.

Q4: You need to combine data from 12 monthly tables into a single query. How do you approach this, and what performance considerations matter?

What they test:

Real-world data engineering judgment. Monthly partitioned tables are common. The interviewer wants to hear about UNION ALL (not UNION), predicate pushdown, and whether the query optimizer can prune partitions.

Approach:

Use UNION ALL because monthly tables have non-overlapping data. Add WHERE clauses to each SELECT to push filters down to individual tables (the optimizer may not do this automatically for all databases). Consider whether a view or table function can abstract the UNION ALL so consumers do not need to know about the partitioning. For very large datasets, check if the query plan applies predicates before the UNION or after. Materializing intermediate results may help if the combined dataset is too large to sort.

SQL UNION FAQ

What does SQL UNION do?+
UNION combines the result sets of two or more SELECT statements into a single result set. It stacks rows vertically: all rows from the first query, followed by all rows from the second query. UNION removes duplicate rows from the combined result. Both SELECT statements must return the same number of columns, and the column data types must be compatible.
What is the difference between UNION and UNION ALL?+
UNION removes duplicate rows from the combined result, which requires a sort or hash deduplication step. UNION ALL keeps all rows and does not perform deduplication. UNION ALL is faster because it skips that step. Use UNION when you need distinct rows from overlapping sources. Use UNION ALL when the sources do not overlap or when duplicates are meaningful.
Can I use ORDER BY with UNION?+
Yes. Place the ORDER BY clause after the last SELECT statement in the UNION. It applies to the entire combined result set, not just the last query. Some databases require wrapping each SELECT in parentheses when using ORDER BY with UNION. You can ORDER BY column position (ORDER BY 1) or by column name from the first SELECT.
How is UNION different from JOIN?+
UNION combines rows vertically: it stacks one result set on top of another. JOIN combines columns horizontally: it matches rows from two tables based on a condition and produces wider rows with columns from both tables. UNION requires the same number of columns with compatible types. JOIN has no column count restriction but requires a matching condition. Use UNION to combine similar data from different sources. Use JOIN to enrich entities with related data.

The Fan-In Primitive Every Pipeline Eventually Needs

Build the unified-view query from memory, explain the sort cost of the dedup path, and defend your choice of ALL versus plain UNION. That's the architecture conversation.