UNION is the vertical glue in every pipeline. Where JOIN matches rows side by side, UNION stacks them end to end. Architecturally it's the primitive behind every "combine these regional shards" job, every "merge archive table with hot table" query, and every historical rollup that spans partitioned storage.
Set operations show up in roughly 8% of pipeline-focused interview rounds in our corpus. Lower frequency than the join family, but the questions lean heavier on system-design framing. Where this fits in a warehouse: the unified view over partition-by-month tables, the CDC-plus-backfill reconciliation query, the multi-tenant fan-in. This page walks syntax, the UNION versus UNION ALL trade-off, and the architectural patterns interviewers probe.
Pipeline rounds with set ops
Flavors (UNION vs UNION ALL)
Architecture patterns
Shards you can fan in
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
The contract is strict by design. Each SELECT contributes the same column count in the same positional order, and the engine reconciles compatible types via implicit casts. The first SELECT wins the column names, which is why every pipeline author learns to put the canonical schema at the top. It's a tiny convention with an outsized effect on downstream views.
-- Basic UNION (removes duplicates)
SELECT column1, column2 FROM table_a
UNION
SELECT column1, column2 FROM table_b;
-- UNION ALL (keeps duplicates, faster)
SELECT column1, column2 FROM table_a
UNION ALL
SELECT column1, column2 FROM table_b;
-- Multiple UNIONs with ORDER BY
SELECT name, revenue, 'Q1' AS quarter FROM q1_sales
UNION ALL
SELECT name, revenue, 'Q2' AS quarter FROM q2_sales
UNION ALL
SELECT name, revenue, 'Q3' AS quarter FROM q3_sales
ORDER BY revenue DESC;Column matching rule: Columns are matched by position, not by name. The first column of each SELECT is combined, then the second, and so on. If you accidentally swap two columns in one SELECT, the query still runs but produces incorrect results.
The only difference is deduplication. UNION removes duplicate rows. UNION ALL does not. This single difference has major performance implications on large datasets because deduplication requires sorting or hashing the entire combined result.
| Behavior | UNION | UNION ALL |
|---|---|---|
| Duplicate handling | Removes duplicate rows from the combined result | Keeps all rows, including duplicates |
| Performance | Slower: requires sort or hash to deduplicate | Faster: no deduplication step |
| Row count | May be fewer rows than the sum of both queries | Always equals the sum of both queries' row counts |
| Use when | You need distinct results from overlapping datasets | Datasets do not overlap, or you want to preserve duplicates |
| Sort operation | Implicit sort/hash for dedup (can be expensive on large sets) | No implicit sort; rows appear in query order |
-- Table A: (1, 'Alice'), (2, 'Bob')
-- Table B: (2, 'Bob'), (3, 'Carol')
-- UNION: 3 rows (Bob appears once)
SELECT id, name FROM table_a
UNION
SELECT id, name FROM table_b;
-- 1, Alice
-- 2, Bob
-- 3, Carol
-- UNION ALL: 4 rows (Bob appears twice)
SELECT id, name FROM table_a
UNION ALL
SELECT id, name FROM table_b;
-- 1, Alice
-- 2, Bob
-- 2, Bob
-- 3, CarolUNION and JOIN both combine data from multiple tables, but they work in completely different directions. UNION stacks rows vertically. JOIN matches rows horizontally and produces wider result sets. Confusing these two is a sign of weak SQL fundamentals, so interviewers test this distinction regularly.
| Aspect | UNION | JOIN |
|---|---|---|
| Direction | Vertical: stacks rows on top of each other | Horizontal: combines columns side by side |
| Column requirement | Both queries must have the same number of columns with compatible types | No column count restriction; result has columns from both tables |
| Row relationship | No row-level relationship; rows are independent | Rows are matched based on a condition (ON clause) |
| Typical use | Combining similar data from different sources or time periods | Enriching one entity with related data from another table |
-- UNION: stacks rows (vertical)
-- Result: 2 columns, rows from both tables
SELECT name, email FROM employees
UNION ALL
SELECT name, email FROM contractors;
-- JOIN: combines columns (horizontal)
-- Result: columns from both tables, matched by key
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;From basic deduplication to cross-region consolidation, schema normalization, and wrapping UNIONs with aggregation.
UNION combines the results of two or more SELECT statements into a single result set and removes duplicate rows. Both queries must return the same number of columns, and the column types must be compatible. Column names come from the first SELECT statement.
-- Combine active and archived customers, remove duplicates
SELECT customer_id, name, email
FROM active_customers
UNION
SELECT customer_id, name, email
FROM archived_customers;
-- If a customer appears in both tables with identical
-- values in all three columns, only one row is returnedUNION ALL combines result sets without removing duplicates. It is faster than UNION because it skips the deduplication step. Use UNION ALL when you know the datasets do not overlap, or when duplicates carry meaning (like transaction logs from multiple systems).
-- Combine transactions from multiple regional databases
SELECT txn_id, amount, txn_date, 'us_east' AS region
FROM us_east.transactions
UNION ALL
SELECT txn_id, amount, txn_date, 'us_west' AS region
FROM us_west.transactions
UNION ALL
SELECT txn_id, amount, txn_date, 'eu' AS region
FROM eu.transactions;
-- All rows preserved; UNION ALL is the correct choice
-- because txn_id is unique within each regionWhen combining tables with different column structures, you can use NULL or constant values to fill missing columns. Each SELECT must still return the same number of columns with compatible types. This pattern is common when consolidating data from systems that store similar information in different schemas.
-- Combine contact info from two differently structured tables
SELECT
employee_id AS person_id,
full_name AS name,
work_email AS email,
'employee' AS source
FROM employees
UNION ALL
SELECT
contractor_id AS person_id,
contractor_name AS name,
NULL AS email, -- contractors table has no email column
'contractor' AS source
FROM contractors;Many data warehouses partition large tables by time period. Querying across partitions requires UNION ALL to reassemble the full dataset. This is a real-world pattern you will encounter in Snowflake, BigQuery, and Redshift environments where historical data lives in separate tables.
-- Query across yearly partitions
SELECT order_id, customer_id, amount, order_date
FROM orders_2023
UNION ALL
SELECT order_id, customer_id, amount, order_date
FROM orders_2024
UNION ALL
SELECT order_id, customer_id, amount, order_date
FROM orders_2025
ORDER BY order_date DESC
LIMIT 1000;
-- ORDER BY and LIMIT apply to the entire UNION result
-- Wrap in parentheses if your DB requires itYou can wrap a UNION in a subquery and aggregate the combined result. This is useful for producing summary statistics across multiple sources. The UNION happens first (inside the subquery), then the outer query aggregates the combined rows.
-- Total revenue across all regional databases
SELECT
region,
SUM(amount) AS total_revenue,
COUNT(*) AS transaction_count
FROM (
SELECT amount, 'North' AS region FROM north_sales
UNION ALL
SELECT amount, 'South' AS region FROM south_sales
UNION ALL
SELECT amount, 'West' AS region FROM west_sales
) combined
GROUP BY region
ORDER BY total_revenue DESC;These mistakes produce queries that either fail silently or return incorrect results.
UNION deduplicates by comparing all columns. On large datasets (millions of rows), this sort/hash operation is expensive and unnecessary when the sources do not overlap. A pipeline that combines daily partitions should always use UNION ALL because the same row cannot exist in two different daily tables.
Fix: Default to UNION ALL. Only switch to UNION when you have confirmed that duplicates exist and must be removed.
UNION matches columns by position. If one SELECT returns (name, email) and another returns (email, name), the query runs without error but mixes names into the email column and vice versa. This bug is invisible without manual inspection of the results.
Fix: Always list columns explicitly (never SELECT *) and verify the order matches across all SELECTs. Use column aliases in the first SELECT to name the output.
In most databases, ORDER BY on an individual SELECT within a UNION is either ignored or causes a syntax error. The UNION operation does not guarantee row order from individual queries. Only an ORDER BY after the final SELECT applies to the combined result.
Fix: Place ORDER BY after the last SELECT in the UNION chain. If you need to limit rows from individual queries, wrap each in a subquery with its own ORDER BY and LIMIT.
These questions test set operation fundamentals, performance awareness, and the ability to choose between UNION and JOIN.
What they test:
Basic set operation knowledge and performance awareness. UNION deduplicates. UNION ALL does not. The interviewer wants you to explain the performance cost of UNION and identify when UNION ALL is both correct and preferred.
Approach:
UNION removes duplicate rows from the combined result, which requires a sort or hash operation. UNION ALL keeps all rows and skips that step. Use UNION ALL when the sources do not overlap (different regions, different time periods), when duplicates are meaningful (transaction logs), or when the downstream consumer handles deduplication. Default to UNION ALL in data pipelines and switch to UNION only when deduplication is explicitly required.
What they test:
Understanding of UNION column matching rules. The interviewer checks whether you know that column names come from the first SELECT and that types must be compatible but not identical.
Approach:
Different column names are fine. The result set uses column names from the first SELECT. Different column types depend on the database: most databases perform implicit type coercion (INT and BIGINT, VARCHAR(50) and VARCHAR(100)). If types are incompatible (VARCHAR and DATE with no implicit cast), the query fails. Best practice: use explicit CAST to make types match and use column aliases in the first SELECT to name the output clearly.
What they test:
Whether you understand the fundamental distinction between vertical (UNION) and horizontal (JOIN) combination. Some candidates confuse these because both combine data from multiple tables. The interviewer wants a clear mental model.
Approach:
UNION stacks rows vertically: it appends one result set below another. JOIN combines columns horizontally: it matches rows from different tables based on a condition. Use UNION to combine similar records from different sources (transactions from multiple regions). Use JOIN to enrich one entity with related data (customers with their orders). UNION requires matching column counts and types. JOIN requires a matching condition but has no column restrictions.
What they test:
Real-world data engineering judgment. Monthly partitioned tables are common. The interviewer wants to hear about UNION ALL (not UNION), predicate pushdown, and whether the query optimizer can prune partitions.
Approach:
Use UNION ALL because monthly tables have non-overlapping data. Add WHERE clauses to each SELECT to push filters down to individual tables (the optimizer may not do this automatically for all databases). Consider whether a view or table function can abstract the UNION ALL so consumers do not need to know about the partitioning. For very large datasets, check if the query plan applies predicates before the UNION or after. Materializing intermediate results may help if the combined dataset is too large to sort.
Build the unified-view query from memory, explain the sort cost of the dedup path, and defend your choice of ALL versus plain UNION. That's the architecture conversation.