Every non-trivial pipeline query you've ever shipped is a tree. Subqueries are how you build that tree inline. In our corpus they appear in 19% of verified rounds and CTEs in 8%, which tells you most interviewers still reach for the older construct when the problem gets layered.
The architecture question isn't "can you write a subquery." It's where a subquery fits in your query plan versus a derived table versus a CTE versus a join. Each choice has a different shape in the optimizer's output and a different failure mode in production. This guide maps all four against the patterns interviewers actually ask.
Questions using subqueries
CTE frequency
Onsite SQL rounds analyzed
Subquery shapes to know cold
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
A subquery is a nested SELECT that feeds a parent query. In a query plan it shows up as a subplan node, materialized once or executed per row depending on whether it's correlated. Three shapes map to every use case you'll see in a pipeline or an interview whiteboard.
Returns exactly one row and one column. Used wherever a single value is expected: in the SELECT list, in a WHERE comparison, or as a default value. If the subquery returns more than one row, the engine throws an error.
Returns multiple rows (and optionally multiple columns). Used with IN, EXISTS, ANY, ALL, or in the FROM clause as a derived table. This is the most common category in interviews.
References a column from the outer query, so it re-executes for every row. More powerful than non-correlated subqueries, but potentially slower. Required for row-by-row comparisons like “find employees earning above their department average.”
Each type solves a different class of problem. Interviewers expect you to pick the right one without prompting.
Find all employees paid more than the company average. The subquery computes one number; the outer query filters against it. This runs once because the inner SELECT has no reference to the outer table.
SELECT name, salary, department_id
FROM employees
WHERE salary > (
SELECT AVG(salary)
FROM employees
);Interview note: This is the most common subquery question at phone-screen level. The interviewer checks whether you know to use a scalar subquery vs. a self-join. Both work, but the subquery is 3 lines shorter.
Find customers who placed at least one order in the last 30 days. The subquery returns a list of customer_ids. The outer query filters against that list. The optimizer typically converts this into a semi-join.
SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
);Interview note: If the interviewer asks you to rewrite this without IN, use EXISTS. Both produce identical execution plans in PostgreSQL 14+ and Snowflake. The choice is style, not performance.
EXISTS returns TRUE the moment the subquery finds its first matching row. It does not need to scan the entire inner result set. This makes EXISTS slightly more efficient than IN when the inner table is large and unindexed.
SELECT d.department_name
FROM departments d
WHERE EXISTS (
SELECT 1
FROM employees e
WHERE e.department_id = d.id
AND e.hire_date >= '2024-01-01'
);Interview note: “SELECT 1” inside EXISTS is convention. The database ignores the SELECT list entirely. You could write SELECT * or SELECT 42. It does not affect correctness or performance.
Place a subquery in the FROM clause to create an inline table. This is useful when you need to aggregate first, then filter or join the aggregated result. Before CTEs became widespread, derived tables were the primary way to layer transformations.
SELECT dept_name, avg_salary
FROM (
SELECT
d.department_name AS dept_name,
AVG(e.salary) AS avg_salary
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY d.department_name
) AS dept_averages
WHERE avg_salary > 85000;Interview note: Most interviewers prefer the CTE version of this pattern. But some older codebases and Redshift workloads still use derived tables heavily. Know both.
Find employees whose salary exceeds their own department's average. The inner query references e.department_id from the outer query, so it re-runs for each row. This is the only subquery type that cannot be replaced by a non-correlated approach without restructuring the logic.
SELECT e.name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
);Interview note: This is a top-5 SQL interview question. The faster alternative uses a CTE or window function (AVG() OVER (PARTITION BY department_id)), but interviewers sometimes specifically ask for the correlated version to test whether you understand row-by-row execution.
Interviewers frequently ask when to pick one over the other. The answer depends on readability, reuse, and what columns you need in the output.
| Criteria | Subquery | CTE | JOIN |
|---|---|---|---|
| Readability at 1 level | Good | Good | Good |
| Readability at 3+ levels | Poor (deep nesting) | Best (named steps) | Moderate |
| Reuse same result twice | Must duplicate | Reference by name | Must duplicate |
| Columns in output | Outer table only | Any CTE column | Both tables |
| Existence checks | EXISTS / NOT EXISTS | Possible but verbose | LEFT JOIN + IS NULL |
| Recursion | Not supported | WITH RECURSIVE | Not supported |
Find departments where the average salary exceeds 90,000.
Subquery (derived table)
SELECT dept_name, avg_sal
FROM (
SELECT department_name AS dept_name, AVG(salary) AS avg_sal
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY department_name
) sub
WHERE avg_sal > 90000;CTE
WITH dept_avg AS (
SELECT department_name, AVG(salary) AS avg_sal
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY department_name
)
SELECT department_name, avg_sal
FROM dept_avg
WHERE avg_sal > 90000;JOIN with HAVING
SELECT d.department_name, AVG(e.salary) AS avg_sal
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY d.department_name
HAVING AVG(e.salary) > 90000;Interview note: All three produce identical results. The HAVING approach is shortest for this specific case. But the moment you need a second filter on the aggregated data (e.g., rank departments), you will need the CTE or subquery approach. Show the interviewer you know the tradeoffs.
A correlated subquery depends on values from the outer query. The database cannot run it once and cache the result. It re-evaluates the inner SELECT for every row the outer query produces. This has real performance consequences, but there are problems that genuinely require it.
For a table with 10,000 rows, a correlated subquery runs the inner SELECT up to 10,000 times. The optimizer may cache results for repeated input values (a technique called “subquery result caching”), but you should not count on it. Measure with EXPLAIN ANALYZE.
-- For each order, find the most recent order from the same customer
SELECT o.order_id, o.customer_id, o.order_date, o.total
FROM orders o
WHERE o.order_date = (
SELECT MAX(o2.order_date)
FROM orders o2
WHERE o2.customer_id = o.customer_id
);Small tables (under 10K rows): Correlated subqueries run fine. The optimizer handles them without issues. Do not over-optimize in an interview.
Medium tables (10K to 1M rows): Performance depends on indexing. If the correlated column (e.g., customer_id) is indexed, the inner query uses an index seek for each iteration. Without an index, you get a full scan per row.
Large tables (1M+ rows): Rewrite to a window function or a CTE with a join. The correlated approach becomes impractical. In interviews, mentioning this threshold shows production awareness.
The same “latest order per customer” problem rewritten with ROW_NUMBER. One pass over the data instead of N correlated lookups.
WITH ranked AS (
SELECT
order_id, customer_id, order_date, total,
ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY order_date DESC
) AS rn
FROM orders
)
SELECT order_id, customer_id, order_date, total
FROM ranked
WHERE rn = 1;Interview note: If the interviewer asks for a correlated subquery, write it. Then say: “For tables over a few hundred thousand rows, I would use ROW_NUMBER with a CTE instead.” This demonstrates you know both patterns and understand the performance boundary.
These cover the range from phone screen to onsite. Each tests a different aspect of subquery knowledge.
What they test:
Basic scalar subquery syntax. Can you write a WHERE clause that compares against an aggregated value? This is the simplest subquery question and appears in nearly every phone screen that covers SQL.
Approach:
WHERE salary > (SELECT AVG(salary) FROM employees). One scalar subquery. Do not overthink it. State that the subquery runs once and returns a single number.
What they test:
Anti-join pattern. Three valid approaches exist: NOT IN, NOT EXISTS, and LEFT JOIN with IS NULL. The interviewer wants to see if you know the NULL trap with NOT IN.
Approach:
Use NOT EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id). Mention that NOT IN breaks if the subquery returns any NULLs. This detail alone sets you apart from most candidates.
What they test:
Correlated subquery vs. window function. The interviewer is checking whether you can write the correlated version AND whether you know the window function alternative.
Approach:
Correlated: WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id). Then offer the window function rewrite using AVG() OVER (PARTITION BY department_id).
What they test:
Nested or correlated subquery for ranking. This is a classic “Nth highest” problem. The subquery approach counts how many distinct salaries are higher.
Approach:
Use DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) in a CTE, then filter WHERE rank = 2. If forced to use subqueries only, count distinct higher salaries: WHERE (SELECT COUNT(DISTINCT salary) FROM employees e2 WHERE e2.department_id = e.department_id AND e2.salary > e.salary) = 1.
What they test:
Code refactoring judgment. The interviewer hands you a deeply nested query and watches how you decompose it. They care about your reasoning, not just the rewrite.
Approach:
Extract each subquery into a named CTE, starting from the innermost. Name each CTE after what it computes (e.g., daily_totals, ranked_products). State that CTEs win for readability and reuse, but for a one-off scalar comparison, a subquery is fine.
What they test:
NULL awareness and defensive SQL. This is a knowledge question, not a coding question. The answer is entirely about how NULLs interact with NOT IN.
Approach:
If the subquery in NOT IN returns even a single NULL, the entire NOT IN predicate evaluates to UNKNOWN, and zero rows come back. NOT EXISTS handles NULLs correctly because it checks for row existence, not value equality. Always default to NOT EXISTS unless you are certain the inner column has no NULLs.
These are not obscure edge cases. They appear in real interview debriefs as reasons candidates get rejected.
If the subquery in a NOT IN clause returns any NULL value, the entire predicate yields no rows. This happens because SQL uses three-valued logic: comparing a value to NULL produces UNKNOWN, and NOT IN requires every comparison to be FALSE. One NULL makes the whole thing UNKNOWN.
-- BUG: returns zero rows if any manager_id is NULL
SELECT name FROM employees
WHERE id NOT IN (
SELECT manager_id FROM employees
);
-- FIX: use NOT EXISTS
SELECT name FROM employees e
WHERE NOT EXISTS (
SELECT 1 FROM employees m
WHERE m.manager_id = e.id
);This mistake eliminates more candidates than any other subquery error. Add a WHERE ... IS NOT NULL filter inside the subquery if you must use NOT IN.
A correlated subquery re-executes for every row of the outer query. On a 500K-row table, that means 500K inner queries. If the correlated column is not indexed, each inner query does a full table scan. You end up with 500K * 500K = 250 billion row reads in the worst case.
-- Slow: full scan of orders for EACH customer row
SELECT c.name, (
SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.id
) AS order_count
FROM customers c;
-- Better: JOIN and GROUP BY
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;In interviews, write the correlated version if asked, then immediately offer the JOIN rewrite and explain why it scales better.
A scalar subquery must return exactly one row. If it returns two or more, the database throws an error. This often happens when you forget a GROUP BY or when the data has unexpected duplicates. The query works in development (where the test data happens to have one match) and breaks in production.
-- ERROR if a customer has multiple addresses
SELECT name, (
SELECT city FROM addresses WHERE customer_id = c.id
) AS city
FROM customers c;
-- FIX: add LIMIT 1 or aggregate
SELECT name, (
SELECT city FROM addresses
WHERE customer_id = c.id
ORDER BY created_at DESC LIMIT 1
) AS city
FROM customers c;Always assume a subquery could return multiple rows unless you have a UNIQUE constraint guaranteeing otherwise. Add LIMIT 1 with an ORDER BY, or use MAX/MIN to force a scalar.
Real pipeline work rewards the engineer who can sketch a query plan before writing SQL. Practice the same way: decompose the problem, then write the subquery, then run it.