SQL Practice

SQL Subqueries for Data Engineers

Every non-trivial pipeline query you've ever shipped is a tree. Subqueries are how you build that tree inline. In our corpus they appear in 19% of verified rounds and CTEs in 8%, which tells you most interviewers still reach for the older construct when the problem gets layered.

The architecture question isn't "can you write a subquery." It's where a subquery fits in your query plan versus a derived table versus a CTE versus a join. Each choice has a different shape in the optimizer's output and a different failure mode in production. This guide maps all four against the patterns interviewers actually ask.

19%

Questions using subqueries

8%

CTE frequency

122

Onsite SQL rounds analyzed

3

Subquery shapes to know cold

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

What Is a SQL Subquery?

A subquery is a nested SELECT that feeds a parent query. In a query plan it shows up as a subplan node, materialized once or executed per row depending on whether it's correlated. Three shapes map to every use case you'll see in a pipeline or an interview whiteboard.

Scalar Subquery

Returns exactly one row and one column. Used wherever a single value is expected: in the SELECT list, in a WHERE comparison, or as a default value. If the subquery returns more than one row, the engine throws an error.

Table Subquery

Returns multiple rows (and optionally multiple columns). Used with IN, EXISTS, ANY, ALL, or in the FROM clause as a derived table. This is the most common category in interviews.

Correlated Subquery

References a column from the outer query, so it re-executes for every row. More powerful than non-correlated subqueries, but potentially slower. Required for row-by-row comparisons like “find employees earning above their department average.”

Subquery Types with Examples

Each type solves a different class of problem. Interviewers expect you to pick the right one without prompting.

Scalar Subquery: Compare Against an Aggregate

Find all employees paid more than the company average. The subquery computes one number; the outer query filters against it. This runs once because the inner SELECT has no reference to the outer table.

SELECT name, salary, department_id
FROM employees
WHERE salary > (
  SELECT AVG(salary)
  FROM employees
);

Interview note: This is the most common subquery question at phone-screen level. The interviewer checks whether you know to use a scalar subquery vs. a self-join. Both work, but the subquery is 3 lines shorter.

Table Subquery with IN

Find customers who placed at least one order in the last 30 days. The subquery returns a list of customer_ids. The outer query filters against that list. The optimizer typically converts this into a semi-join.

SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
  SELECT DISTINCT customer_id
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
);

Interview note: If the interviewer asks you to rewrite this without IN, use EXISTS. Both produce identical execution plans in PostgreSQL 14+ and Snowflake. The choice is style, not performance.

Table Subquery with EXISTS

EXISTS returns TRUE the moment the subquery finds its first matching row. It does not need to scan the entire inner result set. This makes EXISTS slightly more efficient than IN when the inner table is large and unindexed.

SELECT d.department_name
FROM departments d
WHERE EXISTS (
  SELECT 1
  FROM employees e
  WHERE e.department_id = d.id
    AND e.hire_date >= '2024-01-01'
);

Interview note: “SELECT 1” inside EXISTS is convention. The database ignores the SELECT list entirely. You could write SELECT * or SELECT 42. It does not affect correctness or performance.

Derived Table (Subquery in FROM)

Place a subquery in the FROM clause to create an inline table. This is useful when you need to aggregate first, then filter or join the aggregated result. Before CTEs became widespread, derived tables were the primary way to layer transformations.

SELECT dept_name, avg_salary
FROM (
  SELECT
    d.department_name AS dept_name,
    AVG(e.salary) AS avg_salary
  FROM employees e
  JOIN departments d ON e.department_id = d.id
  GROUP BY d.department_name
) AS dept_averages
WHERE avg_salary > 85000;

Interview note: Most interviewers prefer the CTE version of this pattern. But some older codebases and Redshift workloads still use derived tables heavily. Know both.

Correlated Subquery

Find employees whose salary exceeds their own department's average. The inner query references e.department_id from the outer query, so it re-runs for each row. This is the only subquery type that cannot be replaced by a non-correlated approach without restructuring the logic.

SELECT e.name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
  SELECT AVG(e2.salary)
  FROM employees e2
  WHERE e2.department_id = e.department_id
);

Interview note: This is a top-5 SQL interview question. The faster alternative uses a CTE or window function (AVG() OVER (PARTITION BY department_id)), but interviewers sometimes specifically ask for the correlated version to test whether you understand row-by-row execution.

Subquery vs CTE vs JOIN

Interviewers frequently ask when to pick one over the other. The answer depends on readability, reuse, and what columns you need in the output.

CriteriaSubqueryCTEJOIN
Readability at 1 levelGoodGoodGood
Readability at 3+ levelsPoor (deep nesting)Best (named steps)Moderate
Reuse same result twiceMust duplicateReference by nameMust duplicate
Columns in outputOuter table onlyAny CTE columnBoth tables
Existence checksEXISTS / NOT EXISTSPossible but verboseLEFT JOIN + IS NULL
RecursionNot supportedWITH RECURSIVENot supported

Same Problem, Three Approaches

Find departments where the average salary exceeds 90,000.

Subquery (derived table)

SELECT dept_name, avg_sal
FROM (
  SELECT department_name AS dept_name, AVG(salary) AS avg_sal
  FROM employees e
  JOIN departments d ON e.department_id = d.id
  GROUP BY department_name
) sub
WHERE avg_sal > 90000;

CTE

WITH dept_avg AS (
  SELECT department_name, AVG(salary) AS avg_sal
  FROM employees e
  JOIN departments d ON e.department_id = d.id
  GROUP BY department_name
)
SELECT department_name, avg_sal
FROM dept_avg
WHERE avg_sal > 90000;

JOIN with HAVING

SELECT d.department_name, AVG(e.salary) AS avg_sal
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY d.department_name
HAVING AVG(e.salary) > 90000;

Interview note: All three produce identical results. The HAVING approach is shortest for this specific case. But the moment you need a second filter on the aggregated data (e.g., rank departments), you will need the CTE or subquery approach. Show the interviewer you know the tradeoffs.

Correlated Subqueries: When They Are the Only Option

A correlated subquery depends on values from the outer query. The database cannot run it once and cache the result. It re-evaluates the inner SELECT for every row the outer query produces. This has real performance consequences, but there are problems that genuinely require it.

How Execution Works

For a table with 10,000 rows, a correlated subquery runs the inner SELECT up to 10,000 times. The optimizer may cache results for repeated input values (a technique called “subquery result caching”), but you should not count on it. Measure with EXPLAIN ANALYZE.

-- For each order, find the most recent order from the same customer
SELECT o.order_id, o.customer_id, o.order_date, o.total
FROM orders o
WHERE o.order_date = (
  SELECT MAX(o2.order_date)
  FROM orders o2
  WHERE o2.customer_id = o.customer_id
);

Performance: When to Worry

Small tables (under 10K rows): Correlated subqueries run fine. The optimizer handles them without issues. Do not over-optimize in an interview.

Medium tables (10K to 1M rows): Performance depends on indexing. If the correlated column (e.g., customer_id) is indexed, the inner query uses an index seek for each iteration. Without an index, you get a full scan per row.

Large tables (1M+ rows): Rewrite to a window function or a CTE with a join. The correlated approach becomes impractical. In interviews, mentioning this threshold shows production awareness.

The Window Function Alternative

The same “latest order per customer” problem rewritten with ROW_NUMBER. One pass over the data instead of N correlated lookups.

WITH ranked AS (
  SELECT
    order_id, customer_id, order_date, total,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id ORDER BY order_date DESC
    ) AS rn
  FROM orders
)
SELECT order_id, customer_id, order_date, total
FROM ranked
WHERE rn = 1;

Interview note: If the interviewer asks for a correlated subquery, write it. Then say: “For tables over a few hundred thousand rows, I would use ROW_NUMBER with a CTE instead.” This demonstrates you know both patterns and understand the performance boundary.

6 Subquery Interview Questions

These cover the range from phone screen to onsite. Each tests a different aspect of subquery knowledge.

Q1: Find all employees earning above the company average salary.

What they test:

Basic scalar subquery syntax. Can you write a WHERE clause that compares against an aggregated value? This is the simplest subquery question and appears in nearly every phone screen that covers SQL.

Approach:

WHERE salary > (SELECT AVG(salary) FROM employees). One scalar subquery. Do not overthink it. State that the subquery runs once and returns a single number.

Q2: Find customers who have never placed an order.

What they test:

Anti-join pattern. Three valid approaches exist: NOT IN, NOT EXISTS, and LEFT JOIN with IS NULL. The interviewer wants to see if you know the NULL trap with NOT IN.

Approach:

Use NOT EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id). Mention that NOT IN breaks if the subquery returns any NULLs. This detail alone sets you apart from most candidates.

Q3: For each department, find employees earning above that department's average.

What they test:

Correlated subquery vs. window function. The interviewer is checking whether you can write the correlated version AND whether you know the window function alternative.

Approach:

Correlated: WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id). Then offer the window function rewrite using AVG() OVER (PARTITION BY department_id).

Q4: Find the second highest salary in each department.

What they test:

Nested or correlated subquery for ranking. This is a classic “Nth highest” problem. The subquery approach counts how many distinct salaries are higher.

Approach:

Use DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) in a CTE, then filter WHERE rank = 2. If forced to use subqueries only, count distinct higher salaries: WHERE (SELECT COUNT(DISTINCT salary) FROM employees e2 WHERE e2.department_id = e.department_id AND e2.salary > e.salary) = 1.

Q5: Rewrite this nested subquery as a CTE. Which do you prefer and why?

What they test:

Code refactoring judgment. The interviewer hands you a deeply nested query and watches how you decompose it. They care about your reasoning, not just the rewrite.

Approach:

Extract each subquery into a named CTE, starting from the innermost. Name each CTE after what it computes (e.g., daily_totals, ranked_products). State that CTEs win for readability and reuse, but for a one-off scalar comparison, a subquery is fine.

Q6: Explain when NOT EXISTS is better than NOT IN.

What they test:

NULL awareness and defensive SQL. This is a knowledge question, not a coding question. The answer is entirely about how NULLs interact with NOT IN.

Approach:

If the subquery in NOT IN returns even a single NULL, the entire NOT IN predicate evaluates to UNKNOWN, and zero rows come back. NOT EXISTS handles NULLs correctly because it checks for row existence, not value equality. Always default to NOT EXISTS unless you are certain the inner column has no NULLs.

3 Subquery Mistakes That Fail Interviews

These are not obscure edge cases. They appear in real interview debriefs as reasons candidates get rejected.

NOT IN with NULLs

If the subquery in a NOT IN clause returns any NULL value, the entire predicate yields no rows. This happens because SQL uses three-valued logic: comparing a value to NULL produces UNKNOWN, and NOT IN requires every comparison to be FALSE. One NULL makes the whole thing UNKNOWN.

-- BUG: returns zero rows if any manager_id is NULL
SELECT name FROM employees
WHERE id NOT IN (
  SELECT manager_id FROM employees
);

-- FIX: use NOT EXISTS
SELECT name FROM employees e
WHERE NOT EXISTS (
  SELECT 1 FROM employees m
  WHERE m.manager_id = e.id
);

This mistake eliminates more candidates than any other subquery error. Add a WHERE ... IS NOT NULL filter inside the subquery if you must use NOT IN.

Correlated Subquery on a Large Table Without an Index

A correlated subquery re-executes for every row of the outer query. On a 500K-row table, that means 500K inner queries. If the correlated column is not indexed, each inner query does a full table scan. You end up with 500K * 500K = 250 billion row reads in the worst case.

-- Slow: full scan of orders for EACH customer row
SELECT c.name, (
  SELECT COUNT(*)
  FROM orders o
  WHERE o.customer_id = c.id
) AS order_count
FROM customers c;

-- Better: JOIN and GROUP BY
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;

In interviews, write the correlated version if asked, then immediately offer the JOIN rewrite and explain why it scales better.

Subquery Returning Multiple Rows in a Scalar Context

A scalar subquery must return exactly one row. If it returns two or more, the database throws an error. This often happens when you forget a GROUP BY or when the data has unexpected duplicates. The query works in development (where the test data happens to have one match) and breaks in production.

-- ERROR if a customer has multiple addresses
SELECT name, (
  SELECT city FROM addresses WHERE customer_id = c.id
) AS city
FROM customers c;

-- FIX: add LIMIT 1 or aggregate
SELECT name, (
  SELECT city FROM addresses
  WHERE customer_id = c.id
  ORDER BY created_at DESC LIMIT 1
) AS city
FROM customers c;

Always assume a subquery could return multiple rows unless you have a UNIQUE constraint guaranteeing otherwise. Add LIMIT 1 with an ORDER BY, or use MAX/MIN to force a scalar.

SQL Subquery FAQ

What is a subquery in SQL?+
A subquery is a SELECT statement nested inside another SQL statement. It can appear in the WHERE clause, FROM clause, SELECT list, or HAVING clause. The database executes the inner query first (for non-correlated subqueries) and uses its result in the outer query. Subqueries let you break a problem into steps without creating temporary tables.
When should I use a subquery instead of a JOIN?+
Use a subquery when you need a single aggregated value for filtering (e.g., WHERE salary > (SELECT AVG(salary) FROM employees)), when the inner query is logically independent, or when you need EXISTS/NOT EXISTS checks. Use a JOIN when you need columns from both tables in the output. In most engines, the optimizer rewrites simple subqueries into joins anyway, so the choice is often about readability.
Are subqueries slower than JOINs?+
Not necessarily. Modern query optimizers in PostgreSQL, BigQuery, Snowflake, and Trino can flatten subqueries into joins. Correlated subqueries are the exception: they re-execute for each row of the outer query, which can be slow on large tables. Non-correlated subqueries run once and are typically as fast as the equivalent JOIN.
What is a correlated subquery?+
A correlated subquery references a column from the outer query, so it cannot run independently. The database re-evaluates it for every row the outer query processes. Example: SELECT * FROM employees e WHERE salary > (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id). The inner query depends on e.department_id from the outer query.
Can I nest subqueries more than one level deep?+
Yes, SQL allows arbitrary nesting depth. But readability drops fast. Two levels of nesting is the practical limit before you should refactor into CTEs. In interviews, if you find yourself nesting three levels deep, stop and rewrite with WITH clauses. The interviewer is watching how you manage complexity, not how deeply you can nest.

Build the Query Tree on Paper First

Real pipeline work rewards the engineer who can sketch a query plan before writing SQL. Practice the same way: decompose the problem, then write the subquery, then run it.