The ANSI SQL:1999 standard was the one that finally added recursive queries. Before that, walking an org chart meant looping in application code, writing Oracle's proprietary CONNECT BY, or admitting defeat and denormalizing the whole hierarchy into a closure table. The 1999 committee borrowed the shape from Datalog research going back to the 1980s: a WITH RECURSIVE form with an anchor member and a recursive member joined by UNION ALL. Postgres shipped it in 8.4 (2009), SQLite in 3.8.3 (2014), and MySQL waited until 8.0 in 2018.
A quarter century later the pattern hasn't changed. Anchor row seeds the working set, recursive member produces the next iteration, loop ends when no rows come back. That's the shape the 1999 editors chose, and it's the one every interviewer still expects you to write.
Added to ANSI SQL
Postgres 8.4 ships it
MySQL 8.0 catches up
SQL rounds use CTEs
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
A recursive CTE has two halves separated by UNION ALL. The anchor member runs first and produces the initial working set. The recursive member then runs against that working set, producing new rows. Those new rows become the working set for the next iteration. This repeats until the recursive member returns an empty result. The engine collects all rows from every iteration into the final output.
The 1999 committee drew this directly from fixpoint semantics in Datalog: start with a seed, apply a rule, collect new facts, repeat until no new facts arrive. Earlier SQL editions had no native loop construct, so engineers leaned on Oracle's CONNECT BY (added in 1985) or wrote nested procedural PL/SQL. The 1999 form replaced both with a declarative two-part structure that every major engine eventually adopted.
WITH RECURSIVE cte_name AS (
-- Anchor member: runs once
SELECT columns
FROM base_table
WHERE starting_condition
UNION ALL
-- Recursive member: runs until zero rows
SELECT columns
FROM base_table
INNER JOIN cte_name ON join_condition
WHERE termination_guard
)
SELECT * FROM cte_name;RECURSIVE keyword: Required in PostgreSQL, MySQL, and SQLite. SQL Server and Oracle omit it (the engine detects self-reference automatically). BigQuery and Snowflake require it.
UNION ALL, not UNION: Always use UNION ALL. Plain UNION deduplicates between iterations, which breaks most hierarchical traversals and can cause the engine to reject the query entirely.
No aggregates in the recursive member: Most engines prohibit GROUP BY, DISTINCT, and aggregate functions in the recursive member. Aggregate in the final SELECT instead.
Interview note: When an interviewer asks you to traverse a hierarchy, start by stating the two-part structure out loud: “The anchor selects the root nodes, and the recursive member joins back to find children.” This signals you understand the mechanism before you write a line of code.
The anchor member is your base case. It defines where the recursion starts. For an org chart, the base case selects the CEO (WHERE manager_id IS NULL). For a date series, the base case selects the start date. For a graph traversal, the base case selects the source node. Getting the anchor right determines whether the rest of the recursion produces correct results.
The recursive member is the step function. It takes the output from the previous iteration and uses it to find the next set of rows. In a hierarchy, this means joining the base table to the CTE on the parent-child relationship. Each iteration goes one level deeper.
-- Org chart: find all employees under manager_id = 42
WITH RECURSIVE reports AS (
-- Base case: direct reports of manager 42
SELECT
id,
name,
manager_id,
1 AS depth
FROM employees
WHERE manager_id = 42
UNION ALL
-- Recursive case: find reports of reports
SELECT
e.id,
e.name,
e.manager_id,
r.depth + 1
FROM employees e
INNER JOIN reports r ON e.manager_id = r.id
WHERE r.depth < 20 -- safety guard
)
SELECT id, name, depth
FROM reports
ORDER BY depth, name;Iteration 0: The anchor finds all employees where manager_id = 42. Suppose that returns Alice, Bob, and Carol. These three rows are the working set.
Iteration 1: The recursive member joins employees to the working set (Alice, Bob, Carol). It finds everyone who reports to Alice, Bob, or Carol. Those new rows become the next working set.
Iteration N: This continues until an iteration produces zero rows. At that point, the engine concatenates all rows from all iterations and returns them.
Interview note: The depth counter serves double duty. It gives you the hierarchy level for output, and it gives you a termination guard. Always include it.
Hierarchies are the most common recursive CTE application. Org charts, category trees, file system paths, bill-of-materials, and comment threads all follow the same parent-child pattern. Without recursive CTEs, you would need either a fixed number of self-joins (which limits depth) or application-layer iteration (which means multiple round trips to the database).
This pattern builds the full path from root to each node as a concatenated string. It is useful for breadcrumb navigation, debugging hierarchy data, and generating display labels.
WITH RECURSIVE category_tree AS (
-- Anchor: top-level categories
SELECT
id,
name,
parent_id,
name::TEXT AS full_path,
1 AS depth
FROM categories
WHERE parent_id IS NULL
UNION ALL
-- Recursive: append child name to parent path
SELECT
c.id,
c.name,
c.parent_id,
ct.full_path || ' > ' || c.name,
ct.depth + 1
FROM categories c
INNER JOIN category_tree ct ON c.parent_id = ct.id
WHERE ct.depth < 30
)
SELECT id, name, full_path, depth
FROM category_tree
ORDER BY full_path;Manufacturing data often has a parts-to-subparts hierarchy. A recursive CTE can “explode” a product into every component at every level, accumulating quantities along the way.
WITH RECURSIVE bom AS (
-- Anchor: top-level product
SELECT
part_id,
component_id,
quantity,
1 AS level
FROM parts_hierarchy
WHERE part_id = 'PRODUCT-001'
UNION ALL
-- Recursive: drill into sub-components
SELECT
ph.part_id,
ph.component_id,
ph.quantity * b.quantity AS quantity,
b.level + 1
FROM parts_hierarchy ph
INNER JOIN bom b ON ph.part_id = b.component_id
WHERE b.level < 15
)
SELECT component_id, SUM(quantity) AS total_needed
FROM bom
GROUP BY component_id
ORDER BY total_needed DESC;Time-series analysis often requires a row for every date in a range, even when no events occurred on certain days. PostgreSQL has generate_series() for this, but MySQL, SQL Server, and many cloud warehouses do not have an equivalent. Recursive CTEs fill the gap portably.
-- Generate every date in Q1 2024
WITH RECURSIVE dates AS (
SELECT DATE '2024-01-01' AS dt
UNION ALL
SELECT dt + INTERVAL '1 day'
FROM dates
WHERE dt < DATE '2024-03-31'
)
SELECT dt
FROM dates;The anchor produces the start date. The recursive member adds one day and stops when the date exceeds the end of the range. The result is 91 rows, one per day. You can then LEFT JOIN this to your events table to fill in zeros for missing days.
WITH RECURSIVE dates AS (
SELECT DATE '2024-01-01' AS dt
UNION ALL
SELECT dt + INTERVAL '1 day'
FROM dates
WHERE dt < DATE '2024-03-31'
),
daily_events AS (
SELECT
DATE(event_time) AS event_date,
COUNT(*) AS event_count
FROM events
WHERE event_time >= '2024-01-01'
AND event_time < '2024-04-01'
GROUP BY DATE(event_time)
)
SELECT
d.dt AS date,
COALESCE(de.event_count, 0) AS events
FROM dates d
LEFT JOIN daily_events de ON d.dt = de.event_date
ORDER BY d.dt;Interview note: When an interviewer asks for daily metrics with no gaps, the date series CTE is the standard answer. Mention it before they ask about missing days. It shows you have dealt with sparse time-series data in production.
Recursive CTEs can walk directed graphs stored as edge lists. This covers social network connections, dependency resolution, and shortest-path approximations. The pattern extends the hierarchy approach by tracking visited nodes to handle cycles.
-- Find all nodes reachable from node 'A'
WITH RECURSIVE reachable AS (
-- Anchor: start at node A
SELECT
source_node,
target_node,
ARRAY[source_node] AS visited,
1 AS hops
FROM edges
WHERE source_node = 'A'
UNION ALL
-- Recursive: follow edges, skip visited nodes
SELECT
e.source_node,
e.target_node,
r.visited || e.target_node,
r.hops + 1
FROM edges e
INNER JOIN reachable r ON e.source_node = r.target_node
WHERE NOT e.target_node = ANY(r.visited)
AND r.hops < 10
)
SELECT DISTINCT target_node, MIN(hops) AS shortest_hops
FROM reachable
GROUP BY target_node
ORDER BY shortest_hops;The visited array prevents infinite loops when the graph has cycles. Each iteration checks whether the target node has already been visited in the current path. This is PostgreSQL-specific syntax; in SQL Server you would concatenate visited node IDs into a VARCHAR and use CHARINDEX for the check.
Data pipeline dependencies form a directed acyclic graph (DAG). A recursive CTE can find all upstream dependencies of a given task, which is useful for impact analysis when a source table changes.
-- Find all upstream dependencies of task 'final_report'
WITH RECURSIVE upstream AS (
SELECT
dependency_task AS task_name,
1 AS distance
FROM task_dependencies
WHERE task_name = 'final_report'
UNION ALL
SELECT
td.dependency_task,
u.distance + 1
FROM task_dependencies td
INNER JOIN upstream u ON td.task_name = u.task_name
WHERE u.distance < 20
)
SELECT DISTINCT task_name, MIN(distance) AS min_distance
FROM upstream
GROUP BY task_name
ORDER BY min_distance;A recursive CTE terminates when the recursive member returns zero rows. This is the implicit termination condition. But relying on it alone is dangerous: bad data (circular references, missing NULLs) can cause infinite recursion. Production code needs explicit guards.
Add a depth or iteration counter. Increment it in the recursive member and filter with WHERE depth < N. This is the most common guard and works on every engine.
-- Stop at depth 50 regardless of data
WHERE ct.depth < 50SQL Server defaults to 100 iterations (change with OPTION MAXRECURSION N). MySQL 8.0+ defaults to 1000 (cte_max_recursion_depth). PostgreSQL has no default limit, so you must add your own guard or set statement_timeout.
For graphs with cycles, accumulate visited IDs in an array column (PostgreSQL) or a concatenated string (SQL Server). Check each new row against the visited set before including it. This detects cycles at the row level rather than relying on a global depth limit.
-- PostgreSQL: array-based cycle detection
WHERE NOT e.target_id = ANY(r.visited_ids)
AND r.depth < 50Interview note: Mentioning termination conditions unprompted is a strong signal. After writing your recursive CTE, say: “In production I would add a depth guard and possibly array-based cycle detection to handle dirty data.”
Recursive CTEs are not free. Each iteration is a separate query execution against the working set. For wide or deep hierarchies, this can be expensive. Understanding the cost model helps you decide when a recursive CTE is the right tool and when an alternative approach (materialized paths, nested sets, or application-layer traversal) is better.
| Factor | Impact |
|---|---|
| Depth of recursion | Each level is a round of execution. 5 levels is fast; 500 can be slow. |
| Branching factor | If each node has 10 children, row count grows exponentially. Filter early. |
| Index on join column | The recursive join must be fast. Index the parent_id / manager_id column. |
| Columns carried forward | Accumulating arrays or long strings per row increases memory per iteration. |
| Alternative data models | Materialized paths or nested sets avoid recursion entirely for read-heavy workloads. |
For most interview problems, recursion depth is under 20 and performance is not a concern. But interviewers will sometimes ask: “What if this table has 10 million rows and 50 levels?” The correct answer is to discuss indexing, depth limits, and whether a different data model (like ltree in PostgreSQL) would serve the access pattern better.
These three questions appear consistently in data engineering interviews at companies that work with hierarchical data, which is most of them.
What they test:
Core recursive CTE syntax. They want anchor + recursive member + UNION ALL. Bonus points for including a depth column and a safety guard.
Approach:
Anchor: SELECT employees where manager_id equals the target. Recursive: JOIN employees where manager_id equals the previous level's id. Include depth counter. Mention WHERE depth < N for cycle safety.
What they test:
Recursive CTE for sequence generation plus gap-filling with LEFT JOIN and COALESCE. This tests whether you know the date spine pattern for time-series reporting.
Approach:
Anchor: SELECT DATE '2024-01-01'. Recursive: add INTERVAL '1 day', stop at 2024-12-31. LEFT JOIN to daily event counts. COALESCE nulls to 0.
What they test:
Graph traversal with cycle detection. This is the hardest recursive CTE question. They want the visited-array pattern and a clear explanation of termination.
Approach:
Anchor: SELECT edges from source node, initialize visited array. Recursive: follow outgoing edges, check NOT target = ANY(visited), increment hop count. Final SELECT: GROUP BY target, MIN(hops) for shortest path. Mention that this is BFS, not Dijkstra (no weighted edges).
Syntax varies slightly across engines. Here is what to know for the platforms you are most likely to encounter.
| Engine | RECURSIVE Keyword | Default Limit |
|---|---|---|
| PostgreSQL | Required | None (add your own guard) |
| SQL Server | Not used | 100 (MAXRECURSION to change) |
| MySQL 8.0+ | Required | 1000 (cte_max_recursion_depth) |
| BigQuery | Required | 500 |
| Snowflake | Required | Configurable per query |
| Oracle | Not used (uses CONNECT BY too) | None |
Twenty-five years after the committee shipped WITH RECURSIVE, candidates still stumble on the anchor/recursive split. Practice until you can write it without looking up the shape.
Complete CTE reference: basic syntax, chaining, CTE vs subquery, CTE vs temp table
Window functions, lateral joins, grouping sets, and every advanced SQL topic for interviews
Every SQL topic tested in data engineering interviews, with approaches and patterns