Advanced SQL: Window Functions, CTEs, Recursive...

Most candidates think 'advanced SQL' means memorizing obscure functions. Wrong. Advanced SQL means reading an EXPLAIN plan, knowing when RANK ties hurt you, and understanding why your LATERAL join is the only thing standing between a correlated subquery and a production outage. Interviewers do not test whether you have heard of window functions. They test whether you reach for the right one.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
21%
PARTITION BY frequency
15%
ROW_NUMBER frequency
11%
RANK/DENSE_RANK frequency
8%
CTE frequency

7 Advanced SQL Topics

Each topic includes a concept explanation, a production-style code example, and a link to a deeper reference page.

/sql/window-functions

Window Functions

Window functions compute values across a set of rows related to the current row without collapsing the result set. Unlike GROUP BY, which produces one row per group, window functions keep every row and add a computed column. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, and NTILE are the most tested functions. The PARTITION BY clause defines the window groups. ORDER BY within the OVER clause defines the row ordering for ranking and running calculations.

Deep dive: /sql/window-functions
/sql/cte

Common Table Expressions (CTEs)

CTEs define temporary named result sets that exist only for the duration of a single query. They replace subqueries and make complex SQL readable by breaking it into named steps. Each CTE is defined in the WITH clause and can reference CTEs defined before it. CTEs do not imply materialization: the optimizer may inline them, merge them, or execute them separately depending on the query plan.

Deep dive: /sql/cte
/sql/cte

Recursive Queries

Recursive CTEs query hierarchical or graph-structured data. The recursive CTE has two parts: the anchor member (base case) and the recursive member (the step that references the CTE itself). The database executes the anchor first, then repeatedly executes the recursive member until it produces no new rows. Common use cases: org charts, bill of materials, category trees, and shortest path calculations.

Deep dive: /sql/cte
/sql/joins-practice

Lateral Joins

LATERAL allows a subquery in the FROM clause to reference columns from preceding tables. Without LATERAL, each subquery in FROM is independent. With LATERAL, the subquery runs once per row of the preceding table, similar to a correlated subquery but in the FROM clause. This is the most efficient way to get the top N per group pattern without window functions.

Deep dive: /sql/joins-practice
/sql-query-optimization

Query Optimization

Writing correct SQL is step one. Writing fast SQL is step two. Query optimization requires understanding indexes (B-tree, hash, GIN, GiST), execution plans (EXPLAIN ANALYZE), join algorithms (nested loop, hash join, merge join), and common anti-patterns (SELECT *, functions on indexed columns, implicit type casts, correlated subqueries). Data engineers must optimize queries that run against tables with billions of rows.

Deep dive: /sql-query-optimization
/sql/union-vs-union-all

Set Operations

UNION, INTERSECT, and EXCEPT combine or compare result sets. UNION stacks rows vertically. INTERSECT returns rows that appear in both result sets. EXCEPT returns rows from the first set that do not appear in the second. Each has an ALL variant that skips deduplication. These operations follow set theory and are tested in interviews as an alternative to JOIN-based solutions.

Deep dive: /sql/union-vs-union-all
/sql-interview-questions

Dynamic SQL and Prepared Statements

Dynamic SQL constructs and executes SQL strings at runtime. It is used for queries where the table name, column list, or filter conditions are not known until execution time. In PostgreSQL, EXECUTE with format() builds safe dynamic queries. In application code, parameterized queries (prepared statements) prevent SQL injection while allowing dynamic values. Data engineers use dynamic SQL for metadata-driven pipelines, automated schema migrations, and generic data quality checks.

Deep dive: /sql-interview-questions

4 Advanced SQL Interview Questions

Q1: What is the difference between ROW_NUMBER, RANK, and DENSE_RANK? When would you use each?

What they test: Window function fundamentals. All three assign numbers to rows within a partition. ROW_NUMBER gives unique sequential numbers (1, 2, 3) even for ties. RANK gives the same number to ties but skips values (1, 1, 3). DENSE_RANK gives the same number to ties without skipping (1, 1, 2). Approach: ROW_NUMBER when you need exactly one row per group (deduplication, top-1 per group). RANK for competition-style ranking where ties skip positions (Olympic medals). DENSE_RANK when you need consecutive rank numbers without gaps. Give the concrete output for tied values: salaries of 100k, 100k, 90k produce ROW_NUMBER (1,2,3), RANK (1,1,3), DENSE_RANK (1,1,2).

Q2: Explain what a recursive CTE is and give a real-world use case.

What they test: Whether you have worked with hierarchical data. Org charts, category trees, bill of materials, and graph traversal all require recursive queries. Approach: A recursive CTE has two parts joined by UNION ALL. The anchor member is the starting point (e.g., the root node). The recursive member joins the CTE to itself to find the next level. The database repeats the recursive member until no new rows are produced. Real-world example: find all subordinates under a manager in an org chart. Mention the safety limit (WHERE depth < N) to prevent infinite recursion when the data has cycles.

Q3: A query that worked fine on a small dataset is now taking 10 minutes on production data. Walk through your debugging process.

What they test: Systematic query optimization skills. The interviewer wants a structured approach, not random guessing. Approach: Step 1: Run EXPLAIN ANALYZE to see the actual execution plan and row estimates. Step 2: Look for sequential scans on large tables (missing index on the filter/join column). Step 3: Check if estimated rows are wildly different from actual rows (stale statistics, run ANALYZE). Step 4: Look at the join algorithm: nested loop on large tables is a red flag. Step 5: Check for functions on indexed columns (WHERE YEAR(date_col) = 2024 prevents index use). Step 6: Consider materialized CTEs if the same subquery is evaluated multiple times.

Q4: When would you use a LATERAL join instead of a window function?

What they test: Understanding of advanced join types and when they outperform window functions. LATERAL is often more efficient for top-N-per-group because it can use an index scan with LIMIT. Approach: Window function approach: ROW_NUMBER() OVER (PARTITION BY group ORDER BY value DESC) then filter to rn <= 3. This computes a row number for every row, then discards most of them. LATERAL approach: for each group, run a subquery with ORDER BY and LIMIT 3. If there is an index on (group, value DESC), each LATERAL execution is an index scan returning 3 rows. For large tables with many groups, LATERAL can be orders of magnitude faster.

Advanced SQL FAQ

What counts as advanced SQL?+
Advanced SQL goes beyond basic SELECT, JOIN, GROUP BY, and WHERE. It includes window functions (ROW_NUMBER, RANK, LAG, LEAD, running totals), CTEs and recursive CTEs, lateral joins, query optimization (EXPLAIN ANALYZE, index design), set operations (INTERSECT, EXCEPT), advanced aggregation (GROUPING SETS, CUBE, ROLLUP), and dynamic SQL. Data engineering interviews test these topics at the senior level.
How important is advanced SQL for data engineering interviews?+
Very important. Most data engineering interviews include at least one question on window functions and one on CTEs. Senior-level interviews also test query optimization, recursive queries, and the ability to choose between different approaches (window function vs LATERAL join, CTE vs subquery). Companies like Meta, Amazon, and Snowflake all test advanced SQL skills in their data engineering interview loops.
Should I learn advanced SQL before or after learning Python for data engineering?+
SQL first. Most data engineering work involves writing and optimizing SQL queries. Python is used for orchestration, API integration, and custom transformations that SQL cannot handle. In interviews, SQL questions appear at every company and every level. Python questions are more common at senior levels. Start with advanced SQL, then add Python as a complement.
What is the best way to practice advanced SQL?+
Practice with problems that require window functions, CTEs, and multi-step logic, not just basic SELECT queries. Write queries against real-world schemas (orders, customers, products) with enough rows to see performance differences. Read execution plans (EXPLAIN ANALYZE) for every query you write. Timed practice matters because interviews have time pressure.
02 / Why practice

The Trap Candidates Miss. You Won't.

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related Guides