Advanced SQL: Window Functions, CTEs, Recursive...
Most candidates think 'advanced SQL' means memorizing obscure functions. Wrong. Advanced SQL means reading an EXPLAIN plan, knowing when RANK ties hurt you, and understanding why your LATERAL join is the only thing standing between a correlated subquery and a production outage. Interviewers do not test whether you have heard of window functions. They test whether you reach for the right one.
7 Advanced SQL Topics
Each topic includes a concept explanation, a production-style code example, and a link to a deeper reference page.
Window Functions
Window functions compute values across a set of rows related to the current row without collapsing the result set. Unlike GROUP BY, which produces one row per group, window functions keep every row and add a computed column. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, and NTILE are the most tested functions. The PARTITION BY clause defines the window groups. ORDER BY within the OVER clause defines the row ordering for ranking and running calculations.
Common Table Expressions (CTEs)
CTEs define temporary named result sets that exist only for the duration of a single query. They replace subqueries and make complex SQL readable by breaking it into named steps. Each CTE is defined in the WITH clause and can reference CTEs defined before it. CTEs do not imply materialization: the optimizer may inline them, merge them, or execute them separately depending on the query plan.
Recursive Queries
Recursive CTEs query hierarchical or graph-structured data. The recursive CTE has two parts: the anchor member (base case) and the recursive member (the step that references the CTE itself). The database executes the anchor first, then repeatedly executes the recursive member until it produces no new rows. Common use cases: org charts, bill of materials, category trees, and shortest path calculations.
Lateral Joins
LATERAL allows a subquery in the FROM clause to reference columns from preceding tables. Without LATERAL, each subquery in FROM is independent. With LATERAL, the subquery runs once per row of the preceding table, similar to a correlated subquery but in the FROM clause. This is the most efficient way to get the top N per group pattern without window functions.
Query Optimization
Writing correct SQL is step one. Writing fast SQL is step two. Query optimization requires understanding indexes (B-tree, hash, GIN, GiST), execution plans (EXPLAIN ANALYZE), join algorithms (nested loop, hash join, merge join), and common anti-patterns (SELECT *, functions on indexed columns, implicit type casts, correlated subqueries). Data engineers must optimize queries that run against tables with billions of rows.
Set Operations
UNION, INTERSECT, and EXCEPT combine or compare result sets. UNION stacks rows vertically. INTERSECT returns rows that appear in both result sets. EXCEPT returns rows from the first set that do not appear in the second. Each has an ALL variant that skips deduplication. These operations follow set theory and are tested in interviews as an alternative to JOIN-based solutions.
Dynamic SQL and Prepared Statements
Dynamic SQL constructs and executes SQL strings at runtime. It is used for queries where the table name, column list, or filter conditions are not known until execution time. In PostgreSQL, EXECUTE with format() builds safe dynamic queries. In application code, parameterized queries (prepared statements) prevent SQL injection while allowing dynamic values. Data engineers use dynamic SQL for metadata-driven pipelines, automated schema migrations, and generic data quality checks.
4 Advanced SQL Interview Questions
Q1: What is the difference between ROW_NUMBER, RANK, and DENSE_RANK? When would you use each?
What they test: Window function fundamentals. All three assign numbers to rows within a partition. ROW_NUMBER gives unique sequential numbers (1, 2, 3) even for ties. RANK gives the same number to ties but skips values (1, 1, 3). DENSE_RANK gives the same number to ties without skipping (1, 1, 2). Approach: ROW_NUMBER when you need exactly one row per group (deduplication, top-1 per group). RANK for competition-style ranking where ties skip positions (Olympic medals). DENSE_RANK when you need consecutive rank numbers without gaps. Give the concrete output for tied values: salaries of 100k, 100k, 90k produce ROW_NUMBER (1,2,3), RANK (1,1,3), DENSE_RANK (1,1,2).
Q2: Explain what a recursive CTE is and give a real-world use case.
What they test: Whether you have worked with hierarchical data. Org charts, category trees, bill of materials, and graph traversal all require recursive queries. Approach: A recursive CTE has two parts joined by UNION ALL. The anchor member is the starting point (e.g., the root node). The recursive member joins the CTE to itself to find the next level. The database repeats the recursive member until no new rows are produced. Real-world example: find all subordinates under a manager in an org chart. Mention the safety limit (WHERE depth < N) to prevent infinite recursion when the data has cycles.
Q3: A query that worked fine on a small dataset is now taking 10 minutes on production data. Walk through your debugging process.
What they test: Systematic query optimization skills. The interviewer wants a structured approach, not random guessing. Approach: Step 1: Run EXPLAIN ANALYZE to see the actual execution plan and row estimates. Step 2: Look for sequential scans on large tables (missing index on the filter/join column). Step 3: Check if estimated rows are wildly different from actual rows (stale statistics, run ANALYZE). Step 4: Look at the join algorithm: nested loop on large tables is a red flag. Step 5: Check for functions on indexed columns (WHERE YEAR(date_col) = 2024 prevents index use). Step 6: Consider materialized CTEs if the same subquery is evaluated multiple times.
Q4: When would you use a LATERAL join instead of a window function?
What they test: Understanding of advanced join types and when they outperform window functions. LATERAL is often more efficient for top-N-per-group because it can use an index scan with LIMIT. Approach: Window function approach: ROW_NUMBER() OVER (PARTITION BY group ORDER BY value DESC) then filter to rn <= 3. This computes a row number for every row, then discards most of them. LATERAL approach: for each group, run a subquery with ORDER BY and LIMIT 3. If there is an index on (group, value DESC), each LATERAL execution is an index scan returning 3 rows. For large tables with many groups, LATERAL can be orders of magnitude faster.
Advanced SQL FAQ
What counts as advanced SQL?+
How important is advanced SQL for data engineering interviews?+
Should I learn advanced SQL before or after learning Python for data engineering?+
What is the best way to practice advanced SQL?+
The Trap Candidates Miss. You Won't.
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition