Most candidates think "advanced SQL" means memorizing obscure functions. Wrong. Advanced SQL means reading an EXPLAIN plan, knowing when RANK ties hurt you, and understanding why your LATERAL join is the only thing standing between a correlated subquery and a production outage. Interviewers don't test whether you've heard of window functions. They test whether you reach for ROW_NUMBER versus RANK without pausing, and whether you can explain the difference in one sentence.
Seven topics below, chosen because interviewers actually ask about them, not because they sound impressive in a blog post. Each one comes with the specific mistake candidates make and how to avoid it.
[Chart: topic frequency across interview rounds — PARTITION BY, ROW_NUMBER, RANK/DENSE_RANK, and CTEs appear most often. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Each topic includes a concept explanation, a production-style code example, and a link to a deeper reference page.
Window functions compute values across a set of rows related to the current row without collapsing the result set. Unlike GROUP BY, which produces one row per group, window functions keep every row and add a computed column. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, and NTILE are the most tested functions. The PARTITION BY clause defines the window groups. ORDER BY within the OVER clause defines the row ordering for ranking and running calculations.
-- Running total and rank within each department
SELECT
    employee_id,
    department,
    salary,
    SUM(salary) OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS running_total,
    RANK() OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS salary_rank
FROM employees;

CTEs define temporary named result sets that exist only for the duration of a single query. They replace subqueries and make complex SQL readable by breaking it into named steps. Each CTE is defined in the WITH clause and can reference CTEs defined before it. CTEs do not imply materialization: the optimizer may inline them, merge them, or execute them separately depending on the query plan.
-- Multi-step analysis using CTEs
WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY 1
),
revenue_with_growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_month,
        revenue - LAG(revenue) OVER (ORDER BY month) AS growth
    FROM monthly_revenue
)
SELECT * FROM revenue_with_growth
WHERE growth < 0; -- months with revenue decline

Recursive CTEs query hierarchical or graph-structured data. A recursive CTE has two parts: the anchor member (base case) and the recursive member (the step that references the CTE itself). The database executes the anchor first, then repeatedly executes the recursive member until it produces no new rows. Common use cases: org charts, bills of materials, category trees, and shortest-path calculations.
-- Walk an org chart: find all reports under a manager
WITH RECURSIVE org_tree AS (
    -- Anchor: start with the manager
    SELECT employee_id, name, manager_id, 1 AS depth
    FROM employees
    WHERE employee_id = 100
    UNION ALL
    -- Recursive: find direct reports of the current level
    SELECT e.employee_id, e.name, e.manager_id, t.depth + 1
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.employee_id
    WHERE t.depth < 10 -- safety limit
)
SELECT * FROM org_tree ORDER BY depth, name;

LATERAL allows a subquery in the FROM clause to reference columns from preceding tables. Without LATERAL, each subquery in FROM is independent. With LATERAL, the subquery runs once per row of the preceding table, similar to a correlated subquery but in the FROM clause. It is often the most efficient way to express the 'top N per group' pattern without window functions.
-- Top 3 orders per customer using LATERAL
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS JOIN LATERAL (
    SELECT order_id, amount, order_date
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY o.amount DESC
    LIMIT 3
) top_orders;
-- The LATERAL subquery references c.customer_id
-- It runs once per customer, returning up to 3 rows
-- Customers with no orders are excluded (use LEFT JOIN LATERAL ... ON TRUE to keep them)

Writing correct SQL is step one. Writing fast SQL is step two. Query optimization requires understanding indexes (B-tree, hash, GIN, GiST), execution plans (EXPLAIN ANALYZE), join algorithms (nested loop, hash join, merge join), and common anti-patterns (SELECT *, functions on indexed columns, implicit type casts, correlated subqueries). Data engineers must optimize queries that run against tables with billions of rows.
-- Read the execution plan
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.signup_date >= '2024-01-01'
GROUP BY c.name
HAVING COUNT(o.order_id) > 5;
-- Look for: Seq Scan (missing index?), Hash Join vs Nested Loop,
-- actual rows vs estimated rows (statistics stale?),
-- execution time per node

UNION, INTERSECT, and EXCEPT combine or compare result sets. UNION stacks rows vertically. INTERSECT returns rows that appear in both result sets. EXCEPT returns rows from the first set that do not appear in the second. Each has an ALL variant that skips deduplication. These operations follow set theory and are tested in interviews as an alternative to JOIN-based solutions.
-- Customers in both the US and EU databases (intersection)
SELECT customer_id, email FROM us_customers
INTERSECT
SELECT customer_id, email FROM eu_customers;
-- Customers in US but not in EU (set difference)
SELECT customer_id, email FROM us_customers
EXCEPT
SELECT customer_id, email FROM eu_customers;
-- Both operations deduplicate by default
-- Use INTERSECT ALL / EXCEPT ALL to preserve duplicates

Dynamic SQL constructs and executes SQL strings at runtime. It is used for queries where the table name, column list, or filter conditions are not known until execution time. In PostgreSQL, EXECUTE with format() builds safe dynamic queries. In application code, parameterized queries (prepared statements) prevent SQL injection while allowing dynamic values. Data engineers use dynamic SQL for metadata-driven pipelines, automated schema migrations, and generic data quality checks.
-- PostgreSQL: dynamic SQL in a function
CREATE OR REPLACE FUNCTION count_rows(table_name TEXT)
RETURNS BIGINT AS $$
DECLARE
    row_count BIGINT;
BEGIN
    EXECUTE format('SELECT COUNT(*) FROM %I', table_name)
    INTO row_count;
    RETURN row_count;
END;
$$ LANGUAGE plpgsql;
-- %I safely quotes the identifier (prevents SQL injection)
-- Use %L for literal values: format('...WHERE name = %L', val)

Candidates who have memorized concepts fail these. Candidates who understand the failure modes pass. Each question below names the specific trap interviewers are setting and the one-sentence answer that proves you've seen it before.
What they test:
Window function fundamentals. All three assign numbers to rows within a partition. ROW_NUMBER gives unique sequential numbers (1, 2, 3) even for ties. RANK gives the same number to ties but skips values (1, 1, 3). DENSE_RANK gives the same number to ties without skipping (1, 1, 2). The interviewer wants examples of when each is appropriate.
Approach:
ROW_NUMBER: when you need exactly one row per group (deduplication, top-1 per group). RANK: when you need competition-style ranking where ties skip positions (Olympic medals). DENSE_RANK: when you need consecutive rank numbers without gaps (top-N where N means unique rank positions, not unique rows). Give the concrete output for tied values: salaries of 100k, 100k, 90k produce ROW_NUMBER (1,2,3), RANK (1,1,3), DENSE_RANK (1,1,2).
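The tie behavior in that answer can be demonstrated without any schema at all, using an inline VALUES list (standard SQL; the column name is arbitrary):

```sql
-- Tied salaries 100k, 100k, 90k: the three functions side by side
SELECT
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,   -- 1, 2, 3
    RANK()       OVER (ORDER BY salary DESC) AS rnk,       -- 1, 1, 3
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk  -- 1, 1, 2
FROM (VALUES (100000), (100000), (90000)) AS t(salary);
```

Running a three-row example like this on a whiteboard is usually enough to prove you have seen the distinction in practice.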
What they test:
Whether you have worked with hierarchical data. Org charts, category trees, bill of materials, and graph traversal all require recursive queries. The interviewer checks that you understand the anchor/recursive structure and the termination condition.
Approach:
A recursive CTE has two parts joined by UNION ALL. The anchor member is the starting point (e.g., the root node). The recursive member joins the CTE to itself to find the next level. The database repeats the recursive member until no new rows are produced. Real-world example: find all subordinates under a manager in an org chart. Mention the safety limit (WHERE depth < N) to prevent infinite recursion when the data has cycles.
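A depth cap stops runaway recursion but cannot tell you where the cycle is. One PostgreSQL-flavored sketch (hypothetical employees table, same shape as earlier) tracks the visited path instead:

```sql
-- Sketch (PostgreSQL syntax): carry the visited path so a repeated
-- node is detected explicitly rather than masked by a depth limit
WITH RECURSIVE org_tree AS (
    SELECT employee_id, manager_id,
           ARRAY[employee_id] AS path,
           FALSE AS is_cycle
    FROM employees
    WHERE employee_id = 100
    UNION ALL
    SELECT e.employee_id, e.manager_id,
           t.path || e.employee_id,
           e.employee_id = ANY(t.path)  -- TRUE when we revisit a node
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.employee_id
    WHERE NOT t.is_cycle  -- stop expanding once a cycle is found
)
SELECT * FROM org_tree;
```

PostgreSQL 14+ can express the same idea declaratively with the CYCLE clause on a recursive query.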
What they test:
Systematic query optimization skills. The interviewer wants to hear a structured approach, not random guessing. Start with the execution plan, identify the bottleneck, and apply targeted fixes.
Approach:
Step 1: Run EXPLAIN ANALYZE to see the actual execution plan and row estimates. Step 2: Look for sequential scans on large tables (missing index on the filter/join column). Step 3: Check if estimated rows are wildly different from actual rows (stale statistics, run ANALYZE). Step 4: Look at the join algorithm: nested loop on large tables is a red flag. Step 5: Check for functions on indexed columns (WHERE YEAR(date_col) = 2024 prevents index use; rewrite as WHERE date_col >= '2024-01-01'). Step 6: Consider materialized CTEs if the same subquery is evaluated multiple times.
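Step 5 above is the fix interviewers most often ask candidates to write out. A sketch with a hypothetical events table and index name:

```sql
-- Anti-pattern: the function call hides created_at from a B-tree index
-- SELECT * FROM events WHERE EXTRACT(YEAR FROM created_at) = 2024;

-- Index the raw column, then rewrite the predicate as a range
-- so the planner can use an index scan
CREATE INDEX IF NOT EXISTS idx_events_created_at ON events (created_at);

SELECT *
FROM events
WHERE created_at >= '2024-01-01'
  AND created_at <  '2025-01-01';
```

The half-open range (>= and <) also avoids edge cases around the last instant of the year that BETWEEN would mishandle with timestamp columns.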
What they test:
Understanding of advanced join types and when they outperform window functions. LATERAL is often more efficient for top-N-per-group because it can use an index scan with LIMIT, while window functions must compute the window over all rows before filtering.
Approach:
Window function approach: ROW_NUMBER() OVER (PARTITION BY group ORDER BY value DESC) then filter to rn <= 3. This computes a row number for every row, then discards most of them. LATERAL approach: for each group, run a subquery with ORDER BY and LIMIT 3. If there is an index on (group, value DESC), each LATERAL execution is an index scan returning 3 rows. For large tables with many groups, LATERAL can be orders of magnitude faster because it never processes rows outside the top 3.
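For comparison with the LATERAL version shown earlier, a sketch of the window-function approach against the same hypothetical orders table:

```sql
-- Window-function version: ranks every order, then discards all but the top 3
WITH ranked AS (
    SELECT
        customer_id,
        order_id,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY amount DESC
        ) AS rn
    FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn <= 3;
-- An index on (customer_id, amount DESC) helps the LATERAL plan far more
-- than this one: here the window must still number every order first
```

Being able to write both versions and say which one the planner will prefer at a given table size is exactly the signal this question is fishing for.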
Train the pattern recognition that fires before you start typing. That's the senior SQL skill.