SQL Skills Guide

Advanced SQL for Data Engineers

Most candidates think "advanced SQL" means memorizing obscure functions. Wrong. Advanced SQL means reading an EXPLAIN plan, knowing when RANK ties hurt you, and understanding why your LATERAL join is the only thing standing between a correlated subquery and a production outage. Interviewers don't test whether you've heard of window functions. They test whether you reach for ROW_NUMBER versus RANK without pausing, and whether you can explain the difference in one sentence.

Seven topics below, chosen because interviewers actually ask about them, not because they sound impressive in a blog post. Each one comes with the specific mistake candidates make and how to avoid it.

21%

PARTITION BY frequency

15%

ROW_NUMBER frequency

11%

RANK/DENSE_RANK frequency

8%

CTE frequency

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

7 Advanced SQL Topics

Each topic includes a concept explanation, a production-style code example, and a link to a deeper reference page.

Window Functions

Deep dive →

Window functions compute values across a set of rows related to the current row without collapsing the result set. Unlike GROUP BY, which produces one row per group, window functions keep every row and add a computed column. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, and NTILE are the most tested functions. The PARTITION BY clause defines the window groups. ORDER BY within the OVER clause defines the row ordering for ranking and running calculations.

-- Running total and rank within each department
SELECT
  employee_id,
  department,
  salary,
  SUM(salary) OVER (
    PARTITION BY department
    ORDER BY salary DESC
  ) AS running_total,
  RANK() OVER (
    PARTITION BY department
    ORDER BY salary DESC
  ) AS salary_rank
FROM employees;

Common Table Expressions (CTEs)

Deep dive →

CTEs define temporary named result sets that exist only for the duration of a single query. They replace subqueries and make complex SQL readable by breaking it into named steps. Each CTE is defined in the WITH clause and can reference CTEs defined before it. CTEs do not imply materialization: the optimizer may inline them, merge them, or execute them separately depending on the query plan.

-- Multi-step analysis using CTEs
WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY 1
),
revenue_with_growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month,
    revenue - LAG(revenue) OVER (ORDER BY month) AS growth
  FROM monthly_revenue
)
SELECT * FROM revenue_with_growth
WHERE growth < 0;  -- months with revenue decline

Recursive Queries

Deep dive →

Recursive CTEs query hierarchical or graph-structured data. The recursive CTE has two parts: the anchor member (base case) and the recursive member (the step that references the CTE itself). The database executes the anchor first, then repeatedly executes the recursive member until it produces no new rows. Common use cases: org charts, bill of materials, category trees, and shortest path calculations.

-- Walk an org chart: find all reports under a manager
WITH RECURSIVE org_tree AS (
  -- Anchor: start with the manager
  SELECT employee_id, name, manager_id, 1 AS depth
  FROM employees
  WHERE employee_id = 100

  UNION ALL

  -- Recursive: find direct reports of the current level
  SELECT e.employee_id, e.name, e.manager_id, t.depth + 1
  FROM employees e
  JOIN org_tree t ON e.manager_id = t.employee_id
  WHERE t.depth < 10  -- safety limit
)
SELECT * FROM org_tree ORDER BY depth, name;

Lateral Joins

Deep dive →

LATERAL allows a subquery in the FROM clause to reference columns from preceding tables. Without LATERAL, each subquery in FROM is independent. With LATERAL, the subquery runs once per row of the preceding table, similar to a correlated subquery but in the FROM clause. This is the most efficient way to get the 'top N per group' pattern without window functions.

-- Top 3 orders per customer using LATERAL
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_id, amount, order_date
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY o.amount DESC
  LIMIT 3
) top_orders;

-- The LATERAL subquery references c.customer_id
-- It runs once per customer, returning up to 3 rows
-- Customers with no orders are excluded (use LEFT JOIN LATERAL to keep them)

Query Optimization

Deep dive →

Writing correct SQL is step one. Writing fast SQL is step two. Query optimization requires understanding indexes (B-tree, hash, GIN, GiST), execution plans (EXPLAIN ANALYZE), join algorithms (nested loop, hash join, merge join), and common anti-patterns (SELECT *, functions on indexed columns, implicit type casts, correlated subqueries). Data engineers must optimize queries that run against tables with billions of rows.

-- Read the execution plan
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.signup_date >= '2024-01-01'
GROUP BY c.name
HAVING COUNT(o.order_id) > 5;

-- Look for: Seq Scan (missing index?), Hash Join vs Nested Loop,
-- actual rows vs estimated rows (statistics stale?),
-- execution time per node

Set Operations

Deep dive →

UNION, INTERSECT, and EXCEPT combine or compare result sets. UNION stacks rows vertically. INTERSECT returns rows that appear in both result sets. EXCEPT returns rows from the first set that do not appear in the second. Each has an ALL variant that skips deduplication. These operations follow set theory and are tested in interviews as an alternative to JOIN-based solutions.

-- Customers in both the US and EU databases (intersection)
SELECT customer_id, email FROM us_customers
INTERSECT
SELECT customer_id, email FROM eu_customers;

-- Customers in US but not in EU (set difference)
SELECT customer_id, email FROM us_customers
EXCEPT
SELECT customer_id, email FROM eu_customers;

-- Both operations deduplicate by default
-- Use INTERSECT ALL / EXCEPT ALL to preserve duplicates

Dynamic SQL and Prepared Statements

Deep dive →

Dynamic SQL constructs and executes SQL strings at runtime. It is used for queries where the table name, column list, or filter conditions are not known until execution time. In PostgreSQL, EXECUTE with format() builds safe dynamic queries. In application code, parameterized queries (prepared statements) prevent SQL injection while allowing dynamic values. Data engineers use dynamic SQL for metadata-driven pipelines, automated schema migrations, and generic data quality checks.

-- PostgreSQL: dynamic SQL in a function
CREATE OR REPLACE FUNCTION count_rows(table_name TEXT)
RETURNS BIGINT AS $$
DECLARE
  row_count BIGINT;
BEGIN
  EXECUTE format('SELECT COUNT(*) FROM %I', table_name)
  INTO row_count;
  RETURN row_count;
END;
$$ LANGUAGE plpgsql;

-- %I safely quotes the identifier (prevents SQL injection)
-- Use %L for literal values: format('...WHERE name = %L', val)

4 Advanced SQL Interview Questions

Candidates who have memorized concepts fail these. Candidates who understand the failure modes pass. Each question below names the specific trap interviewers are setting and the one-sentence answer that proves you've seen it before.

Q1: What is the difference between ROW_NUMBER, RANK, and DENSE_RANK? When would you use each?

What they test:

Window function fundamentals. All three assign numbers to rows within a partition. ROW_NUMBER gives unique sequential numbers (1, 2, 3) even for ties. RANK gives the same number to ties but skips values (1, 1, 3). DENSE_RANK gives the same number to ties without skipping (1, 1, 2). The interviewer wants examples of when each is appropriate.

Approach:

ROW_NUMBER: when you need exactly one row per group (deduplication, top-1 per group). RANK: when you need competition-style ranking where ties skip positions (Olympic medals). DENSE_RANK: when you need consecutive rank numbers without gaps (top-N where N means unique rank positions, not unique rows). Give the concrete output for tied values: salaries of 100k, 100k, 90k produce ROW_NUMBER (1,2,3), RANK (1,1,3), DENSE_RANK (1,1,2).

Q2: Explain what a recursive CTE is and give a real-world use case.

What they test:

Whether you have worked with hierarchical data. Org charts, category trees, bill of materials, and graph traversal all require recursive queries. The interviewer checks that you understand the anchor/recursive structure and the termination condition.

Approach:

A recursive CTE has two parts joined by UNION ALL. The anchor member is the starting point (e.g., the root node). The recursive member joins the CTE to itself to find the next level. The database repeats the recursive member until no new rows are produced. Real-world example: find all subordinates under a manager in an org chart. Mention the safety limit (WHERE depth < N) to prevent infinite recursion when the data has cycles.

Q3: A query that worked fine on a small dataset is now taking 10 minutes on production data. Walk through your debugging process.

What they test:

Systematic query optimization skills. The interviewer wants to hear a structured approach, not random guessing. Start with the execution plan, identify the bottleneck, and apply targeted fixes.

Approach:

Step 1: Run EXPLAIN ANALYZE to see the actual execution plan and row estimates. Step 2: Look for sequential scans on large tables (missing index on the filter/join column). Step 3: Check if estimated rows are wildly different from actual rows (stale statistics, run ANALYZE). Step 4: Look at the join algorithm: nested loop on large tables is a red flag. Step 5: Check for functions on indexed columns (WHERE YEAR(date_col) = 2024 prevents index use; rewrite as WHERE date_col >= '2024-01-01'). Step 6: Consider materialized CTEs if the same subquery is evaluated multiple times.

Q4: When would you use a LATERAL join instead of a window function?

What they test:

Understanding of advanced join types and when they outperform window functions. LATERAL is often more efficient for top-N-per-group because it can use an index scan with LIMIT, while window functions must compute the window over all rows before filtering.

Approach:

Window function approach: ROW_NUMBER() OVER (PARTITION BY group ORDER BY value DESC) then filter to rn <= 3. This computes a row number for every row, then discards most of them. LATERAL approach: for each group, run a subquery with ORDER BY and LIMIT 3. If there is an index on (group, value DESC), each LATERAL execution is an index scan returning 3 rows. For large tables with many groups, LATERAL can be orders of magnitude faster because it never processes rows outside the top 3.

Advanced SQL FAQ

What counts as advanced SQL?+
Advanced SQL goes beyond basic SELECT, JOIN, GROUP BY, and WHERE. It includes window functions (ROW_NUMBER, RANK, LAG, LEAD, running totals), CTEs and recursive CTEs, lateral joins, query optimization (EXPLAIN ANALYZE, index design), set operations (INTERSECT, EXCEPT), advanced aggregation (GROUPING SETS, CUBE, ROLLUP), and dynamic SQL. Data engineering interviews test these topics at the senior level.
How important is advanced SQL for data engineering interviews?+
Very important. Most data engineering interviews include at least one question on window functions and one on CTEs. Senior-level interviews also test query optimization, recursive queries, and the ability to choose between different approaches (window function vs LATERAL join, CTE vs subquery). Companies like Meta, Amazon, and Snowflake all test advanced SQL skills in their data engineering interview loops.
Should I learn advanced SQL before or after learning Python for data engineering?+
SQL first. Most data engineering work involves writing and optimizing SQL queries. Python is used for orchestration, API integration, and custom transformations that SQL cannot handle. In interviews, SQL questions appear at every company and every level. Python questions are more common at senior levels. Start with advanced SQL, then add Python as a complement.
What is the best way to practice advanced SQL?+
Practice with problems that require window functions, CTEs, and multi-step logic, not just basic SELECT queries. Write queries against real-world schemas (orders, customers, products) with enough rows to see performance differences. Read execution plans (EXPLAIN ANALYZE) for every query you write. Timed practice matters because interviews have time pressure.

The Trap Candidates Miss. You Won't.

Train the pattern recognition that fires before you start typing. That's the senior SQL skill.