The senior-and-up subset of the data engineer SQL catalog. Recursive CTEs for hierarchies and graphs. Gap-and-island for streak detection. Sessionization with LAG and SUM OVER. SCD2 half-open joins. EXPLAIN plan reading. Skew handling with salt-and-rebalance. Idempotent MERGE patterns for late-arriving data.

Advanced SQL in 2026 data engineer interviews means the seven patterns that mid-level loops mostly skip and senior-plus loops live in. Recursive CTEs for org-chart and graph traversal use an anchor (WHERE id equals root) and a recursive clause (JOIN cte ON cte.id equals child.parent_id) with a depth column carried forward to bound recursion at 20 levels. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production, not just in the interview. Gap-and-island for streak detection uses the trick of date minus ROW_NUMBER OVER (PARTITION BY user ORDER BY date), multiplied by INTERVAL 1 day; runs of consecutive dates produce a constant difference, so GROUP BY user and streak_key with HAVING COUNT greater-than-or-equal-to N detects streaks of length N or more. Sessionization with explicit gap thresholds uses LAG to get the previous event timestamp, CASE WHEN gap exceeds 30 minutes THEN 1 ELSE 0 AS new_session, SUM OVER ORDER BY ts to accumulate the session_id.

SCD2 half-open joins use effective_from less-than-or-equal-to event_time AND (effective_to IS NULL OR event_time less-than effective_to). The half-open interval prevents boundary doubling at the changeover microsecond when a dimension row's effective_to equals the next row's effective_from. The closed-interval mistake doubles facts; the open-interval mistake drops them. EXPLAIN plan reading at L5 and above asks the data engineer to identify sequential scan versus index seek, explain why a function in WHERE prevents predicate pushdown (WHERE DATE(timestamp) equals '2026-05-27' blocks the partition pruner; rewrite as WHERE timestamp greater-than-or-equal-to '2026-05-27' AND timestamp less-than '2026-05-28'), identify partition pruning and column-store implications, and explain hash join versus sort-merge join versus nested loop selection.

Skew handling for hot keys is a senior data engineer signal. Identify hot keys with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt: append a mod-N suffix to the hot key on both sides (CONCAT(key, '_', user_id mod 8)), join on the salted key, aggregate by salted key, then strip the salt and re-aggregate. Trade-off is N-fold replication of the small side plus an extra aggregation pass versus the original one-task-doing-all-work bottleneck. In Spark this is AQE skew-join optimization done automatically; in SQL warehouses (Snowflake, BigQuery, Redshift) the data engineer does it manually or via clustering / micro-partition decisions.

Idempotent MERGE for late-arriving reconciliation is the L5-plus design-and-code question. MERGE INTO daily_revenue d USING (SELECT DATE(event_at) AS revenue_date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY 1) src ON d.revenue_date equals src.revenue_date WHEN MATCHED THEN UPDATE SET revenue equals d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT VALUES (src.revenue_date, src.revenue). The principle is ADD, not REPLACE. Late events for an old date correct yesterday's total instead of overwriting it. Replacing is the silent-bug failure mode interviewers fish for.

Advanced SQL Interview Questions

Senior and staff data engineer SQL interview problems including recursive CTEs, skew handling, and EXPLAIN reading.

Common questions

What is the recursive CTE pattern for an org chart?
Anchor: SELECT id, name, 0 AS depth FROM employees WHERE id equals root_id. Recursive: SELECT e.id, e.name, cte.depth plus 1 FROM employees e JOIN cte ON e.manager_id equals cte.id WHERE cte.depth less-than 20. The depth column bounds the recursion and lets a data engineer answer the natural follow-up about org depth. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production.
How does the gap-and-island pattern work?
For each user, take the activity date and subtract ROW_NUMBER OVER (PARTITION BY user ORDER BY date) multiplied by INTERVAL 1 day. Runs of consecutive dates produce a constant difference; runs separated by gaps produce different differences. GROUP BY (user, date minus rn times interval) and HAVING COUNT(*) greater-than-or-equal-to N detects streaks of length N or more. The trick generalizes to consecutive integers (date minus rn produces a constant) and to detecting groups by any monotonic sequence.
What is the SCD2 half-open join and why does it matter?
Joining a fact at event_time to a SCD Type 2 dimension uses ON dim.entity_id equals fact.entity_id AND dim.effective_from less-than-or-equal-to fact.event_time AND (dim.effective_to IS NULL OR fact.event_time less-than dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the exact changeover microsecond. The closed-interval mistake doubles facts at the boundary; the open-interval mistake drops facts at the boundary.
How does a data engineer handle skew in a SQL JOIN?
Identify the hot key with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt both sides: append a mod-N suffix to the hot key (CONCAT(key, '_', user_id mod 8)), join, aggregate by the salted key, then unsalt and re-aggregate. Trade-off is extra shuffle cost (N-fold replication of the small side) versus balanced executor workload. In Spark this is AQE skew-join automatic; in SQL warehouses the data engineer does it manually.
When does a function in WHERE prevent predicate pushdown?
Whenever the function wraps the indexed column. WHERE DATE(event_ts) equals '2026-05-27' prevents the partition pruner from using the event_ts index because the optimizer cannot reason about DATE() in reverse. The fix is to invert: WHERE event_ts greater-than-or-equal-to '2026-05-27' AND event_ts less-than '2026-05-28'. Same applies to UPPER(name), CAST(id AS string), and any user-defined function on an indexed column.
What is an idempotent MERGE for late-arriving data?
MERGE INTO daily_revenue d USING (SELECT DATE(event_at) AS date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY date) src ON d.date equals src.date WHEN MATCHED THEN UPDATE SET revenue equals d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT VALUES (src.date, src.revenue). The key is ADD, not REPLACE. Late events for an old date correct the existing total instead of overwriting it. Data engineer interviewers test whether you spot this; replacing is a silent bug.
What is QUALIFY and which engines support it?
QUALIFY filters the result of a window function the way HAVING filters the result of a GROUP BY. Native in Snowflake, BigQuery, and Teradata. Not in Postgres, MySQL, or SQL Server. On those engines, wrap the window in a CTE and filter in the outer SELECT. QUALIFY is a data engineer fluency signal in Snowflake and BigQuery interviews; mention it when you reach for the CTE-then-filter pattern.