What is the recursive CTE pattern for an org chart?

Anchor: SELECT id, name, 0 AS depth FROM employees WHERE id equals root_id. Recursive: SELECT e.id, e.name, cte.depth plus 1 FROM employees e JOIN cte ON e.manager_id equals cte.id WHERE cte.depth less-than 20. The depth column bounds the recursion and lets a data engineer answer the natural follow-up about org depth. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production.

How does the gap-and-island pattern work?

For each user, take the activity date and subtract ROW_NUMBER OVER (PARTITION BY user ORDER BY date) multiplied by INTERVAL 1 day. Runs of consecutive dates produce a constant difference; runs separated by gaps produce different differences. GROUP BY (user, date minus rn times interval) and HAVING COUNT(*) greater-than-or-equal-to N detects streaks of length N or more. The trick generalizes to consecutive integers (date minus rn produces a constant) and to detecting groups by any monotonic sequence.

What is the SCD2 half-open join and why does it matter?

Joining a fact at event_time to a SCD Type 2 dimension uses ON dim.entity_id equals fact.entity_id AND dim.effective_from less-than-or-equal-to fact.event_time AND (dim.effective_to IS NULL OR fact.event_time less-than dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the exact changeover microsecond. The closed-interval mistake doubles facts at the boundary; the open-interval mistake drops facts at the boundary.

How does a data engineer handle skew in a SQL JOIN?

Identify the hot key with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt both sides: append a mod-N suffix to the hot key (CONCAT(key, '_', user_id mod 8)), join, aggregate by the salted key, then unsalt and re-aggregate. Trade-off is extra shuffle cost (N-fold replication of the small side) versus balanced executor workload. In Spark this is AQE skew-join automatic; in SQL warehouses the data engineer does it manually.

When does a function in WHERE prevent predicate pushdown?

Whenever the function wraps the indexed column. WHERE DATE(event_ts) equals '2026-05-27' prevents the partition pruner from using the event_ts index because the optimizer cannot reason about DATE() in reverse. The fix is to invert: WHERE event_ts greater-than-or-equal-to '2026-05-27' AND event_ts less-than '2026-05-28'. Same applies to UPPER(name), CAST(id AS string), and any user-defined function on an indexed column.

What is QUALIFY and which engines support it?

QUALIFY filters the result of a window function the way HAVING filters the result of a GROUP BY. Native in Snowflake, BigQuery, and Teradata. Not in Postgres, MySQL, or SQL Server. On those engines, wrap the window in a CTE and filter in the outer SELECT. QUALIFY is a data engineer fluency signal in Snowflake and BigQuery interviews; mention it when you reach for the CTE-then-filter pattern.

Advanced SQL Interview Questions for Senior DE Roles

Advanced SQL Interview Questions

Senior and staff data engineer SQL interview problems including recursive CTEs, skew handling, and EXPLAIN reading.

The senior-and-up subset of the data engineer SQL catalog. Recursive CTEs for hierarchies and graphs. Gap-and-island for streak detection. Sessionization with LAG and SUM OVER. SCD2 half-open joins. EXPLAIN plan reading. Skew handling with salt-and-rebalance. Idempotent MERGE patterns for late-arriving data.

Advanced SQL in 2026 data engineer interviews means the seven patterns that mid-level loops mostly skip and senior-plus loops live in. Recursive CTEs for org-chart and graph traversal use an anchor (WHERE id equals root) and a recursive clause (JOIN cte ON cte.id equals child.parent_id) with a depth column carried forward to bound recursion at 20 levels. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production, not just in the interview. Gap-and-island for streak detection uses the trick of date minus ROW_NUMBER OVER (PARTITION BY user ORDER BY date), multiplied by INTERVAL 1 day; runs of consecutive dates produce a constant difference, so GROUP BY user and streak_key with HAVING COUNT greater-than-or-equal-to N detects streaks of length N or more. Sessionization with explicit gap thresholds uses LAG to get the previous event timestamp, CASE WHEN gap exceeds 30 minutes THEN 1 ELSE 0 AS new_session, SUM OVER ORDER BY ts to accumulate the session_id.

SCD2 half-open joins use effective_from less-than-or-equal-to event_time AND (effective_to IS NULL OR event_time less-than effective_to). The half-open interval prevents boundary doubling at the changeover microsecond when a dimension row's effective_to equals the next row's effective_from. The closed-interval mistake doubles facts; the open-interval mistake drops them. EXPLAIN plan reading at L5 and above asks the data engineer to identify sequential scan versus index seek, explain why a function in WHERE prevents predicate pushdown (WHERE DATE(timestamp) equals '2026-05-27' blocks the partition pruner; rewrite as WHERE timestamp greater-than-or-equal-to '2026-05-27' AND timestamp less-than '2026-05-28'), identify partition pruning and column-store implications, and explain hash join versus sort-merge join versus nested loop selection.

Skew handling for hot keys is a senior data engineer signal. Identify hot keys with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt: append a mod-N suffix to the hot key on both sides (CONCAT(key, '_', user_id mod 8)), join on the salted key, aggregate by salted key, then strip the salt and re-aggregate. Trade-off is N-fold replication of the small side plus an extra aggregation pass versus the original one-task-doing-all-work bottleneck. In Spark this is AQE skew-join optimization done automatically; in SQL warehouses (Snowflake, BigQuery, Redshift) the data engineer does it manually or via clustering / micro-partition decisions.

Idempotent MERGE for late-arriving reconciliation is the L5-plus design-and-code question. MERGE INTO daily_revenue d USING (SELECT DATE(event_at) AS revenue_date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY 1) src ON d.revenue_date equals src.revenue_date WHEN MATCHED THEN UPDATE SET revenue equals d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT VALUES (src.revenue_date, src.revenue). The principle is ADD, not REPLACE. Late events for an old date correct yesterday's total instead of overwriting it. Replacing is the silent-bug failure mode interviewers fish for.

Full SQL interview questions catalog - 931 problems including the full senior data engineer tier.
Window functions as foundation for advanced - Sessionization and gap-and-island build on window functions.
Senior data engineer interview problems across domains - Advanced SQL alongside senior modeling, design, and Python.
System design interview prep for senior data engineer - End-to-end pipeline design at senior scale with 3 failure modes per component.
ETL design with idempotency and replay - MERGE patterns for late-arriving data, schema evolution, idempotent backfills.
Stripe data engineer interview questions - Idempotent reconciliation, SCD2 half-open joins, EXPLAIN reading on financial data.
Netflix data engineer interview questions - Streaming sessionization, late-arriving data, Spark UI integration with SQL.

What is the recursive CTE pattern for an org chart?: Anchor: SELECT id, name, 0 AS depth FROM employees WHERE id equals root_id. Recursive: SELECT e.id, e.name, cte.depth plus 1 FROM employees e JOIN cte ON e.manager_id equals cte.id WHERE cte.depth less-than 20. The depth column bounds the recursion and lets a data engineer answer the natural follow-up about org depth. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production.
How does the gap-and-island pattern work?: For each user, take the activity date and subtract ROW_NUMBER OVER (PARTITION BY user ORDER BY date) multiplied by INTERVAL 1 day. Runs of consecutive dates produce a constant difference; runs separated by gaps produce different differences. GROUP BY (user, date minus rn times interval) and HAVING COUNT(*) greater-than-or-equal-to N detects streaks of length N or more. The trick generalizes to consecutive integers (date minus rn produces a constant) and to detecting groups by any monotonic sequence.
What is the SCD2 half-open join and why does it matter?: Joining a fact at event_time to a SCD Type 2 dimension uses ON dim.entity_id equals fact.entity_id AND dim.effective_from less-than-or-equal-to fact.event_time AND (dim.effective_to IS NULL OR fact.event_time less-than dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the exact changeover microsecond. The closed-interval mistake doubles facts at the boundary; the open-interval mistake drops facts at the boundary.
How does a data engineer handle skew in a SQL JOIN?: Identify the hot key with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt both sides: append a mod-N suffix to the hot key (CONCAT(key, '_', user_id mod 8)), join, aggregate by the salted key, then unsalt and re-aggregate. Trade-off is extra shuffle cost (N-fold replication of the small side) versus balanced executor workload. In Spark this is AQE skew-join automatic; in SQL warehouses the data engineer does it manually.
When does a function in WHERE prevent predicate pushdown?: Whenever the function wraps the indexed column. WHERE DATE(event_ts) equals '2026-05-27' prevents the partition pruner from using the event_ts index because the optimizer cannot reason about DATE() in reverse. The fix is to invert: WHERE event_ts greater-than-or-equal-to '2026-05-27' AND event_ts less-than '2026-05-28'. Same applies to UPPER(name), CAST(id AS string), and any user-defined function on an indexed column.
What is an idempotent MERGE for late-arriving data?: MERGE INTO daily_revenue d USING (SELECT DATE(event_at) AS date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY date) src ON d.date equals src.date WHEN MATCHED THEN UPDATE SET revenue equals d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT VALUES (src.date, src.revenue). The key is ADD, not REPLACE. Late events for an old date correct the existing total instead of overwriting it. Data engineer interviewers test whether you spot this; replacing is a silent bug.
What is QUALIFY and which engines support it?: QUALIFY filters the result of a window function the way HAVING filters the result of a GROUP BY. Native in Snowflake, BigQuery, and Teradata. Not in Postgres, MySQL, or SQL Server. On those engines, wrap the window in a CTE and filter in the outer SELECT. QUALIFY is a data engineer fluency signal in Snowflake and BigQuery interviews; mention it when you reach for the CTE-then-filter pattern.

133 practice problems matching this filter. Difficulty: medium (46), hard (69), easy (18).

SQL (133)

7-Check Rolling Average - medium - Seven entries hold the trend.
7-Day Onboarding Conversion - hard - Signed up Monday. Still here by Sunday?
7-Day Token Retention - medium - Premium tokens, day by day.
Above Average Interactions - easy - The average user is boring. Who is above?
Above Category Average - easy - The category average is one thing. These beat it.
Active User Penetration Rate - hard - How much of the user base is actually alive?
Adopters Before Migration - hard - They used the old feature. Did they ever touch the new one?
The Vote Tally - hard - One product, every day it sold, and the money it brought in.
Alert Severity - hard - When the alarms go off, who screams loudest?
App Stability by Region - medium - Some regions crash more than others.
Best Day for Ad Revenue - medium - Some days of the month convert far better than others.
The Budget Line - easy - Some rows are over. Some are under. Label every one.
Cache Efficiency - hard - Some edges run hot. Others coast on the global average.
Campaign Bookend Engagement - hard - First impression versus last. The gap.
The Notification That Paid Off - hard - The message went out to thousands. A smaller number actually bit.
Campaign Conversion Window - hard - A narrow window between impression and action.
Losing Altitude - hard - Two months apart, the same campaigns. Find the ones that slipped.
Category Deep Dive - hard - Revenue, units, rank. The full category report card.
Cheapest CDN Route - easy - The cheapest path across regions.
Commit Cadence - hard - Some repos go quiet for too long.
Consecutive Cost Growth Periods - hard - Five straight months of spending increases.
The Blind Spot - medium - Pages they haven't discovered yet.
Cost Efficiency Variance - hard - Cost efficiency varies. By how much?
Deploy Velocity Swings - medium - Month to month, who sped up and who stalled.
The Freshest Record - medium - Duplicates everywhere. Only the most recent version of the truth survives.
The Clean Aisle Numbers - medium - Clear the noise. What did each category actually earn?
Department Running Totals - medium - Compute cumulative metric values within each department using window operations.
Deploy Velocity - hard - Days between deploys. Some services ship fast, others crawl.
Early Commit Velocity by Author - medium - How productive was each author during the first year of a repo's CI pipeline
The Roster - easy - Every department, counted head by head.
Fault Lines - medium - Errors by day and region. Some areas are worse than they appear.
Experiment Impact - hard - Which experiments moved the needle? Settle the standings inside every variant.
Extreme Category Totals - medium - The highest and the lowest. Both are interesting.
Fastest Completion Per Day - medium - Every day has a speed champion.
Fastest Page View to Click - hard - How fast from view to click?
Feature Flag Engagement Impact - hard - Flags on versus flags off. The engagement gap.
Feature Flag Fan vs Detractor Pairs - hard - Some users love the flag. Others want it gone.
First Interaction Credit - hard - Attribute transactions to earliest touchpoint
Chain of Command - hard - The tree runs deep. Trace every branch back to the top.
The Company You Keep - medium - Only experiments with a crowd count. Find the Friday sessions they drove.
The Weight of a Click - hard - Where users actually spend their attention.
Ghost Products - medium - Listed but never sold. The shelves collect dust.
Heavy Ad Exposure - medium - Saturated with ads. Is it too much?
Idle Team Members - easy - Sprint started. Some people never got assigned.
Intra-Region Latency Diff - hard - Same region. Different latency.
The Full Picture - easy - Two tables know different things about the same people. Combine them.
Keep Most Recent Record - medium - Carbon copies clutter the table. Only the latest matters.
Largest Group - easy - One group towers above the rest.
Latest Commit Build Cost - medium - The latest commit came with a build cost.
Longest Gap Between Token Events - medium - The longest gap between token events.
Longest Uptime Streak - hard - Pass, pass, pass. How long until fail?
Longest Visit Streaks - hard - Day after day after day. Who kept coming back?
Long Messages - medium - Some commit messages tell a novel.
Market Share - hard - Every category wants a bigger slice.
Max Value Per Location - easy - Every location has a peak.
Median Cloud Cost by Service - hard - The median cloud bill, service by service.
Median Failure Rate by Table - hard - Half the tables fail more than this.
Below the Peaks - hard - The biggest bills catch every eye. The overspend hiding just beneath them does not.
Median Model Accuracy - hard - The median accuracy. Not the mean.
Metric Range Per Group - easy - The spread within each group.
The Comfortable Middle - hard - Above average but not extreme.
Minimum Parallel Workers - hard - Too few workers and it stalls.
Then and Now - hard - Accuracy used to be higher.
Monthly Category Totals - easy - Sum amounts by category and month.
Monthly Running Total - medium - Cumulative sales per product across months.
Who Stayed - hard - A footprint either holds or it slips away. Find how much stays.
The Three-Way Report - medium - Three tables. One summary. Every piece depends on the others.
The Fault Lines - medium - Some account groups hit far more errors than others. Surface the ones that break most.
New Customers Per Day - medium - Count users whose first order falls on each date.
Normalization Tradeoffs in Practice - hard - Clean data or fast queries? You can't always have both.
Nth Highest Salary Per Department - medium - Third place in every department.
The Upper Rungs - easy - The top rungs set the ceiling for everyone below.
Peak Activity by Device - easy - Activity windows, device by device.
All at Once - hard - How many tokens were alive at the same time?
Pipeline Duration vs Throughput - hard - Does throughput correlate with duration?
Pipeline Throughput Ratio - easy - Compute current-to-initial value ratio per period.
The Shape of a User - medium - Every user leaves a trail of events. The report needs them lined up side by side.
Price Pairs - hard - Same shelf, wildly different stickers. Spot the pricing gaps.
Quarters Apart - hard - Latency trending up or down? The quarters have the answer.
Rapid Retry Detection - medium - Detect retried API calls within 5 minutes of failure.
Recent Price Drops - medium - The price just dropped. Who noticed?
The Subscription Ghost - medium - Some charges come back to haunt the same card a month later.
Repeat Purchases Within a Week - medium - They bought again within seven days.
Response Buckets - medium - Fast, normal, or slow. Every API call gets a verdict.
Retried Failed API Calls - medium - Spot users who retry API calls within 5 minutes of a failure.
Rolling Revenue Average - hard - Smooth out the revenue bumps. The trend matters more.
Spending Velocity - medium - Fraud hides in the pace, not the price. Watch each customer's last seven purchases roll forward.
Honeymoon Phase - medium - Every signup class starts hot. Which ones spend the year they arrive?
Same First and Last Reply Target - medium - They started and ended the month messaging the same person.
Second Purchase - hard - The first buy is curiosity. The second is commitment.
Back From the Brink - hard - Roll it back, then nail the next one.
Service Uptime Minutes - medium - Status changed. How long was it actually up?
Service Uptime Turnaround - hard - It was down. Then it came back. Stronger.
Session Page View Distance - hard - Page view distance per session.
Shared Channel Contacts - hard - User networks mapped through messages.
The Conversion Story - medium - Signups are one thing. Paid purchases are another. Find the gap by source.
Slow Batch Jobs - easy - Promised by noon. Delivered at midnight.
The Address That Changed - hard - Addresses change. History must not be erased.
Smooth Latency - medium - Noisy latency readings, smoothed into a trend you can trust.
Spend and Rank - hard - Five thrones at the top of the spending leaderboard.
Spending Range - hard - Between the smallest purchase and the biggest lies the story.
Streak Status Changes - hard - Detect value changes across consecutive rows
The Holdouts - medium - Subscribed. But never upgraded.
Successful Call Volume per Endpoint - medium - Not every ping is honest.
Team Cost Allocation Comparison - hard - Individual spend versus team average.
The Cannibalization Report - hard - The new product launched. The old one suffered.
The Final Sale - medium - Every product has a last sale. When was it?
Two Sides of the Ledger - hard - Two cost tables, one region. Which way does the balance tip?
The Session Stitcher - hard - Page views without sessions are just noise.
Top Average By Region - easy - Region by region, who pulls the best average?
Top Campaign by User Revenue - medium - Which campaign made each user spend the most?
Top Commit Authors by Repo - hard - Three authors per repo. The top committers.
Top CPU Pods per Namespace - hard - The two most CPU-hungry pods in each namespace.
Top Endpoint by Power Users - hard - Power users have a favorite endpoint.
Top Flagged Campaign Resolutions - hard - Flagged the most. Resolved how?
Top Lessons Each Month - medium - Rank items within time periods and keep top 3
Competing Standards - hard - Every framework has a star model.
Top Percentile API Tokens - hard - The most suspicious tokens.
Top Percentile Spenders - medium - Top 1% of users by total spend via percentile bucketing.
Top Recent Sellers - easy - Fresh data, top sellers. The recent leaderboard.
Top Regions by High CPU Nodes - hard - Five regions with the hottest CPUs.
Top Selling Items - easy - Revenue crowns the winners. Who sold the most?
The Heavy Hitters - medium - Within each cloud, two services rise above the rest.
Only Here - hard - Exclusive to one source. Missing from the other.
Unique Hostnames per Region - medium - How many distinct machines live in each region?
Age of Discovery - hard - Every generation searches differently. See who finds what they came for.
User 360 - hard - One row per user. Everything they did, or didn't do.
User Campaign Overlap Percentage - hard - How much ad overlap between users?
Six Degrees - hard - Every reply ties two names together. Find whose web reaches the furthest.
Tiers of Want - hard - In each category, some carts weigh more than others. Sort the shoppers by what they spend.
Weekly Build Status Report - hard - Every CI run, bucketed by week.
Weighted Variant Selection - hard - Select a row using cumulative weight probabilities.
YoY Signup Growth Rate - hard - This year versus last year. Growing or shrinking?