Advanced SQL Interview Questions
Advanced SQL Interview Questions
Senior and staff data engineer SQL interview problems including recursive CTEs, skew handling, and EXPLAIN reading.
The senior-and-up subset of the data engineer SQL catalog. Recursive CTEs for hierarchies and graphs. Gap-and-island for streak detection. Sessionization with LAG and SUM OVER. SCD2 half-open joins. EXPLAIN plan reading. Skew handling with salt-and-rebalance. Idempotent MERGE patterns for late-arriving data.
Advanced SQL in 2026 data engineer interviews means the seven patterns that mid-level loops mostly skip and senior-plus loops live in. Recursive CTEs for org-chart and graph traversal use an anchor (WHERE id equals root) and a recursive clause (JOIN cte ON cte.id equals child.parent_id) with a depth column carried forward to bound recursion at 20 levels. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production, not just in the interview. Gap-and-island for streak detection uses the trick of date minus ROW_NUMBER OVER (PARTITION BY user ORDER BY date), multiplied by INTERVAL 1 day; runs of consecutive dates produce a constant difference, so GROUP BY user and streak_key with HAVING COUNT greater-than-or-equal-to N detects streaks of length N or more. Sessionization with explicit gap thresholds uses LAG to get the previous event timestamp, CASE WHEN gap exceeds 30 minutes THEN 1 ELSE 0 AS new_session, SUM OVER ORDER BY ts to accumulate the session_id.
SCD2 half-open joins use effective_from less-than-or-equal-to event_time AND (effective_to IS NULL OR event_time less-than effective_to). The half-open interval prevents boundary doubling at the changeover microsecond when a dimension row's effective_to equals the next row's effective_from. The closed-interval mistake doubles facts; the open-interval mistake drops them. EXPLAIN plan reading at L5 and above asks the data engineer to identify sequential scan versus index seek, explain why a function in WHERE prevents predicate pushdown (WHERE DATE(timestamp) equals '2026-05-27' blocks the partition pruner; rewrite as WHERE timestamp greater-than-or-equal-to '2026-05-27' AND timestamp less-than '2026-05-28'), identify partition pruning and column-store implications, and explain hash join versus sort-merge join versus nested loop selection.
Skew handling for hot keys is a senior data engineer signal. Identify hot keys with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt: append a mod-N suffix to the hot key on both sides (CONCAT(key, '_', user_id mod 8)), join on the salted key, aggregate by salted key, then strip the salt and re-aggregate. Trade-off is N-fold replication of the small side plus an extra aggregation pass versus the original one-task-doing-all-work bottleneck. In Spark this is AQE skew-join optimization done automatically; in SQL warehouses (Snowflake, BigQuery, Redshift) the data engineer does it manually or via clustering / micro-partition decisions.
Idempotent MERGE for late-arriving reconciliation is the L5-plus design-and-code question. MERGE INTO daily_revenue d USING (SELECT DATE(event_at) AS revenue_date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY 1) src ON d.revenue_date equals src.revenue_date WHEN MATCHED THEN UPDATE SET revenue equals d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT VALUES (src.revenue_date, src.revenue). The principle is ADD, not REPLACE. Late events for an old date correct yesterday's total instead of overwriting it. Replacing is the silent-bug failure mode interviewers fish for.
- What is the recursive CTE pattern for an org chart?
- Anchor: SELECT id, name, 0 AS depth FROM employees WHERE id equals root_id. Recursive: SELECT e.id, e.name, cte.depth plus 1 FROM employees e JOIN cte ON e.manager_id equals cte.id WHERE cte.depth less-than 20. The depth column bounds the recursion and lets a data engineer answer the natural follow-up about org depth. Real HR data sometimes has cycles from bad data entry; the depth guard matters in production.
- How does the gap-and-island pattern work?
- For each user, take the activity date and subtract ROW_NUMBER OVER (PARTITION BY user ORDER BY date) multiplied by INTERVAL 1 day. Runs of consecutive dates produce a constant difference; runs separated by gaps produce different differences. GROUP BY (user, date minus rn times interval) and HAVING COUNT(*) greater-than-or-equal-to N detects streaks of length N or more. The trick generalizes to consecutive integers (date minus rn produces a constant) and to detecting groups by any monotonic sequence.
- What is the SCD2 half-open join and why does it matter?
- Joining a fact at event_time to a SCD Type 2 dimension uses ON dim.entity_id equals fact.entity_id AND dim.effective_from less-than-or-equal-to fact.event_time AND (dim.effective_to IS NULL OR fact.event_time less-than dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the exact changeover microsecond. The closed-interval mistake doubles facts at the boundary; the open-interval mistake drops facts at the boundary.
- How does a data engineer handle skew in a SQL JOIN?
- Identify the hot key with SELECT join_key, COUNT(*) FROM table GROUP BY join_key ORDER BY 2 DESC LIMIT 20. If the top key has 10 times the median count, salt both sides: append a mod-N suffix to the hot key (CONCAT(key, '_', user_id mod 8)), join, aggregate by the salted key, then unsalt and re-aggregate. Trade-off is extra shuffle cost (N-fold replication of the small side) versus balanced executor workload. In Spark this is AQE skew-join automatic; in SQL warehouses the data engineer does it manually.
- When does a function in WHERE prevent predicate pushdown?
- Whenever the function wraps the indexed column. WHERE DATE(event_ts) equals '2026-05-27' prevents the partition pruner from using the event_ts index because the optimizer cannot reason about DATE() in reverse. The fix is to invert: WHERE event_ts greater-than-or-equal-to '2026-05-27' AND event_ts less-than '2026-05-28'. Same applies to UPPER(name), CAST(id AS string), and any user-defined function on an indexed column.
- What is an idempotent MERGE for late-arriving data?
- MERGE INTO daily_revenue d USING (SELECT DATE(event_at) AS date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY date) src ON d.date equals src.date WHEN MATCHED THEN UPDATE SET revenue equals d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT VALUES (src.date, src.revenue). The key is ADD, not REPLACE. Late events for an old date correct the existing total instead of overwriting it. Data engineer interviewers test whether you spot this; replacing is a silent bug.
- What is QUALIFY and which engines support it?
- QUALIFY filters the result of a window function the way HAVING filters the result of a GROUP BY. Native in Snowflake, BigQuery, and Teradata. Not in Postgres, MySQL, or SQL Server. On those engines, wrap the window in a CTE and filter in the outer SELECT. QUALIFY is a data engineer fluency signal in Snowflake and BigQuery interviews; mention it when you reach for the CTE-then-filter pattern.
133 practice problems matching this filter. Difficulty: medium (46), hard (69), easy (18).
SQL (133)
- 7-Check Rolling Average - medium - Seven entries hold the trend.
- 7-Day Onboarding Conversion - hard - Signed up Monday. Still here by Sunday?
- 7-Day Token Retention - medium - Premium tokens, day by day.
- Above Average Interactions - easy - The average user is boring. Who is above?
- Above Category Average - easy - The category average is one thing. These beat it.
- Active User Penetration Rate - hard - How much of the user base is actually alive?
- Adopters Before Migration - hard - They used the old feature. Did they ever touch the new one?
- The Vote Tally - hard - Net revenue, day by day, for one product in one region.
- Alert Severity - hard - When the alarms go off, who screams loudest?
- App Stability by Region - medium - Some regions crash more than others.
- Best Day for Ad Revenue - medium - One day of the month outperforms the rest.
- The Budget Line - easy - Some rows are over. Some are under. Label every one.
- Cache Efficiency - hard - Some edges run hot. Others coast on the global average.
- Campaign Bookend Engagement - hard - First impression versus last. The gap.
- The Notification That Paid Off - hard - The message went out to thousands. A smaller number actually bit.
- Campaign Conversion Window - hard - A narrow window between impression and action.
- Campaign Engagement Rank Shift - hard - Two months, many countries. Who moved up? Who fell?
- Category Deep Dive - hard - Revenue, units, rank. The full category report card.
- Cheapest CDN Route - easy - The cheapest path across regions.
- Commit Cadence - hard - Some repos go quiet for too long.
- Consecutive Cost Growth Periods - hard - Five straight months of spending increases.
- Content Recommendation Engine - medium - Pages they haven't discovered yet.
- Cost Efficiency Variance - hard - Cost efficiency varies. By how much?
- Deploy Velocity Swings - medium - Month to month, who sped up and who stalled.
- The Freshest Record - medium - Duplicates everywhere. Only the most recent version of the truth survives.
- The Clean Aisle Numbers - medium - Clear the noise. What did each category actually earn?
- Department Running Totals - medium - Compute cumulative metric values within each department using window operations.
- Deploy Velocity - hard - Days between deploys. Some services ship fast, others crawl.
- Early Commit Velocity by Author - medium - How productive was each author during the first year of a repo's CI pipeline
- Employees Per Department - easy - Headcount, location by location.
- Fault Lines - medium - Errors by day and region. Some areas are worse than they appear.
- Experiment Impact - hard - Which experiments moved the needle? Settle the standings inside every variant.
- Extreme Category Totals - medium - The highest and the lowest. Both are interesting.
- Fastest Completion Per Day - medium - Every day has a speed champion.
- Fastest Page View to Click - hard - How fast from view to click?
- Feature Flag Engagement Impact - hard - Flags on versus flags off. The engagement gap.
- Feature Flag Fan vs Detractor Pairs - hard - Some users love the flag. Others want it gone.
- First Interaction Credit - hard - Attribute transactions to earliest touchpoint
- Flatten Org Chart Hierarchy - hard - The tree runs deep. Walk every branch to the root.
- Friday Sessions for Shared Experiments - medium - Friday vibes only. Same experiment, different users.
- Engagement Depth by Event - hard - Where users actually spend their attention.
- Ghost Products - medium - Listed but never sold. The shelves collect dust.
- Heavy Ad Exposure - medium - Saturated with ads. Is it too much?
- Idle Team Members - easy - Sprint started. Some people never got assigned.
- Intra-Region Latency Diff - hard - Same region. Different latency.
- The Full Picture - easy - Two tables know different things about the same people. Combine them.
- Keep Most Recent Record - medium - Carbon copies clutter the table. Only the latest matters.
- Largest Group - easy - One group towers above the rest.
- Latest Commit Build Cost - medium - The latest commit came with a build cost.
- Longest Gap Between Token Events - medium - The longest gap between token events.
- Longest Uptime Streak - hard - Pass, pass, pass. How long until fail?
- Longest Visit Streaks - hard - Day after day after day. Who kept coming back?
- Long Messages - medium - Some commit messages tell a novel.
- Market Share - hard - Every category wants a bigger slice.
- Max Value Per Location - easy - Every location has a peak.
- Median Cloud Cost by Service - hard - The median cloud bill, service by service.
- Median Failure Rate by Table - hard - Half the tables fail more than this.
- Median Household Earnings - hard - Household earnings. The median reveals the middle.
- Median Model Accuracy - hard - The median accuracy. Not the mean.
- Metric Range Per Group - easy - The spread within each group.
- Mid-Range Team Spenders - hard - Above average but not extreme.
- Minimum Parallel Workers - hard - Too few workers and it stalls.
- Model Accuracy Drift - hard - Accuracy used to be higher.
- Monthly Category Totals - easy - Sum amounts by category and month.
- Monthly Running Total - medium - Cumulative sales per product across months.
- Monthly Service Retention - hard - Users came back. Or they did not.
- The Three-Way Report - medium - Three tables. One summary. Every piece depends on the others.
- Negative Outcome Rate for New Users - medium - New users have a rough first two weeks.
- New Customers Per Day - medium - Count users whose first order falls on each date.
- Normalization Tradeoffs in Practice - hard - Clean data or fast queries? You can't always have both.
- Nth Highest Salary Per Department - medium - Third place in every department.
- Nth Largest Value - easy - Select the row with a specific rank position.
- Peak Activity by Device - easy - Activity windows, device by device.
- Peak Concurrent Tokens - hard - How many tokens were alive at the same time?
- Pipeline Duration vs Throughput - hard - Does throughput correlate with duration?
- Pipeline Throughput Ratio - easy - Compute current-to-initial value ratio per period.
- The Event Breakdown - medium - Events are piling up by type. The report needs them side by side.
- Price Pairs - hard - Same shelf, wildly different stickers. Spot the pricing gaps.
- Quarter-over-Quarter Latency Trend - hard - Latency trending up or down? The quarters have the answer.
- Rapid Retry Detection - medium - Detect retried API calls within 5 minutes of failure.
- Recent Price Drops - medium - The price just dropped. Who noticed?
- The Subscription Ghost - medium - Some charges come back to haunt the same card a month later.
- Repeat Purchases Within a Week - medium - They bought again within seven days.
- Response Buckets - medium - Fast, normal, or slow. Every API call gets a verdict.
- Retried Failed API Calls - medium - Spot users who retry API calls within 5 minutes of a failure.
- Rolling Revenue Average - hard - Smooth out the revenue bumps. The trend matters more.
- Rolling Weekly Total - medium - Seven days at a time, the totals keep rolling forward.
- Honeymoon Phase - medium - How many wallets stay loyal the same year they say "I do"?
- Same First and Last Reply Target - medium - They started and ended the month messaging the same person.
- Second Purchase - hard - The first buy is curiosity. The second is commitment.
- Back From the Brink - hard - Roll it back, then nail the next one.
- Service Uptime Minutes - medium - Status changed. How long was it actually up?
- Service Uptime Turnaround - hard - It was down. Then it came back. Stronger.
- Session Page View Distance - hard - Page view distance per session.
- Shared Channel Contacts - hard - User networks mapped through messages.
- The Conversion Story - medium - Signups are one thing. Paid purchases are another. Find the gap by source.
- Slow Batch Jobs - easy - Promised by noon. Delivered at midnight.
- The Address That Changed - hard - Addresses change. History must not be erased.
- Smooth Latency - medium - Noisy latency readings, smoothed into a trend you can trust.
- Spend and Rank - hard - Five thrones at the top of the spending leaderboard.
- Spending Range - hard - Between the smallest purchase and the biggest lies the story.
- Streak Status Changes - hard - Detect value changes across consecutive rows
- Subscribers Without Premium - medium - Subscribed. But never upgraded.
- Successful Call Volume per Endpoint - medium - Not every ping is honest.
- Team Cost Allocation Comparison - hard - Individual spend versus team average.
- The Cannibalization Report - hard - The new product launched. The old one suffered.
- The Latest Transaction Per Product - medium - Every product has a last sale. When was it?
- The Regional Cost Reconciliation - hard - Two cost tables, one region. Reconcile the running balance.
- The Session Stitcher - hard - Page views without sessions are just noise.
- Top Average By Region - easy - Region by region, who pulls the best average?
- Top Campaign by User Revenue - medium - Which campaign made each user spend the most?
- Top Commit Authors by Repo - hard - Three authors per repo. The top committers.
- Top CPU Pods per Namespace - hard - The two most CPU-hungry pods in each namespace.
- Top Endpoint by Power Users - hard - Power users have a favorite endpoint.
- Top Flagged Campaign Resolutions - hard - Flagged the most. Resolved how?
- Top Lessons Each Month - medium - Rank items within time periods and keep top 3
- Top Models by Framework - hard - Every framework has a star model.
- Top Percentile API Tokens - hard - The most suspicious tokens.
- Top Percentile Spenders - medium - Top 1% of users by total spend via percentile bucketing.
- Top Recent Sellers - easy - Fresh data, top sellers. The recent leaderboard.
- Top Regions by High CPU Nodes - hard - Five regions with the hottest CPUs.
- Top Selling Items - easy - Revenue crowns the winners. Who sold the most?
- The Heavy Hitters - medium - Within each cloud, two services rise above the rest.
- Transaction-Only Features - hard - Exclusive to one source. Missing from the other.
- Unique Hostnames per Region - medium - How many distinct machines live in each region?
- Upvote Percentage by Age Cohort - hard - New users versus existing. The upvote gap.
- User 360 - hard - One row per user. Everything they did, or didn't do.
- User Campaign Overlap Percentage - hard - How much ad overlap between users?
- Six Degrees - hard - Every reply ties two names together. Find whose web reaches the furthest.
- User Spend Segmentation by Category - hard - Users segmented by spending behavior.
- Weekly Build Status Report - hard - Every CI run, bucketed by week.
- Weighted Variant Selection - hard - Select a row using cumulative weight probabilities.
- YoY Signup Growth Rate - hard - This year versus last year. Growing or shrinking?