SQL Is Now the #1 DE Interview Filter in 2026

SQL jumped from 61% to 79% of DE job postings in one year. Engineers prepping Spark are failing Amazon and Uber loops on window functions. Here is the real 2026 stack.

DataDriven Field Notes
9 min read | By DataDriven Editorial
What this post covers
  1. What SQL Questions Actually Look Like Now: Real deduplication, running total, and CTE prompts from live loops
  2. Window Functions as the Real Screener: Why window functions specifically eliminate candidates under live pressure
  3. The Actual 2026 DE Skill Stack Order: SQL and Python ranked ahead of Spark, dbt, Kafka in live postings
  4. Why Spark Certification Alone Gets You Cut: Distributed systems prep masking fatal SQL gaps at screen stage
  5. Amazon and Uber Rejections: The SQL Pattern: What FAANG rejection postmortems reveal about window function gaps
  6. SQL's Surge from #5 to #4 in Job Postings: SQL frequency in DE postings jumped 61% to 79% in one year
  7. How to Fix Three Years of SQL Neglect Fast: Targeted SQL reactivation plan for Spark-focused engineers interviewing now

I watched a guy with 11 years of distributed systems experience bomb an Amazon loop last year. He'd architected streaming pipelines processing billions of events. His Databricks certification was still warm. The interviewer put a table on the screen and asked him to deduplicate user sessions with a running total. He froze. Couldn't write ROW_NUMBER() OVER(PARTITION BY ...) under pressure. Forty-five minutes of silence and partial queries. Screen over. The reality of the 2026 data engineer SQL interview hit him like a truck: SQL is the filter, and he'd been prepping the wrong thing for three years.

This isn't an isolated story. It's a pattern.

SQL Hit 79% of DE Job Postings. Spark Didn't.

According to 365 Data Science's 2026 job market report, SQL appeared in 79.4% of data engineer postings in 2025, making it the single most demanded technical skill above Python (70%). Apache Spark? 38.7%. That's not even close.

The 2026 numbers show SQL stabilizing around 69%, roughly tied with Python. Some analysts attribute the dip to job descriptions folding SQL into Python requirements (SQLAlchemy, pandas). But here's what matters: even at the lower figure, SQL is explicitly required in roughly seven out of ten DE job postings. Spark is in fewer than two out of five. The 2026 data engineering skills hierarchy isn't ambiguous.

Snowflake sits at 29.2% of postings. Databricks at 16.8%. Both are SQL-first platforms. The Hadoop-era stack that made Spark the center of gravity is dying. SQL-based warehouses ate its lunch. And the interview process followed.

Everyone talks about Spark. SQL still runs the data world.

Meanwhile, Scala and Hadoop skills are declining as teams shift from distributed-processing frameworks to SQL-on-warehouse architectures. The Spark interview questions that dominated 2022 prep guides now surface in round two or three, if they surface at all. SQL is round one. Fail round one, rounds two and three don't exist.

What SQL Questions Actually Look Like in 2026 Loops

Forget SELECT * FROM users WHERE active = 1. That's not what's killing people.

Nearly 70% of Amazon SQL interview questions require JOINs, CTEs, or subqueries. Uber's 2025-2026 interview feedback is explicit: "3 medium-level SQL questions heavily focused on window functions (PARTITION OVER/LEAD/LAG)." Gaps-and-islands patterns appear in 40%+ of hard SQL questions across the documented pool of 80 recurring screening problems.

Here's what a typical deduplication question looks like. You have an events table with duplicates from retry logic. Keep the most recent event per user:

-- Deduplicate events: keep latest per user_id
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER(
      PARTITION BY user_id
      ORDER BY event_timestamp DESC, event_id DESC
    ) AS rn
  FROM raw_events
)
SELECT user_id, event_type, event_timestamp
FROM ranked
WHERE rn = 1;

That ORDER BY event_timestamp DESC, event_id DESC is the deterministic tiebreaker. Candidates who write ORDER BY event_timestamp DESC alone just introduced nondeterminism on ties. Interviewers catch this. It's the difference between someone who thinks about production data (where ties happen constantly) and someone who practiced on toy datasets.
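To watch the tiebreaker do its job, here's a minimal sketch that runs the same query through Python's built-in sqlite3 module. The sample rows are invented for illustration, and SQLite (3.25 or newer, for window function support) is only a stand-in for whatever engine the interview uses.

```python
import sqlite3

# Hypothetical sample data: user 1 has two events with the same timestamp (a retry).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (
        event_id INTEGER,
        user_id INTEGER,
        event_type TEXT,
        event_timestamp TEXT
    );
    INSERT INTO raw_events VALUES
        (1, 1, 'click',    '2026-01-01 10:00:00'),
        (2, 1, 'purchase', '2026-01-01 10:00:00'),
        (3, 2, 'click',    '2026-01-01 09:00:00');
""")

DEDUP_SQL = """
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY event_timestamp DESC, event_id DESC
               ) AS rn
        FROM raw_events
    )
    SELECT user_id, event_id, event_type
    FROM ranked
    WHERE rn = 1
    ORDER BY user_id
"""

# With event_id as the tiebreaker, user 1 always resolves to event 2.
latest_events = conn.execute(DEDUP_SQL).fetchall()
print(latest_events)
```

Drop event_id DESC from the ORDER BY and the engine is free to pick either of user 1's rows, which is exactly the nondeterminism interviewers probe for.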

Then comes the follow-up: "Now add a 7-day running average of events per user." This is where candidates who haven't drilled window function practice problems start sweating:

-- 7-day trailing average of daily events per user
SELECT
  user_id,
  event_date,
  daily_count,
  AVG(daily_count) OVER(
    PARTITION BY user_id
    ORDER BY event_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7d
FROM daily_user_events;

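The secret here is the frame specification. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you exactly 7 rows including the current one. That equals 7 days including today only if daily_user_events has one row per user per calendar day with no gaps. If a user can skip days, fill the gaps with a date spine first, or use a RANGE frame with a 6-day offset on engines that support it. Get it wrong, your numbers shift by a day or a week. Candidates know the syntax; they don't know the semantics. That's the gap.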
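Here's a small sketch of how much the frame choice matters once the data has gaps. It assumes SQLite 3.28 or newer (for numeric RANGE offsets) and made-up rows; the trick of ordering by an integer day number is a SQLite workaround, not something the original query uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_user_events (
        user_id INTEGER,
        event_date TEXT,
        daily_count INTEGER
    );
    -- Hypothetical data: three active days, then a long quiet stretch.
    INSERT INTO daily_user_events VALUES
        (1, '2026-01-01', 10),
        (1, '2026-01-02', 20),
        (1, '2026-01-03', 30),
        (1, '2026-01-20', 40);
""")

FRAME_SQL = """
    SELECT
        event_date,
        -- Counts back 6 ROWS, however old those rows are.
        AVG(daily_count) OVER (
            PARTITION BY user_id
            ORDER BY event_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS avg_7_rows,
        -- Counts back 6 DAYS by ordering on an integer day number.
        AVG(daily_count) OVER (
            PARTITION BY user_id
            ORDER BY CAST(julianday(event_date) AS INTEGER)
            RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS avg_7_days
    FROM daily_user_events
    ORDER BY event_date
"""

frame_rows = conn.execute(FRAME_SQL).fetchall()
for row in frame_rows:
    print(row)
```

On the January 20 row the ROWS frame averages in activity from almost three weeks earlier, while the RANGE frame sees only that day. Note that the RANGE version still ignores missing days rather than counting them as zero, so a true per-calendar-day average needs a date spine.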

CTEs Are Not Optional

Hiring managers flag nested subqueries as a red flag. If you can't use CTEs, your code becomes an unreadable mess, and the interviewer is reading your code in real time, deciding whether they want to review your PRs for the next two years. CTE fluency separates tier-2 from tier-1 candidates under pressure. It's not about correctness; it's about whether the person across the table can follow your logic at 9 AM on a Tuesday.

Window Functions: The Real DE Interview Screener

Window functions appear in roughly 80% of data engineering technical screens. They're cited consistently as "the dividing line between junior and intermediate SQL users." This is the reality of window functions in data engineering interviews: if you can't write them reflexively, you're done before the system design round starts.

Here's why they work as a filter. Window functions test three skills simultaneously:

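  • Query execution order. Window functions evaluate AFTER WHERE, GROUP BY, and HAVING, but BEFORE DISTINCT and final ORDER BY. That means you cannot filter on a window function in WHERE at the same query level. WHERE RANK() OVER (...) <= 3 is rejected by every major SQL engine, and candidates write it "far more than any other syntax mistake." Compute the rank in a CTE or subquery and filter outside it, or use QUALIFY on engines that offer it (Snowflake, BigQuery, and Databricks do).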
  • Pattern recall under pressure. Running totals, LAG/LEAD chaining, gaps-and-islands. There's no "try harder" path; you either know the frame syntax or you freeze.
  • Production thinking. ROW_NUMBER, RANK, and DENSE_RANK differ only in tie handling (1,2,3 vs. 1,1,3 vs. 1,1,2). Over 50% of candidates can't articulate the difference under time pressure. In production deduplication, reaching for RANK() when you need ROW_NUMBER() silently keeps duplicates whenever two rows tie on the ORDER BY columns.
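The first and third points are easy to check yourself. This sketch, using Python's built-in sqlite3 and an invented scores table, shows the three ranking functions side by side on a tie, then confirms that filtering on a window function in WHERE is rejected while the CTE version works. The exact error text varies by engine, so the code only checks that an error is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (player TEXT, score INTEGER);
    -- Hypothetical data: ann and bob tie for first place.
    INSERT INTO scores VALUES ('ann', 100), ('bob', 100), ('cy', 90);
""")

# Same data, three ranking functions: only tie handling differs.
rank_rows = conn.execute("""
    SELECT player,
           ROW_NUMBER() OVER (ORDER BY score DESC, player) AS row_num,
           RANK()       OVER (ORDER BY score DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)         AS dense_rnk
    FROM scores
    ORDER BY score DESC, player
""").fetchall()

# Filtering on a window function in WHERE fails, because WHERE runs first.
try:
    conn.execute(
        "SELECT player FROM scores "
        "WHERE RANK() OVER (ORDER BY score DESC) <= 2"
    )
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

# The fix: compute the rank in a CTE, filter one level up.
top_two = conn.execute("""
    WITH ranked AS (
        SELECT player, RANK() OVER (ORDER BY score DESC) AS rnk
        FROM scores
    )
    SELECT player FROM ranked WHERE rnk <= 2 ORDER BY player
""").fetchall()

print(rank_rows, where_rejected, top_two)
```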

A single deceptively simple SQL question about WHERE vs. HAVING fails 70% of candidates. These are senior engineers with a decade of experience. They stumble on fundamental execution order, which is a prerequisite for understanding when window functions evaluate. If you don't know that WHERE runs before GROUP BY, you have no chance of understanding why your window function gives wrong results.
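If the WHERE vs. HAVING distinction feels fuzzy, a tiny experiment settles it. The sketch below uses Python's sqlite3 with a made-up orders table: WHERE drops individual rows before grouping, HAVING drops whole groups after aggregation, and the totals differ as a result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    -- Hypothetical data for illustration only.
    INSERT INTO orders VALUES
        ('ann', 50), ('ann', 500),
        ('bob', 300), ('bob', 300),
        ('cy', 80);
""")

# WHERE filters rows BEFORE grouping: ann's 50 order never reaches SUM.
where_result = conn.execute("""
    SELECT customer, SUM(amount)
    FROM orders
    WHERE amount > 100
    GROUP BY customer
    ORDER BY customer
""").fetchall()

# HAVING filters groups AFTER aggregation: ann keeps her full 550 total.
having_result = conn.execute("""
    SELECT customer, SUM(amount)
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
""").fetchall()

print(where_result)
print(having_result)
```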

Running total patterns appear in approximately 70% of advanced SQL interviews. This is the bread-and-butter pattern most candidates learned years ago and abandoned during the Spark wave. Cumulative sums, session detection, anomaly windows. All window functions. All tested live.
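For reference, the running total pattern itself is short. This is a minimal sketch on an invented payments table, run through Python's sqlite3. The explicit ROWS frame is a deliberate choice: with an ORDER BY and no frame, the default is a RANGE frame, which lumps rows that tie on the ordering column into the same cumulative value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (user_id INTEGER, paid_on TEXT, amount INTEGER);
    -- Hypothetical data for illustration only.
    INSERT INTO payments VALUES
        (1, '2026-01-01', 10),
        (1, '2026-01-02', 5),
        (1, '2026-01-03', 20),
        (2, '2026-01-01', 7);
""")

# Cumulative sum per user, restarting at each new user_id.
running_totals = conn.execute("""
    SELECT
        user_id,
        paid_on,
        SUM(amount) OVER (
            PARTITION BY user_id
            ORDER BY paid_on
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM payments
    ORDER BY user_id, paid_on
""").fetchall()

print(running_totals)
```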

Why Spark Certification Alone Gets You Cut

The Databricks Certified Data Engineer Associate exam weights Spark SQL + Python at only 29% combined, with 24% dedicated to platform-specific features like Delta Lake, Auto Loader, and Unity Catalog. The cert teaches tool APIs, not core SQL fundamentals like recursive CTEs or partition-aware aggregation.

In May 2026, Databricks refreshed the Associate exam to move away from abstract Spark concepts toward hands-on lakehouse operations. That's the certification body itself acknowledging the shift. When even Databricks says "less Spark theory, more practical SQL," the signal is loud.

The SQL-versus-Spark hiring mismatch is structural. A decade of Spark optimization teaches systems thinking, but it does not translate to writing a correct deduplication query with deterministic tie-breaking under live observation. The two measure different skills entirely. I've been on hiring panels where a candidate gave an incredible system design walkthrough, then couldn't write a basic LAG() to compute month-over-month change. We had to pass.
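For the record, the month-over-month query that candidate missed looks roughly like this. It's a sketch on a hypothetical monthly_revenue table, run with Python's sqlite3; the real prompt's schema isn't something I can reproduce here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_revenue (month TEXT, revenue INTEGER);
    -- Hypothetical data for illustration only.
    INSERT INTO monthly_revenue VALUES
        ('2026-01', 100),
        ('2026-02', 150),
        ('2026-03', 120);
""")

# LAG pulls the previous row's revenue; ORDER BY defines what "previous" means.
mom_changes = conn.execute("""
    SELECT
        month,
        revenue,
        revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
    FROM monthly_revenue
    ORDER BY month
""").fetchall()

print(mom_changes)
```

The first month has no prior row, so its change comes back as NULL (None in Python). Saying that out loud before the interviewer asks is the kind of edge-case awareness that gets scored.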

The screen puts SQL before system design. Amazon and Uber cut candidates at the SQL round, so they never reach the architecture stage. Your streaming pipeline on Spark doesn't matter if you can't write ROW_NUMBER() OVER(PARTITION BY event_id ORDER BY timestamp DESC) without hesitation. Check the Amazon DE interview guide if you want the full breakdown of what each round covers.

You can learn Spark after getting the job, during your first month on the team. Start with SQL and Python instead.

Amazon and Uber Rejections: The Pattern

Amazon's 3-6 round interview loop includes dedicated SQL assessments testing joins, CTEs, window functions, query optimization, and Redshift-specific tuning (sort keys, dist keys). One documented 2025 rejection involved finding the top 5 users by activity in a 30-day window excluding weekends. That's a straightforward ROW_NUMBER/RANK problem. The candidate failed it.
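One way to approach that prompt is sketched below with Python's sqlite3. Everything about the schema is an assumption: the activity table, its columns, the fixed as-of date, and the top-2 cutoff (shrunk from 5 to fit the tiny sample) are all hypothetical. The weekday test uses SQLite's strftime('%w'), where 0 is Sunday and 6 is Saturday; other engines spell this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (user_id INTEGER, activity_date TEXT);
    -- Hypothetical data. In March 2026 the 7th and 14th are Saturdays
    -- and the 15th is a Sunday.
    INSERT INTO activity VALUES
        (1, '2026-03-02'), (1, '2026-03-03'), (1, '2026-03-07'),
        (2, '2026-03-10'), (2, '2026-03-11'), (2, '2026-03-12'),
        (3, '2026-03-14'), (3, '2026-03-15'), (3, '2026-03-16'),
        (4, '2026-02-20'), (4, '2026-03-30');
""")

TOP_USERS_SQL = """
    WITH weekday_counts AS (
        SELECT user_id, COUNT(*) AS weekday_events
        FROM activity
        WHERE activity_date BETWEEN date(:as_of, '-29 days') AND :as_of
          AND strftime('%w', activity_date) NOT IN ('0', '6')
        GROUP BY user_id
    ),
    ranked AS (
        SELECT user_id, weekday_events,
               ROW_NUMBER() OVER (
                   ORDER BY weekday_events DESC, user_id
               ) AS rn
        FROM weekday_counts
    )
    SELECT user_id, weekday_events
    FROM ranked
    WHERE rn <= :top_n
    ORDER BY rn
"""

top_users = conn.execute(
    TOP_USERS_SQL, {"as_of": "2026-03-31", "top_n": 2}
).fetchall()
print(top_users)
```

ROW_NUMBER with user_id as a tiebreaker returns exactly N rows. If the interviewer wants every user tied at the cutoff, swap in RANK and say why.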

Uber's data engineer assessments feature medium-level SQL on real event-driven datasets, including retention curve analysis: signup-date cohort analysis using window functions to compute day-1/day-7 return fractions. Interviewers explicitly score candidates on null handling, duplicate logic, and time-boundary edge cases.
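A retention question of that shape might be solved along these lines. This is a sketch under stated assumptions, not Uber's actual prompt: the events table is hypothetical, a user's first event is treated as the signup date, and "day-N return" means an event exactly N days after signup. It runs on Python's built-in sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_date TEXT);
    -- Hypothetical data for illustration only.
    INSERT INTO events VALUES
        (1, '2026-01-01'), (1, '2026-01-02'), (1, '2026-01-08'),
        (2, '2026-01-01'), (2, '2026-01-08'),
        (3, '2026-01-01'),
        (4, '2026-01-02'), (4, '2026-01-03');
""")

RETENTION_SQL = """
    WITH with_signup AS (
        -- Window function: stamp every event with the user's first date.
        SELECT user_id, event_date,
               MIN(event_date) OVER (PARTITION BY user_id) AS signup_date
        FROM events
    ),
    per_user AS (
        -- 1 if the user came back exactly 1 (or 7) days after signup.
        SELECT user_id, signup_date,
               MAX(event_date = date(signup_date, '+1 day'))  AS back_day1,
               MAX(event_date = date(signup_date, '+7 days')) AS back_day7
        FROM with_signup
        GROUP BY user_id, signup_date
    )
    SELECT signup_date,
           COUNT(*)       AS cohort_size,
           AVG(back_day1) AS day1_rate,
           AVG(back_day7) AS day7_rate
    FROM per_user
    GROUP BY signup_date
    ORDER BY signup_date
"""

retention = conn.execute(RETENTION_SQL).fetchall()
for row in retention:
    print(row)
```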

The common rejection patterns are remarkably consistent:

  • Omitting ORDER BY in LAG()/LEAD(), comparing the current row to an arbitrary row instead of the chronologically prior one
  • Confusing RANK with ROW_NUMBER on ties, silently keeping duplicates
  • Inability to explain nulls or edge-case cardinality when the interviewer asks "what happens if two records have the same timestamp?"
  • Writing queries that work on toy data but break under duplicates and late-arriving events

Uber explicitly grades whether you "state assumptions about data upfront." Before you write a single line, the interviewer wants to hear: "Is each trip_id unique per driver per day?" Candidates who skip this step write queries that silently multiply rows. That's not a SQL problem; that's a data modeling problem dressed in SQL clothes.
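Here's what silent row multiplication looks like in practice, as a minimal sqlite3 sketch. The trips and trip_ratings tables are hypothetical; the point is the uniqueness check you run (or at least say out loud) before trusting a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (trip_id INTEGER, driver_id INTEGER, fare REAL);
    CREATE TABLE trip_ratings (trip_id INTEGER, rating INTEGER);
    -- Hypothetical data: trip 1 was rated twice.
    INSERT INTO trips VALUES (1, 10, 20.0), (2, 10, 30.0);
    INSERT INTO trip_ratings VALUES (1, 5), (1, 4), (2, 5);
""")

# Step 1: test the assumption "one rating per trip" instead of trusting it.
duplicate_keys = conn.execute("""
    SELECT trip_id, COUNT(*) AS n
    FROM trip_ratings
    GROUP BY trip_id
    HAVING COUNT(*) > 1
""").fetchall()

# Step 2: see the damage. The join repeats trip 1, inflating the fare total.
joined_total = conn.execute("""
    SELECT SUM(t.fare)
    FROM trips AS t
    JOIN trip_ratings AS r ON r.trip_id = t.trip_id
""").fetchone()[0]

true_total = conn.execute("SELECT SUM(fare) FROM trips").fetchone()[0]

print(duplicate_keys, joined_total, true_total)
```

No error, no warning, just a revenue number that is 40% too high.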

The hiring competition makes this worse. Job seekers face 242 competitors per data engineer role with a 2-3% interview conversion rate. You get one shot at the SQL screen. If window functions aren't reflexive, you're joining the 97% who don't convert.

The Actual 2026 DE Skill Stack

Here's the 2026 data engineering skills hierarchy, based on job posting frequency:

Skill                   % of DE Postings   Interview Weight
Python                  70%                Round 1-2 (coding)
SQL                     69-79%             Round 1 (screening gate)
Apache Spark            38.7%              Round 2-3 (system design depth)
Snowflake               29.2%              Role-specific
Databricks              16.8%              Role-specific
AI/ML infrastructure    12%                Emerging (43% salary premium)

SQL and Python are the non-negotiables. Spark is supplementary. dbt has become standard for transformation work, ranked above Spark in many 2026 hiring ladders. The dbt interview questions are worth drilling if you're targeting analytics engineering-adjacent roles.

The contrarian insight here: Spark knowledge goes unused in the first round. SQL knowledge gets tested hard, because it reveals whether you think in sets or procedurally. A candidate who reaches for a loop instead of a window function just told the interviewer everything they need to know.

How to Fix Three Years of SQL Neglect in Four Weeks

If you spent 2022-2024 chasing Spark certifications and your window functions are rusty, here's the reactivation plan. Research suggests advanced SQL proficiency is achievable in 3-4 weeks with focused daily practice on real datasets.

Week 1: Foundations You Think You Know

GROUP BY appears in 32% of screening questions. INNER JOIN in 29%. These are the baseline. If you hesitate on WHERE vs. HAVING, or can't explain NULL behavior in LEFT JOINs, start here. Do 5 problems a day from the SQL interview question bank. Speak your reasoning out loud while writing. That habit pays off in live rounds.

Week 2: Window Functions Until They're Reflexive

ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD. Frame specifications: ROWS BETWEEN vs. RANGE BETWEEN. Write a deduplication query, a running total, and a gaps-and-islands solution every single day. By the end of the week, PARTITION BY and ORDER BY should flow without thinking.

Here's the gaps-and-islands pattern that shows up in 40%+ of hard SQL questions. Identify consecutive login streaks:

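-- Gaps and islands: find consecutive login streaks
-- Assumes one row per (user_id, login_date); deduplicate first if a user can log in twice a day
-- Date arithmetic below is PostgreSQL-style; other engines spell it differently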
WITH islands AS (
  SELECT
    user_id,
    login_date,
    login_date - INTERVAL '1 day' * ROW_NUMBER() OVER(
      PARTITION BY user_id
      ORDER BY login_date
    ) AS island_key
  FROM logins
)
SELECT
  user_id,
  MIN(login_date) AS streak_start,
  MAX(login_date) AS streak_end,
  COUNT(*) AS streak_length
FROM islands
GROUP BY user_id, island_key
ORDER BY streak_length DESC;

If that island_key trick doesn't make immediate sense, you need more reps. The insight: subtracting a sequential ROW_NUMBER from a date collapses consecutive days into the same key. Non-consecutive days produce different keys. It's elegant, it's unintuitive the first time, and interviewers love it.
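If you want to poke at the island_key trick, here's a runnable sketch using Python's sqlite3. It's an adaptation, not the query above verbatim: SQLite has no INTERVAL type, so the date becomes an integer day number via julianday, and I've added a DISTINCT step because the invented sample includes a duplicate login that would otherwise split a streak.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logins (user_id INTEGER, login_date TEXT);
    -- Hypothetical data: user 1 logs in twice on Jan 2 and skips Jan 4.
    INSERT INTO logins VALUES
        (1, '2026-01-01'), (1, '2026-01-02'), (1, '2026-01-02'),
        (1, '2026-01-03'), (1, '2026-01-05'),
        (2, '2026-01-10');
""")

STREAKS_SQL = """
    WITH distinct_logins AS (
        SELECT DISTINCT user_id, login_date FROM logins
    ),
    islands AS (
        SELECT
            user_id,
            login_date,
            -- Day number minus row number is constant within a streak.
            CAST(julianday(login_date) AS INTEGER)
                - ROW_NUMBER() OVER (
                      PARTITION BY user_id
                      ORDER BY login_date
                  ) AS island_key
        FROM distinct_logins
    )
    SELECT
        user_id,
        MIN(login_date) AS streak_start,
        MAX(login_date) AS streak_end,
        COUNT(*)        AS streak_length
    FROM islands
    GROUP BY user_id, island_key
    ORDER BY user_id, streak_start
"""

streaks = conn.execute(STREAKS_SQL).fetchall()
for row in streaks:
    print(row)
```

Remove the DISTINCT step and user 1's three-day streak breaks apart at the duplicated day, which makes a good follow-up to raise yourself.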

Week 3: CTEs, Self-Joins, Query Optimization

Write every query using CTEs. Practice recursive CTEs for hierarchical data. Drill self-joins for comparing rows within the same table (finding users whose spend increased month-over-month). Learn to read execution plans, at least enough to spot full table scans and missing indexes.
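Recursive CTEs are the least-practiced item on that list, so here's a minimal sketch for walking a reporting hierarchy. The employees table is hypothetical and the example runs on Python's sqlite3; some engines (SQL Server, for one) omit the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    -- Hypothetical org chart: ceo -> (vp -> eng), ops
    INSERT INTO employees VALUES
        (1, 'ceo', NULL),
        (2, 'vp', 1),
        (3, 'eng', 2),
        (4, 'ops', 1);
""")

ORG_SQL = """
    WITH RECURSIVE chain(id, name, depth) AS (
        -- Anchor: start at the root, the row with no manager.
        SELECT id, name, 0
        FROM employees
        WHERE manager_id IS NULL

        UNION ALL

        -- Recursive step: attach each employee to a manager already found.
        SELECT e.id, e.name, c.depth + 1
        FROM employees AS e
        JOIN chain AS c ON e.manager_id = c.id
    )
    SELECT name, depth
    FROM chain
    ORDER BY depth, name
"""

org_levels = conn.execute(ORG_SQL).fetchall()
print(org_levels)
```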

Week 4: Mock Interviews Under Pressure

The research is clear: most candidates don't fail because of SQL syntax. They fail because they can't connect everything under pressure and communicate their reasoning. Practice with a timer. State assumptions about the data before writing. Narrate your approach. Use the mock interview simulator if you want realistic pressure without burning a real loop.


The salary stakes are real. The gap between "can pass SQL screening" and "cannot" maps to roughly $40K annually at mid-level roles. Senior big data engineers with Spark expertise command $155K-$200K, but they have to get past the SQL screen first. A Spark cert with rusty window functions gets you rejected at step one of a process with a 2-3% conversion rate.

I've been through three waves of "the hot new thing will replace SQL." Still here. Still the screener. Still the skill that separates candidates who think in sets from candidates who think in loops. The tools change every 18 months. PARTITION BY has been the same for 20 years.

Stop grinding Spark API trivia. Open a SQL editor. Write a running total. Write a deduplication. Write a gaps-and-islands. Do it until it's boring. Then do it under a timer until it's fast. That's the SQL interview prep plan that actually matches what companies are testing in 2026.

Play the game, win the prize.
