DataDriven 75: 90 Days of Data Engineer Interview Data

After 90 days, 6,538 data engineers, and 412,887 SQL queries graded, here is what the DataDriven 75 reveals about data engineer interview prep.

DataDriven Field Notes
14 min read | By DataDriven Editorial
What this post actually says
  1. 6,538 data engineers worked through the DD75 in 90 days, submitting 412,887 SQL queries and 287,304 Python solutions to the grader.
  2. The hardest question on the list is a SQL filter problem with a 4.2% pass rate, sunk by NULL propagation. The easy-looking question is the one that kills loops.
  3. L6 questions get passed at 52% first-attempt while L3 questions pass at 31%. Harder-looking questions force candidates to slow down; easier ones make them skim.
  4. Data modeling pass rate decreases with seniority: junior 34%, mid 32%, senior 28%, staff+ 27%. The pattern catches senior specialists who shipped from memory instead of deriving from grain.
  5. The 11-minute cliff: candidates whose first grader submission lands within 11 minutes eventually pass 67% of the time. After 11 minutes, 8%. Talking to the grader fast is the strongest single predictor.

90 days, 6,538 data engineers, 412,887 SQL queries

The DataDriven 75 has been live for 90 days. 6,538 data engineers have worked through some or all of the list. Together they have submitted 412,887 SQL queries to the grader, run 287,304 Python solutions against the Docker sandbox, and reasoned through tens of thousands of data modeling and pipeline architecture problems.

What follows is what the data says about data engineer interview prep, the patterns that decide DE interview loops, and where the field is consistently weak.

6,538 data engineers
412,887 SQL queries graded
75 curated problems
35 interview patterns

How the DataDriven 75 was built

The DD75 is a curated list of data engineering interview problems, comparable in spirit to Blind 75 or LeetCode 75 but built specifically for data engineers. Each problem was nominated, defended, and voted on by a working group of Principal and Staff Data Engineers from seven companies (FAANGs, two large fintechs, and a hypergrowth startup). Problems that did not get supermajority support were cut. Problems that doubled up on a pattern already covered were cut. Problems that tested trivia were cut.

The list spans four domains:

SQL (25 problems): Window functions, gaps and islands, conditional aggregation, period-over-period, CTEs, NULL handling, deduplication.
Python (28 problems): Sliding window, prefix sum, frequency counting, heaps and top-K, merge intervals, generators, decorators, OOP.
Data Modeling (12 problems): Grain and fan traps, SCD strategy, bridge tables, fact table selection, conformed and role-playing dimensions.
Pipeline Architecture (10 problems): Idempotency, dead letter queues, delivery semantics, late data and watermarks, CDC and dual writes.

Every problem is tagged with the seniority of the interview it came from: L3 for new-grad and junior loops, L4 for mid-level, L5 for senior, L6 for staff and above.

The hardest DD75 question has a 4.2% pass rate

The hardest question is a SQL question called Long Messages. On the surface it is a five-minute filter against a single table. The trap is NULL propagation: data engineers write the obvious filter, forget that a comparison against NULL evaluates to unknown rather than false, so the row drops out of the filter and out of its negation alike, and submit a count that is quietly wrong.

The first-attempt pass rate on Long Messages is lower than every L6 pipeline question in the DD75, lower than the L6 tree traversal problems, and lower than The Customer Who Changed (the L6 SCD Type 2 round).

This is not a SQL trick question. It is the bug that is in production at most companies right now, and the DD75 data suggests 95.8% of data engineers cannot catch it under interview pressure. The query runs. The number looks plausible. The dashboard ships. Three weeks later somebody pulls the metric for a quarterly review and the on-call DE spends a Saturday figuring out why the numbers do not reconcile against the source system.

The SQL questions that decide loops are not the gnarly recursive CTE ones. They are the easy-looking ones with a NULL hiding in the schema.
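A minimal sketch of that failure mode, using Python's built-in sqlite3 module. The `messages` table, its columns, and the length threshold of 10 are hypothetical stand-ins; the real Long Messages schema is not reproduced in this post. The point is only that the "long" count and the naive "not long" count do not add up to the table total.

```python
import sqlite3

# Hypothetical schema and data: the real Long Messages table is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO messages (id, body) VALUES (?, ?)",
    [(1, "hi"), (2, "a much longer message body"), (3, None), (4, "ok")],
)


def count(where_clause: str) -> int:
    """Count rows in messages that satisfy the given WHERE clause."""
    sql = f"SELECT COUNT(*) FROM messages WHERE {where_clause}"
    return conn.execute(sql).fetchone()[0]


total = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
long_rows = count("LENGTH(body) > 10")

# The obvious complement. LENGTH(NULL) is NULL, NULL > 10 is unknown, and
# NOT unknown is still unknown, so the NULL row matches neither filter.
short_naive = count("NOT (LENGTH(body) > 10)")

# One NULL-aware alternative: treat a missing body as length 0.
# Whether that is the right business rule is itself an assumption.
short_null_safe = count("COALESCE(LENGTH(body), 0) <= 10")

print(total, long_rows, short_naive, short_null_safe)
```

Here `long_rows + short_naive` comes to 3 against a total of 4. The missing row is the NULL, and nothing in the query output flags it.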


The seniority inversion: harder questions get higher pass rates

Every data engineer predicts the same ranking before sitting down: L3 questions easiest, L6 questions hardest. The DD75 data inverts that prediction.

L3 (junior): 31% | L4 (mid): 38% | L5 (senior): 45% | L6 (staff+): 52%
First-attempt pass rate, by seniority of the interview the question came from

L6 questions are passed on the first attempt 52% of the time. L3 questions, 31%. The questions pulled from the most senior data engineering interview loops are getting passed more often than the questions pulled from the new-grad screens.

The reason is behavioral. L6 questions look intimidating, so data engineers slow down. They read the schema. They sketch on paper. They run something small before they run something real. L3 questions look easy, so data engineers skim, type, submit, and lose to a NULL. The senior loop is not asking harder questions; it is asking questions that force candidates to behave the way they should be behaving on every question.

The same dynamic is responsible for half the production outages every senior data engineer has ever paged on. The bug is never in the migration the team spent two weeks reviewing. It is in the one-line config change shipped on a Tuesday.

Data modeling pass rates get worse with seniority

The finding that broke the spreadsheet:

Data engineers self-report their seniority on signup: junior, mid, senior, or staff+. Group everyone by that bucket and compute first-attempt pass rate across the problems each engineer attempted. SQL improves with experience the way one would expect: juniors pass at around 38%, staff+ at 61%. Python improves: 44% to 67%. Pipeline architecture improves: 29% to 58%.

Data modeling does the opposite.

Junior: 34% | Mid: 32% | Senior: 28% | Staff+: 27%
Data modeling first-attempt pass rate, by self-reported seniority

A staff-level data engineer who attempts a DD75 modeling question is less likely to pass it on the first try than a junior who attempts the same kind of question. Seniority is anticorrelated with first-attempt success on data modeling questions, and the effect resists every control we have tried.

Best guess at why: junior DEs approach data modeling interview questions the way they were taught. State the grain. List the assumptions. Walk the join paths. Senior and staff DEs approach them the way they ship at work, by pattern-matching to a schema they shipped at a previous company, skipping the grain statement, defending the design from memory.

The data modeling interview is not asking what the candidate has shipped. It is asking whether they can derive a model from a fresh problem in 25 minutes. Staff DEs have not done that exercise since their last job change. Juniors do it every week.

A senior DE who hasn’t interviewed in three years and is worried about the system design round is worried about the wrong round. Worry about the modeling round.
DataDriven editorial, 2026

The Customer Who Changed: 4,118 attempts, 73 perfect runs

The Customer Who Changed is the SCD Type 2 problem in the DD75 and the most-attempted L6 problem on the list.

4,118 have attempted it
681 have passed it
73 passed cold, no hint
1.8% cold-pass rate

That is 1.8% of the 4,118 data engineers who actually sat down with the problem; the other 2,420 in the cohort have not opened it. When a data engineer attempts SCD Type 2 cold, fewer than two in a hundred land it.

The 73 are not who one would guess. They are not concentrated at FAANG. They are not the data engineers with graduate degrees. The seniority breakdown is closer to flat than to top-heavy: a healthy share of mid-level DEs, fewer staff+ than expected.

A theory the data cannot fully prove but every senior DE agrees with: SCD Type 2 is learned from a book once, and learned for real the morning after a backfill silently corrupts historical rows in production. The DEs who land it cold are mostly the ones who already lived through that incident, and the merge logic is etched into them in a way no textbook can match.

The DD75 version of that lesson is cheaper than the on-call version.
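For readers who have only met SCD Type 2 in a book, here is one minimal sketch of the merge, again using sqlite3. The `dim_customer` table, its columns, and the `apply_change` helper are hypothetical and are not the DD75 problem or its reference solution. The sketch assumes changes arrive in date order and that the tracked attribute is never NULL; late-arriving changes and NULL-safe comparison are exactly the parts it leaves out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical dimension: valid_from is inclusive, valid_to is exclusive,
# and valid_to IS NULL marks the open (current) version.
conn.execute(
    """
    CREATE TABLE dim_customer (
        customer_id INTEGER NOT NULL,
        city        TEXT    NOT NULL,
        valid_from  TEXT    NOT NULL,
        valid_to    TEXT,
        is_current  INTEGER NOT NULL
    )
    """
)


def apply_change(customer_id: int, city: str, as_of: str) -> None:
    """Apply one incoming attribute value as of a date, SCD Type 2 style."""
    with conn:  # one transaction: close-out and insert commit together
        current = conn.execute(
            "SELECT city FROM dim_customer "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,),
        ).fetchone()
        if current is not None and current[0] == city:
            return  # unchanged: replaying the same load adds no rows
        # Close the open version, if there is one, at the change date.
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (as_of, customer_id),
        )
        # Open the new version.
        conn.execute(
            "INSERT INTO dim_customer VALUES (?, ?, ?, NULL, 1)",
            (customer_id, city, as_of),
        )


apply_change(1, "Austin", "2024-01-01")
apply_change(1, "Denver", "2024-06-01")
apply_change(1, "Denver", "2024-06-01")  # replay of the same change

history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer "
    "WHERE customer_id = 1 ORDER BY valid_from"
).fetchall()
print(history)
```

The replayed change is the part worth noticing. A merge that inserts without first comparing against the current row would leave two Denver versions behind.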

The Gaps and Islands graveyard

Longest Visit Streaks is the L6 gaps-and-islands problem in the DD75. Median time to a passing submission: 34 minutes. Median number of submissions before passing: 9.

Nine submissions, on a 23-line solution. The grader is not a compiler; every one of those rejections is a deliberate run that the data engineer believed would pass. DEs try a CTE, then a window function, then LAG, then a self-join. Around submission seven somebody remembers the row-number-difference trick and the problem collapses in three more lines.

The trick is not derivable on the clock. A candidate has seen it or has not. Of the 2,103 DEs who have attempted Longest Visit Streaks, the median DE who passed had attempted at least one other gaps-and-islands problem in the previous 30 days. The median DE who failed had not.

At this level, data engineering interview questions test pattern recall more than problem-solving. That is how to prepare for them, not a complaint about them.
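The row-number-difference trick, sketched against a hypothetical `visits` table (the real Longest Visit Streaks schema is not shown in this post). Within a run of consecutive days, the day number and the row number both climb by one, so their difference stays constant and can serve as a group key. The example assumes a SQLite build with window functions (3.25 or later), which recent Python releases typically bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per visit, dates stored as ISO strings.
conn.execute("CREATE TABLE visits (user_id INTEGER, visit_date TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "2024-03-01"), (1, "2024-03-02"), (1, "2024-03-03"),
        (1, "2024-03-05"),
        (2, "2024-03-10"), (2, "2024-03-12"), (2, "2024-03-13"),
    ],
)

LONGEST_STREAK_SQL = """
WITH days AS (
    -- Collapse multiple visits on the same day to one row.
    SELECT DISTINCT user_id, visit_date FROM visits
),
numbered AS (
    SELECT
        user_id,
        visit_date,
        -- Day number minus row number is constant inside a consecutive run.
        CAST(julianday(visit_date) AS INTEGER)
          - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY visit_date)
          AS island
    FROM days
),
streaks AS (
    SELECT user_id, island, COUNT(*) AS streak_len
    FROM numbered
    GROUP BY user_id, island
)
SELECT user_id, MAX(streak_len) AS longest_streak
FROM streaks
GROUP BY user_id
ORDER BY user_id
"""

longest = conn.execute(LONGEST_STREAK_SQL).fetchall()
print(longest)
```

The `DISTINCT` step matters: two visits on the same day would otherwise get different row numbers and split one streak into two islands.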

The 11-minute cliff in grader telemetry

The most informative number in the dataset is not pass rate. It is the gap between when a data engineer opens a problem and when they hit the grader for the first time. Across every level and every domain, there is a single global cliff at the eleven-minute mark.

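[Chart: eventual pass rate (y-axis) by minute of first grader submission (x-axis, minutes 1 to 15), with the 11-minute cliff marked. Labeled values fall from 71-72% in the opening minutes to 8% at minute 15.]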
Eventual pass rate, split by time-to-first-grader-submission

DEs whose first grader submission lands within 11 minutes of opening the problem go on to pass it 67% of the time. DEs whose first submission lands later than that pass it 8% of the time.

Eleven minutes is the slow signal, not the fast one. DEs who pass use those opening minutes to read the schema and the sample data, then push a quick and wrong query so the grader can tell them what they missed. DEs who fail use those minutes trying to write a perfect query in their head, and by the time they hit submit they have committed to whichever bug their first read of the data baked in.

The interview equivalent is the candidate who refuses to talk until they have the whole answer.

Four shapes show up in the DE pass-rate data

Only about a third of the 6,538 have spent meaningful time across all four domains, so the analysis below restricts to that group. Plot their pass rates and four rough shapes appear. The shapes overlap at the edges and a long tail does not fit any of them, but anyone with time in the field will recognize all four.

The Analytics Engineer

~33% of cohort
Strong on SQL, around 60% first-attempt. Decent at Python. Loses pipeline architecture badly, often below 25%. They know windows, CTEs, and conditional aggregation cold. They live in dbt. They lose loops the moment a streaming question shows up.

The Platform Engineer

~20% of cohort
Strong on Python and pipeline, often above 50% on both. Data modeling pass rate well below the median, sometimes under 20%. They build the rails: Airflow, Spark, Kafka. They do not design the warehouses that ride on them. The modeling round is where their loops end.

The Generalist Data Engineer

~30% of cohort
Within a handful of points of the median in every domain. The most common shape, and the one most hiring managers are actually trying to hire. Solid SQL, comfortable Python, can sketch a star schema, knows what idempotency means.

The Senior IC

<10% of cohort
Above 50% in all four domains. Not common. This is the person who passes the data engineer interview loop that includes a system design round.

The remainder are the DEs below the median in every domain. About one in ten. Not all of them are juniors. A meaningful chunk self-report as senior or staff. Those are mid-career DEs who built their careers at one company on one stack and have not generalized, and the breadth that data engineering interview loops test catches them out.

Six years into a DE role, four years out of any interview loop, is the cluster to worry about landing in. The toughest people to place in the current market are senior specialists who went too deep into one stack and lost the rest, not juniors.

The first-attempt cliff is steeper than expected

Pass rate by attempt number, averaged across the 75 problems (per-problem pass rates, then averaged):

Attempt 1: 41.3% | Attempt 2: 58.7% | Attempt 3: 71.2% | Attempt 4+: 84.6%
Cumulative pass rate by attempt number, averaged across the 75 problems

The second number is where the story is. The problem does not change between attempt one and attempt two; the data engineer does. They finally read the schema.

DEs who grind the DD75 and then pass mock interviews report, in slightly different words every time, that the change was learning to slow down before they started writing. The data engineer interview round and the production incident fail the same way: by reaching for the query before reading the table.

The cliff is steepest on data modeling problems, where attempt one to attempt two jumps 24 points. Most of those first attempts picked the wrong grain. Most of the second attempts got it right because the DE finally drew the bus matrix.
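What picking the wrong grain costs can be shown in a few lines. The sketch below uses hypothetical `orders` and `order_items` tables, not a DD75 problem: joining an order-level measure to item-level rows changes the grain to one row per item, and the total inflates without any error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: orders is one row per order, order_items is one row
# per order line.
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_total REAL);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES
        (1, 'A', 1), (1, 'B', 2), (1, 'C', 1), (2, 'A', 1);
    """
)

# After this join the grain is one row per order ITEM, so each order_total
# repeats once per item and the sum is inflated.
fanned_out = conn.execute(
    "SELECT SUM(o.order_total) FROM orders o "
    "JOIN order_items i ON i.order_id = o.order_id"
).fetchone()[0]

# Aggregate items up to the order grain first, then join one-to-one.
at_order_grain = conn.execute(
    """
    SELECT SUM(o.order_total), SUM(i.item_qty)
    FROM orders o
    JOIN (
        SELECT order_id, SUM(qty) AS item_qty
        FROM order_items
        GROUP BY order_id
    ) i ON i.order_id = o.order_id
    """
).fetchone()

print(fanned_out, at_order_grain)
```

Stating the grain of each side before writing the join is what surfaces the mismatch: the true revenue here is 150, and the fanned-out query reports 350.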

What the top 1% of DE interview prep looks like

73 data engineers have hit every L6 problem they have attempted and passed it on the first try. We call them the perfect-run cohort. (Not 73 DEs who have completed every L6 problem; 73 who have not missed one yet, on however many they have attempted.) We expected the obvious differentiators to surface: experience, education, company pedigree. None of them did. The cohort spans every seniority bucket and every background.

What separates them is iteration. The DD75 shows every data engineer a sample of the data on every problem. The perfect-run DEs look at it, write a quick exploratory query against the same tables, read what came back, and only then write a real attempt. The DEs at the bottom of the distribution do the opposite. Their first submission is whatever they came up with in their head. When it fails they edit it and resubmit. When that fails they edit it again. Most of them never write anything smaller than the full solution.

The senior data engineers who rarely ship bugs do the same thing in production. They confirm the data before they trust it.
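An exploratory query of that kind can be very small. The sketch below profiles a hypothetical `events` table before any real attempt is written; the table, its columns, and the choice of checks are illustrative assumptions, not what the perfect-run cohort literally typed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with the kind of mess a sample of real data tends to hide.
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-03-01"), (1, "2024-03-01"), (2, None), (None, "2024-03-02")],
)

# One pass over the table: row count, NULLs per column, distinct keys, range.
# COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the NULL count.
profile = conn.execute(
    """
    SELECT
        COUNT(*)                  AS row_count,
        COUNT(*) - COUNT(user_id) AS null_user_ids,
        COUNT(*) - COUNT(ts)      AS null_ts,
        COUNT(DISTINCT user_id)   AS distinct_users,
        MIN(ts)                   AS first_ts,
        MAX(ts)                   AS last_ts
    FROM events
    """
).fetchone()

print(profile)
```

Four rows, two distinct users, and a NULL in each column: three facts that would each change how the real query gets written.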

The perfect-run cohort spent more than 60% of their DD75 time on data modeling and pipeline architecture. They practiced the rounds they were going to fail.
DataDriven editorial, 2026

What this means if you're interviewing for DE roles

The DataDriven 75 was built on a hypothesis: that data engineering interviews test a specific set of patterns, and that the field has been preparing for the wrong ones. Three months and 6,538 data engineers later, the data confirms it.

Stop drilling SQL window functions. Most DEs are already good at them. Start drilling the patterns that decide loops:

  1. SCD Type 2 merges. Written out end to end. The full SQL, not the concept.
  2. Grain. State the grain of every fact table you sketch in the first sentence, before you draw a single column.
  3. Watermarks and late data. If you can't articulate the freshness-correctness tradeoff in 30 seconds, the streaming round is going to end badly.
  4. Gaps and islands. Memorize the row-number-difference trick; nobody derives it on the clock.
  5. Dual writes. Know why they are wrong, and have a real alternative ready. Outbox pattern, CDC. Pick one and own it.
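For the dual-writes item, a minimal outbox sketch, with sqlite3 standing in for the service database and a plain callback standing in for the message broker. Table names, the `place_order` and `relay` helpers, and the event shape are all hypothetical. The idea it illustrates: the business row and the event are written in one local transaction, and a separate relay publishes from the outbox afterward, so there is never a moment where one write has landed and the other has not.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: the business table and its outbox live in the same DB.
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        payload   TEXT    NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(order_id: int) -> None:
    """Write the business row and its event in ONE local transaction."""
    with conn:  # commits both inserts, or rolls both back on an exception
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay(publish) -> int:
    """Publish pending events in order, marking each one after it is sent.

    A crash between publish and mark re-sends the event on the next run, so
    delivery is at-least-once and consumers should deduplicate on event_id.
    """
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox "
        "WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)


place_order(1)
sent = []
first_run = relay(lambda event_id, event: sent.append((event_id, event["order_id"])))
second_run = relay(lambda event_id, event: sent.append((event_id, event["order_id"])))
print(sent, first_run, second_run)
```

The tradeoff to be able to say out loud is the one in the `relay` docstring: the outbox trades the dual-write inconsistency for at-least-once delivery, which pushes deduplication onto the consumer.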

The 73 DEs in the perfect-run cohort spent more than 60% of their DD75 time on data modeling and pipeline architecture, the two domains everyone else avoids. They practiced the rounds they were going to fail.

Common misconceptions vs hiring-manager reality

Myth: Senior DEs should pass DE interview questions more easily than juniors.
Reality: L6 questions get 52% first-attempt pass rates; L3 questions get 31%. And data modeling pass rate decreases with seniority (junior 34%, staff+ 27%). Senior pattern recognition cuts both ways.

Myth: The first attempt at an interview problem should be a polished solution.
Reality: The 11-minute cliff says otherwise. Candidates whose first grader submission lands within 11 minutes pass 67% of the time; after, 8%. A quick wrong query that surfaces what was missed beats a polished query built on a bad read of the data.

Myth: The hardest DE interview questions are recursive CTEs and complex joins.
Reality: The hardest question on the DD75 (Long Messages, 4.2% pass rate) is a five-minute filter sunk by NULL propagation. The easy-looking SQL questions decide loops more often than the gnarly ones.

Myth: Grinding window functions and LeetCode mediums is the right DE interview prep.
Reality: The perfect-run cohort spent 60%+ of their DD75 time on data modeling and pipeline architecture, the two domains most candidates avoid. They practiced the rounds they were going to fail.