DataDriven 75: 90 Days of Data Engineer Interview Data
After 90 days, 6,538 data engineers, and 412,887 SQL queries graded, here is what the DataDriven 75 reveals about data engineer interview prep.
- 6,538 data engineers worked through the DD75 in 90 days, submitting 412,887 SQL queries and 287,304 Python solutions to the grader.
- The hardest question on the list is a SQL filter problem with a 4.2% pass rate, sunk by NULL propagation. The easy-looking question is the one that kills loops.
- L6 questions get passed at 52% first-attempt while L3 questions pass at 31%. Harder-looking questions force candidates to slow down; easier ones make them skim.
- Data modeling pass rate decreases with seniority: junior 34%, mid 32%, senior 28%, staff+ 27%. The pattern catches senior specialists who shipped from memory instead of deriving from grain.
- The 11-minute cliff: candidates whose first grader submission lands within 11 minutes eventually pass 67% of the time. After 11 minutes, 8%. Talking to the grader fast is the strongest single predictor.
90 days, 6,538 data engineers, 412,887 SQL queries
The DataDriven 75 has been live for 90 days. 6,538 data engineers have worked through some or all of the list. Together they have submitted 412,887 SQL queries to the grader, run 287,304 Python solutions against the Docker sandbox, and reasoned through tens of thousands of data modeling and pipeline architecture problems.
What follows is what the data says about data engineer interview prep, the patterns that decide DE interview loops, and where the field is consistently weak.
How the DataDriven 75 was built
The DD75 is a curated list of data engineering interview problems, comparable in spirit to Blind 75 or LeetCode 75 but built specifically for data engineers. Each problem was nominated, defended, and voted on by a working group of Principal and Staff Data Engineers from seven companies (FAANGs, two large fintechs, and a hypergrowth startup). Problems that did not get supermajority support were cut. Problems that doubled up on a pattern already covered were cut. Problems that tested trivia were cut.
The list spans four domains: SQL, Python, data modeling, and pipeline architecture.
Every problem is tagged with the seniority of the interview it came from: L3 for new-grad and junior loops, L4 for mid-level, L5 for senior, L6 for staff and above.
The hardest DD75 question has a 4.2% pass rate
The hardest question is a SQL question called Long Messages. On the surface it is a five-minute filter against a single table. The trap is NULL propagation: data engineers write the obvious filter and forget that a comparison against NULL evaluates to unknown rather than to false, so rows with a NULL drop out of the filter and of its negation alike. The submitted count is quietly wrong.
The first-attempt pass rate on Long Messages is lower than every L6 pipeline question in the DD75, lower than the L6 tree traversal problems, and lower than The Customer Who Changed (the L6 SCD Type 2 round).
This is not a SQL trick question. It is the bug that is in production at most companies right now, and the DD75 data suggests 95.8% of data engineers cannot catch it under interview pressure. The query runs. The number looks plausible. The dashboard ships. Three weeks later somebody pulls the metric for a quarterly review and the on-call DE spends a Saturday figuring out why the numbers do not reconcile against the source system.
The SQL questions that decide loops are not the gnarly recursive CTE ones. They are the easy-looking ones with a NULL hiding in the schema.
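A minimal sketch of the trap, using Python's built-in sqlite3 module. The `messages` table, the `body_length` column, and the 100-character threshold are hypothetical stand-ins, not the actual Long Messages schema.

```python
import sqlite3

# Hypothetical schema and data; not the actual Long Messages problem.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body_length INTEGER)")
conn.executemany(
    "INSERT INTO messages (id, body_length) VALUES (?, ?)",
    [(1, 250), (2, 40), (3, None), (4, 180), (5, None)],
)

def count(where_clause: str) -> int:
    """Count rows in messages that satisfy a WHERE clause."""
    sql = f"SELECT COUNT(*) FROM messages WHERE {where_clause}"
    return conn.execute(sql).fetchone()[0]

total = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
long_count = count("body_length > 100")
# The "obvious" complement: NOT (NULL > 100) is still NULL, so NULL rows vanish here too.
naive_short = count("NOT (body_length > 100)")
# NULL-aware complement: decide explicitly what a missing length means.
safe_short = count("body_length <= 100 OR body_length IS NULL")

print(total, long_count, naive_short, safe_short)  # 5 2 1 3
```

The two naive counts add up to 3, not 5. Neither query errors, which is why the wrong number survives review.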
The seniority inversion: harder questions get higher pass rates
Every data engineer predicts the same ranking before sitting down: L3 questions easiest, L6 questions hardest. The DD75 data inverts that prediction.
L6 questions are passed on the first attempt 52% of the time. L3 questions, 31%. The questions pulled from the most senior data engineering interview loops are getting passed more often than the questions pulled from the new-grad screens.
The reason is behavioral. L6 questions look intimidating, so data engineers slow down. They read the schema. They sketch on paper. They run something small before they run something real. L3 questions look easy, so data engineers skim, type, submit, and lose to a NULL. The senior loop is not asking harder questions; it is asking questions that force candidates to behave the way they should be behaving on every question.
The same dynamic is responsible for half the production outages every senior data engineer has ever been paged for. The bug is never in the migration the team spent two weeks reviewing. It is in the one-line config change shipped on a Tuesday.
Data modeling pass rates get worse with seniority
The finding that broke the spreadsheet:
Data engineers self-report their seniority on signup: junior, mid, senior, or staff+. Group everyone by that bucket and compute first-attempt pass rate across the problems each engineer attempted. SQL improves with experience the way one would expect: juniors pass at around 38%, staff+ at 61%. Python improves: 44% to 67%. Pipeline architecture improves: 29% to 58%.
Data modeling does the opposite.
A staff-level data engineer who attempts a DD75 modeling question is less likely to pass it on the first try than a junior who attempts the same kind of question. Seniority is anticorrelated with first-attempt success on data modeling questions, and the effect resists every control we have tried.
Best guess at why: junior DEs approach data modeling interview questions the way they were taught. State the grain. List the assumptions. Walk the join paths. Senior and staff DEs approach them the way they ship at work, by pattern-matching to a schema they shipped at a previous company, skipping the grain statement, defending the design from memory.
The data modeling interview is not asking what the candidate has shipped. It is asking whether they can derive a model from a fresh problem in 25 minutes. Staff DEs have not done that exercise since their last job change. Juniors do it every week.
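One way to make "state the grain" concrete is to write the grain down as a list of columns and check that it identifies each row uniquely. The sketch below is illustrative only; the `order_lines` rows, the column names, and the `grain_violations` helper are hypothetical.

```python
from collections import Counter

def grain_violations(rows, grain):
    """Return the grain key values that appear on more than one row."""
    keys = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in keys.items() if n > 1]

# Hypothetical fact rows, intended as one row per order line.
order_lines = [
    {"order_id": 1, "line_no": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "line_no": 2, "sku": "B", "qty": 1},
    {"order_id": 2, "line_no": 1, "sku": "A", "qty": 5},
]

# Claimed grain "one row per order" is wrong: order 1 appears twice.
wrong = grain_violations(order_lines, ["order_id"])
# Claimed grain "one row per order line" holds.
right = grain_violations(order_lines, ["order_id", "line_no"])

print(wrong, right)  # [(1,)] []
```

A model recalled from a previous schema skips this step; a model derived from the problem statement starts with it.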
“A senior DE who hasn’t interviewed in three years and is worried about the system design round is worried about the wrong round. Worry about the modeling round.”
The Customer Who Changed: 4,118 attempts, 73 perfect runs
The Customer Who Changed is the SCD Type 2 problem in the DD75 and the most-attempted L6 problem on the list.
Seventy-three of them landed a perfect run: 1.8% of the 4,118 data engineers who actually sat down with the problem. The other 2,420 in the cohort have not opened it. When a data engineer attempts SCD Type 2 cold, fewer than two in a hundred land it.
The 73 are not who one would guess. They are not concentrated at FAANG. They are not the data engineers with graduate degrees. The seniority breakdown is closer to flat than to top-heavy: a healthy share of mid-level DEs, fewer staff+ than expected.
A theory the data cannot fully prove but every senior DE agrees with: SCD Type 2 is learned from a book once, and learned for real the morning after a backfill silently corrupts historical rows in production. The DEs who land it cold are mostly the ones who already lived through that incident, and the merge logic is etched into them in a way no textbook can match.
The DD75 version of that lesson is cheaper than the on-call version.
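For readers who have only met SCD Type 2 in a book, here is a minimal sketch of the core merge step: expire the current row, then append the new version. It is not the DD75 reference solution. The row layout (`customer_id`, `attrs`, `valid_from`, `valid_to`), the half-open validity interval, and the `apply_scd2` name are all assumptions made for illustration, and the sketch ignores late-arriving or out-of-order changes, which is where real backfills tend to go wrong.

```python
from datetime import date

def apply_scd2(dim_rows, key, new_attrs, effective):
    """Apply one change to a Type 2 dimension held as a list of dicts.

    Closes the open row for `key` if its attributes differ, then appends a
    new open row. Returns a new list; the input rows are not mutated.
    """
    out = []
    changed = True
    for row in dim_rows:
        row = dict(row)  # copy so the caller's history stays untouched
        if row["customer_id"] == key and row["valid_to"] is None:
            if row["attrs"] == new_attrs:
                changed = False              # same values: keep the open row as is
            else:
                row["valid_to"] = effective  # expire the old version
        out.append(row)
    if changed:
        out.append({"customer_id": key, "attrs": new_attrs,
                    "valid_from": effective, "valid_to": None})
    return out

dim = [{"customer_id": 7, "attrs": {"city": "Austin"},
        "valid_from": date(2023, 1, 1), "valid_to": None}]
moved = apply_scd2(dim, 7, {"city": "Denver"}, date(2024, 6, 1))
replayed = apply_scd2(moved, 7, {"city": "Denver"}, date(2024, 7, 1))
```

After the first call there are two rows, the Austin row closed on 2024-06-01 and an open Denver row. Replaying the same values leaves the table unchanged, the property a rerun of the same load depends on.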
The Gaps and Islands graveyard
Longest Visit Streaks is the L6 gaps-and-islands problem in the DD75. Median time to a passing submission: 34 minutes. Median number of submissions before passing: 9.
Nine submissions, on a 23-line solution. The grader is not a compiler; every one of those rejections is a deliberate run that the data engineer believed would pass. DEs try a CTE, then a window function, then LAG, then a self-join. Around submission seven somebody remembers the row-number-difference trick and the problem collapses in three more lines.
The trick is not derivable on the clock. A candidate has seen it or has not. Of the 2,103 DEs who have attempted Longest Visit Streaks, the median DE who passed had attempted at least one other gaps-and-islands problem in the previous 30 days. The median DE who failed had not.
At this level, data engineering interview questions test pattern recall more than problem-solving. That is how to prepare for them, not a complaint about them.
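For anyone who has not seen the row-number-difference trick, a minimal Python sketch follows. It is not the Longest Visit Streaks solution; the `longest_streak` name and the sample dates are hypothetical. The idea carries over to SQL, where `ROW_NUMBER()` over the ordered dates plays the role of the index used here.

```python
from datetime import date, timedelta
from itertools import groupby

def longest_streak(visit_dates):
    """Length of the longest run of consecutive calendar days."""
    days = sorted(set(visit_dates))  # dedupe: two visits on one day count once
    # Within a run of consecutive days, (day - row_number) is constant,
    # so that difference works as an island identifier.
    islands = groupby(
        enumerate(days),
        key=lambda pair: pair[1] - timedelta(days=pair[0]),
    )
    return max((sum(1 for _ in run) for _, run in islands), default=0)

visits = [
    date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),  # island of 3
    date(2024, 1, 5), date(2024, 1, 6),                    # island of 2
    date(2024, 1, 2),                                      # duplicate visit
]
print(longest_streak(visits))  # 3
```

Because the dates are sorted and distinct, the difference never decreases, so each island arrives as one contiguous group and `groupby` needs no extra sort.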
The 11-minute cliff in grader telemetry
The most informative number in the dataset is not pass rate. It is the gap between when a data engineer opens a problem and when they hit the grader for the first time. Across every level and every domain, there is a single global cliff at the eleven-minute mark.
DEs whose first grader submission lands within 11 minutes of opening the problem go on to pass it 67% of the time. DEs whose first submission lands later than that pass it 8% of the time.
Eleven minutes is a generous window, not a race. DEs who pass use those opening minutes to read the schema and the sample data, then push a quick and wrong query so the grader can tell them what they missed. DEs who fail use those minutes trying to write a perfect query in their head, and by the time they hit submit they have committed to whichever bug their first read of the data baked in.
The interview equivalent is the candidate who refuses to talk until they have the whole answer.
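The split behind the cliff is simple to reproduce on any practice log. The sketch below assumes a hypothetical record shape (`minutes_to_first_submit`, `passed`) and uses made-up rows, not DD75 telemetry.

```python
def pass_rate_by_first_submit(attempts, cutoff_minutes=11):
    """Split attempts at the cutoff and return (fast_rate, slow_rate)."""
    fast = [a["passed"] for a in attempts
            if a["minutes_to_first_submit"] <= cutoff_minutes]
    slow = [a["passed"] for a in attempts
            if a["minutes_to_first_submit"] > cutoff_minutes]

    def rate(outcomes):
        # None when a bucket is empty, rather than dividing by zero.
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(fast), rate(slow)

# Made-up rows for illustration only.
attempts = [
    {"minutes_to_first_submit": 3, "passed": True},
    {"minutes_to_first_submit": 8, "passed": True},
    {"minutes_to_first_submit": 10, "passed": False},
    {"minutes_to_first_submit": 11, "passed": True},
    {"minutes_to_first_submit": 12, "passed": False},
    {"minutes_to_first_submit": 15, "passed": False},
    {"minutes_to_first_submit": 25, "passed": False},
    {"minutes_to_first_submit": 40, "passed": True},
]
fast_rate, slow_rate = pass_rate_by_first_submit(attempts)
print(fast_rate, slow_rate)  # 0.75 0.25
```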
Four shapes show up in the DE pass-rate data
Only about a third of the 6,538 have spent meaningful time across all four domains, so the analysis below restricts to that group. Plot their pass rates and four rough shapes appear. The shapes overlap at the edges and a long tail does not fit any of them, but anyone with time in the field will recognize all four.
The Analytics Engineer
The Platform Engineer
The Generalist Data Engineer
The Senior IC
The remainder are the DEs below the median in every domain. About one in ten. Not all of them are juniors. A meaningful chunk self-report as senior or staff. Those are mid-career DEs who built their careers at one company on one stack and have not generalized, and the breadth that data engineering interview loops test catches them out.
Six years into a DE role, four years out of any interview loop, is the cluster to worry about landing in. The toughest people to place in the current market are senior specialists who went too deep into one stack and lost the rest, not juniors.
The first-attempt cliff is steeper than expected
Pass rate by attempt number, averaged across the 75 problems (per-problem pass rates, then averaged):
The second number is where the story is. The problem does not change between attempt one and attempt two; the data engineer does. They finally read the schema.
DEs who grind the DD75 and then pass mock interviews report, in slightly different words every time, that the change was learning to slow down before they started writing. The data engineer interview round and the production incident fail the same way: by reaching for the query before reading the table.
The cliff is steepest on data modeling problems, where attempt one to attempt two jumps 24 points. Most of those first attempts picked the wrong grain. Most of the second attempts got it right because the DE finally drew the bus matrix.
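The averaging used in this section (per-problem pass rates first, then the mean across problems) gives every problem equal weight, which differs from pooling all attempts together. A small sketch of the difference, with hypothetical counts and function names:

```python
def macro_average(per_problem):
    """Mean of per-problem pass rates: every problem weighs the same."""
    rates = [passed / tried for passed, tried in per_problem.values()]
    return sum(rates) / len(rates)

def pooled_rate(per_problem):
    """Total passes over total attempts: heavily attempted problems dominate."""
    total_passed = sum(passed for passed, _ in per_problem.values())
    total_tried = sum(tried for _, tried in per_problem.values())
    return total_passed / total_tried

# Hypothetical counts: (first-attempt passes, first attempts) per problem.
counts = {"popular_easy": (900, 1000), "rare_hard": (5, 100)}

macro = macro_average(counts)   # (0.90 + 0.05) / 2
pooled = pooled_rate(counts)    # 905 / 1100
```

With these numbers the macro average is 0.475 while the pooled rate is about 0.82. Averaging per problem keeps a handful of popular problems from setting the headline figure.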
What the top 1% of DE interview prep looks like
73 data engineers have passed, on the first try, every L6 problem they have attempted. We call them the perfect-run cohort. (Not 73 DEs who have completed every L6 problem; 73 who have not missed one yet, on however many they have attempted.) We expected the obvious differentiators to surface: experience, education, company pedigree. None of them did. The cohort spans every seniority bucket and every background.
What separates them is iteration. The DD75 shows every data engineer a sample of the data on every problem. The perfect-run DEs look at it, write a quick exploratory query against the same tables, read what came back, and only then write a real attempt. The DEs at the bottom of the distribution do the opposite. Their first submission is whatever they came up with in their head. When it fails they edit it and resubmit. When that fails they edit it again. Most of them never write anything smaller than the full solution.
The senior data engineers who rarely ship bugs do the same thing in production. They confirm the data before they trust it.
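A quick exploratory query can be as small as three counts per column. The sketch below uses Python's built-in sqlite3 module; the `visits` table and the `profile_column` helper are hypothetical, not part of the DD75 grader.

```python
import sqlite3

def profile_column(conn, table, column):
    """Row count, NULL count, and distinct count for one column."""
    # Identifiers are interpolated directly: fine for a scratch query
    # against your own tables, not for untrusted input.
    total, non_null, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return {"rows": total, "nulls": total - non_null, "distinct": distinct}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "US"), (1, "US"), (2, None), (3, "DE")],
)

country_profile = profile_column(conn, "visits", "country")
user_profile = profile_column(conn, "visits", "user_id")
print(country_profile)  # {'rows': 4, 'nulls': 1, 'distinct': 2}
```

`COUNT(column)` skips NULLs while `COUNT(*)` does not, so the gap between the two exposes missing values before they reach a filter or a join.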
What this means if you're interviewing for DE roles
The DataDriven 75 was built on a hypothesis: that data engineering interviews test a specific set of patterns, and that the field has been preparing for the wrong ones. Three months and 6,538 data engineers later, the data confirms it.
Stop drilling SQL window functions. Most DEs are already good at them. Start drilling the patterns that decide loops:
The 73 DEs in the perfect-run cohort spent more than 60% of their DD75 time on data modeling and pipeline architecture, the two domains everyone else avoids. They practiced the rounds they were going to fail.