The DataDriven 75 has been live for 90 days. In that time, 6,538 data engineers have worked through some or all of the list. Together they have submitted 412,887 SQL queries to the grader, run 287,304 Python solutions against the Docker sandbox, and reasoned through tens of thousands of data modeling and pipeline architecture problems.
This is what the data says about data engineer interview prep, the patterns that decide DE interview loops, and where the field is consistently weak.
How the DataDriven 75 was built
The DD75 is a curated list of data engineering interview problems, comparable in spirit to Blind 75 or LeetCode 75 but built specifically for data engineers. Each problem was nominated, defended, and voted on by a working group of Principal and Staff Data Engineers from seven companies (FAANGs, two large fintechs, and a hypergrowth startup). Problems that did not get supermajority support were cut. Problems that doubled up on a pattern already covered were cut. Problems that tested trivia were cut.
The list spans four domains: SQL, Python, data modeling, and pipeline architecture.
Every problem is tagged with the seniority of the interview it came from: L3 for new-grad and junior loops, L4 for mid-level, L5 for senior, L6 for staff and above.
The hardest data engineer interview question on the DD75 has a 4.2% pass rate
It is a SQL question called Long Messages. On the surface it is a five-minute filter against a single table. The data engineering interview trap is NULL propagation: data engineers write the obvious filter, forget that NULL drops out of the comparison rather than evaluating to false, and submit a count that is quietly wrong.
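The exact schema behind Long Messages is not reproduced here, but the trap generalizes to any nullable column. A minimal sketch, using a hypothetical `messages` table and Python's stdlib `sqlite3` so the behavior is runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT);
INSERT INTO messages (body) VALUES
  ('hi'),
  ('a message that is comfortably longer than the twenty-character cutoff'),
  (NULL);
""")

# The obvious filter: count messages that are NOT long.
# LENGTH(NULL) > 20 evaluates to NULL, and NOT NULL is still NULL,
# so the NULL-body row silently drops out of the count.
naive = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE NOT (LENGTH(body) > 20)"
).fetchone()[0]

# Handle the NULL explicitly instead of letting it propagate.
fixed = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE body IS NULL OR LENGTH(body) <= 20"
).fetchone()[0]

print(naive, fixed)  # naive undercounts: 1 vs 2
```

Both queries run without error and both return a plausible number, which is exactly why the bug survives review.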
Long Messages has a lower first-attempt pass rate than every L6 pipeline question in the DD75. Lower than the L6 tree traversal problems. Lower than The Customer Who Changed, the L6 SCD Type 2 round.
This is not a SQL trick question. This is the bug that is in production at your company right now, and the DD75 data suggests 95.8% of data engineers cannot catch it under interview pressure. The query runs. The number looks plausible. The dashboard ships. Three weeks later somebody pulls the metric for a quarterly review and the on-call data engineer spends a Saturday figuring out why the numbers do not reconcile against the source system.
The takeaway for anyone studying for a data engineer interview: the SQL questions that decide loops are not the gnarly recursive CTE ones. They are the easy-looking ones with a NULL hiding in the schema.
The seniority inversion: harder-looking questions get higher pass rates
Every data engineer we talk to predicts the same ranking before sitting down: L3 questions easiest, L6 questions hardest. The DD75 data inverts that prediction.
L6 questions are passed on the first attempt 52% of the time. L3 questions, 31%. The questions pulled from the most senior data engineering interview loops are getting passed more often than the questions pulled from the new-grad screens.
The reason is behavioral. L6 questions look intimidating, so data engineers slow down. They read the schema. They sketch on paper. They run something small before they run something real. L3 questions look easy, so data engineers skim, type, submit, and lose to a NULL. The senior loop is not asking harder questions; it is asking questions that force candidates to behave the way they should be behaving on every question.
The same dynamic is responsible for half the production outages every senior data engineer has ever paged on. The bug is never in the migration you spent two weeks reviewing. It is in the one-line config change you shipped on a Tuesday.
Data modeling pass rates get worse with seniority
This is the finding that broke the spreadsheet.
Data engineers self-report their seniority when they sign up: junior, mid, senior, or staff+. Group everyone by that bucket and compute first-attempt pass rate across the problems each data engineer actually attempted. SQL improves with experience the way you would expect: juniors pass at around 38%, staff+ at 61%. Python improves: 44% to 67%. Pipeline architecture improves: 29% to 58%.
Data modeling does the opposite.
A staff-level data engineer who attempts a DD75 modeling question is less likely to pass it on the first try than a junior who attempts the same kind of question. Seniority is anticorrelated with first-attempt success on data modeling questions, and we cannot find a way to make the effect go away.
Best guess at why: junior DEs approach data modeling interview questions the way they were taught. State the grain. List the assumptions. Walk the join paths. Senior and staff DEs approach them the way they ship at work, by pattern-matching to a schema they shipped at a previous company, skipping the grain statement, defending the design from memory.
The data modeling interview is not asking what you have shipped. It is asking whether you can derive a model from a fresh problem in 25 minutes. Staff DEs have not done that exercise since their last job change. Juniors do it every week.
If you are senior and have not interviewed in three years, the system design round is the wrong round to worry about. Worry about the modeling round.
The Customer Who Changed: 4,118 attempts, 73 perfect runs
The Customer Who Changed is the SCD Type 2 problem in the DD75 and the most-attempted L6 problem on the list.
Seventy-three perfect runs is 1.8% of the 4,118 data engineers who actually sat down with the problem. The other 2,420 in the cohort have not opened it. When a data engineer attempts SCD Type 2 cold, fewer than two in a hundred land it.
The 73 are not who you would guess. They are not concentrated at FAANG. They are not the data engineers with graduate degrees. The seniority breakdown is closer to flat than to top-heavy: a healthy share of mid-level DEs, fewer staff+ than you would think.
A theory the data cannot fully prove but every senior DE we mention it to agrees with: you learn SCD Type 2 from a book once, and you learn it for real the morning after a backfill silently corrupts historical rows in production. The DEs who land it cold are mostly the ones who already lived through that incident, and the merge logic is etched into them in a way no textbook can match.
The DD75 version of that lesson is cheaper than the on-call version.
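The merge logic the problem tests can be sketched as a two-step close-and-insert. This is an illustrative sketch, not the graded solution; the `dim_customer` table, its columns, and the `9999-12-31` sentinel are assumptions for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
  customer_id INTEGER,
  city        TEXT,
  valid_from  TEXT,
  valid_to    TEXT,
  is_current  INTEGER
);
INSERT INTO dim_customer VALUES (42, 'Austin', '2023-01-01', '9999-12-31', 1);
""")

# Incoming change: customer 42 moved to Denver, effective 2024-06-01.
customer_id, new_city, effective = 42, "Denver", "2024-06-01"

# Step 1: close the current row, but only if the tracked attribute changed.
closed = conn.execute(
    """UPDATE dim_customer
       SET valid_to = ?, is_current = 0
       WHERE customer_id = ? AND is_current = 1 AND city <> ?""",
    (effective, customer_id, new_city),
).rowcount

# Step 2: open a new current row only when step 1 actually closed one,
# so replaying the same change twice does not corrupt history.
if closed:
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_city, effective),
    )

history = conn.execute(
    """SELECT city, valid_from, valid_to, is_current
       FROM dim_customer WHERE customer_id = 42 ORDER BY valid_from"""
).fetchall()
print(history)
```

The guard in step 2 is the part backfills tend to break: re-running an unconditional insert is what silently duplicates or corrupts historical rows.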
The Gaps and Islands graveyard
Longest Visit Streaks is the L6 gaps-and-islands problem in the DD75. Median time to a passing submission: 34 minutes. Median number of submissions before passing: 9.
Nine submissions, on a 23-line solution. The grader is not a compiler; every one of those rejections is a deliberate run that the data engineer believed would pass. DEs try a CTE, then a window function, then LAG, then a self-join. Around submission seven somebody remembers the row-number-difference trick and the problem collapses in three more lines.
That trick is not derivable on the clock. You have seen it or you have not. Of the 2,103 DEs who have attempted Longest Visit Streaks, the median DE who passed had attempted at least one other gaps-and-islands problem in the previous 30 days. The median DE who failed had not.
At this level, data engineering interview questions test pattern recall more than problem-solving. That is how to prepare for them, not a complaint about them.
The 11-minute cliff in DD75 grader telemetry
The most informative number in the dataset is not pass rate. It is the gap between when a data engineer opens a problem and when they hit the grader for the first time. Across every level and every domain, there is a single global cliff at the eleven-minute mark.
DEs whose first grader submission lands within 11 minutes of opening the problem go on to pass it 67% of the time. DEs whose first submission lands later than that pass it 8% of the time.
Eleven minutes is the slow signal, not the fast one. DEs who pass use those opening minutes to read the schema and the sample data, then push a quick and wrong query so the grader can tell them what they missed. DEs who fail use those minutes trying to write a perfect query in their head, and by the time they hit submit they have committed to whichever bug their first read of the data baked in.
The interview equivalent is the candidate who refuses to talk until they have the whole answer.
Four shapes show up in the data engineer interview pass-rate data
Only about a third of the 6,538 have spent meaningful time across all four domains, so the analysis below restricts to that group. Plot their pass rates and four rough shapes appear. The shapes overlap at the edges and a long tail does not fit any of them, but if you have spent time in the data engineering field you will recognize all four.
The Analytics Engineer
The Platform Engineer
The Generalist Data Engineer
The Senior IC
What is left are the data engineers below the median in every domain. About one in ten. Not all of them are juniors. A meaningful chunk self-report as senior or staff. Those are mid-career DEs who built their careers at one company on one stack and have not generalized, and the breadth that data engineering interview loops test catches them out.
If you are six years into a DE role and have not interviewed in four, this is the cluster you should worry about landing in. The toughest people to place in the current market are senior specialists who went too deep into one stack and lost the rest, not juniors.
The first-attempt cliff is steeper than you think
Pass rate by attempt number, averaged across the 75 problems (per-problem pass rates, then averaged):
Stare at the second number. The problem does not change between attempt one and attempt two; the data engineer does. They finally read the schema.
DEs who grind the DD75 and then pass mock interviews tell us, in slightly different words every time, that the change was learning to slow down before they started writing. The data engineer interview round and the production incident fail the same way: by reaching for the query before reading the table.
The cliff is steepest on data modeling problems, where attempt one to attempt two jumps 24 points. Most of those first attempts picked the wrong grain. Most of the second attempts got it right because the DE finally drew the bus matrix.
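Stating the grain is also checkable in one query. A minimal sketch, with a hypothetical `fact_order_lines` table standing in for whatever fact table the problem gives you:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_order_lines (order_id INTEGER, line_number INTEGER, amount REAL);
INSERT INTO fact_order_lines VALUES
  (1, 1, 9.99), (1, 2, 5.00), (2, 1, 12.50),
  (1, 1, 9.99);  -- accidental duplicate at the claimed grain
""")

# Claimed grain: one row per (order_id, line_number).
# Any row returned here means the grain statement is wrong.
dupes = conn.execute("""
  SELECT order_id, line_number, COUNT(*) AS n
  FROM fact_order_lines
  GROUP BY order_id, line_number
  HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [(1, 1, 2)]
```

Running this before modeling anything is the query-form equivalent of drawing the bus matrix first.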
What the top 1% of data engineer interview prep looks like
73 data engineers have hit every L6 problem they have attempted and passed it on the first try. We call them the perfect-run cohort. (Not 73 DEs who have completed every L6 problem; 73 who have not missed one yet, on however many they have attempted.) We expected the obvious differentiators to surface: experience, education, company pedigree. None of them did. The cohort spans every seniority bucket and every background.
What separates them is iteration. The DD75 shows every data engineer a sample of the data on every problem. The perfect-run DEs look at it, write a quick exploratory query against the same tables, read what came back, and only then write a real attempt. The DEs at the bottom of the distribution do the opposite. Their first submission is whatever they came up with in their head. When it fails they edit it and resubmit. When that fails they edit it again. Most of them never write anything smaller than the full solution.
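That exploratory pass can be as small as two queries. A sketch of the habit, against a hypothetical `events` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event_type TEXT, ts TEXT);
INSERT INTO events VALUES
  (1, 'click', '2024-01-01'),
  (2, NULL,    '2024-01-02'),
  (3, 'view',  NULL);
""")

# Exploratory pass: eyeball a few rows and count NULLs per column
# before committing to the "real" query.
sample = conn.execute("SELECT * FROM events LIMIT 5").fetchall()
null_counts = conn.execute(
    "SELECT SUM(event_type IS NULL), SUM(ts IS NULL) FROM events"
).fetchone()
print(null_counts)  # one NULL event_type, one NULL ts
```

Thirty seconds of this is what surfaces the NULL, the duplicate, or the surprise grain before it is baked into a full solution.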
The senior data engineers we know who rarely ship bugs do the same thing. They confirm the data before they trust it.
The perfect-run cohort spent more than 60% of their DD75 time on data modeling and pipeline architecture. They practiced the rounds they were going to fail.
What this means if you are interviewing for a data engineer role
The DataDriven 75 was built on a hypothesis: that data engineering interviews test a specific set of patterns, and that the field has been preparing for the wrong ones. Three months and 6,538 data engineers later, here is what the data says.
Stop drilling SQL window functions. You are already good at them. Start drilling the patterns that decide loops: NULL propagation in easy-looking filters, SCD Type 2 merge logic, gaps and islands, and stating the grain before you touch a modeling problem.
Try the DataDriven 75
All 75 questions are free to solve, forever. Work through the first ten and the pass-rate spread will tell you which of the four shapes you fit.
Start practicing