The Definitive Bank

Top 100 Data Engineer Interview Questions

The 100 highest-frequency data engineer interview questions in 2026, with worked answers and the frequency tag (how often each question appears in our dataset). Selected from 1,042 verified interview reports plus 2,400 mock interviews. Each question links to the deeper round-specific guide for context. This is the bank to drill if you have 6+ weeks to prep, paired with the complete data engineer interview preparation framework.

The Short Answer
The full 100 questions are organized by domain and difficulty below. Drill order matters: SQL first (it gates the rest of the loop), then Python (the second-most-common live coding format), then modeling (the L4-to-L5 differentiator), then design (the L5+ ceiling). If you only have time for 50, start with the curated 50 Data Engineer interview questions bank, which is the top 50 of these 100. The full PDF version is at the downloadable Data Engineer interview questions PDF.
Updated April 2026 · By The DataDriven Team

Question Distribution Across the 100

Distribution mirrors interview-loop frequency. SQL dominates at 40%, reflecting its presence in 95% of loops.

Domain        | Count | Drill Time
SQL           | 40    | 16 hours
Python        | 25    | 10 hours
Data Modeling | 20    | 8 hours
System Design | 10    | 10 hours
Behavioral    | 5     | 3 hours

SQL: The First 40

SQL is the single most-tested domain. These 40 cover joins, aggregation, window functions, CTEs, recursive queries, optimization, and dialect-specific tricks.

Q1 · L3

INNER vs LEFT vs FULL OUTER JOIN

INNER returns matched rows. LEFT keeps all left rows, NULL on no match. FULL keeps all rows on both sides. Most candidates know this; fluent candidates volunteer the row-count expectations.
Q2 · L3

GROUP BY with HAVING vs WHERE

WHERE filters rows before aggregation. HAVING filters groups after. WHERE supports indexes; HAVING does not. Move conditions to WHERE when they don't reference aggregates.
Q3 · L3

Find duplicate rows

GROUP BY all columns, HAVING COUNT(*) > 1. NULL caveat: GROUP BY treats NULLs as equal, but equality-based dedup (self-join, WHERE a.col = b.col) does not, since NULL = NULL is unknown, so duplicates containing NULL hide. COALESCE when NULL should mean "same value".
Q4 · L3

Second highest salary

DENSE_RANK to handle ties. SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employees) ranked WHERE rk = 2.
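A runnable sketch of this pattern using SQLite's window functions (available in Python's bundled sqlite3 since SQLite 3.25). The employees table and its values are hypothetical, chosen to show why DENSE_RANK handles ties where ROW_NUMBER would not.

```python
import sqlite3

# Hypothetical employees table: two people tie at the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("a", 300), ("b", 500), ("c", 500), ("d", 400)],
)

# DENSE_RANK collapses ties, so the two 500s share rank 1 and 400 is rank 2.
row = conn.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
        FROM employees
    ) ranked
    WHERE rk = 2
    LIMIT 1
""").fetchone()
print(row[0])  # 400
```

With ROW_NUMBER instead, rank 2 would land on the second of the tied 500s, returning the wrong answer.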
Q5 · L3

Count rows per group

GROUP BY group_col, SELECT COUNT(*). For unique counts: COUNT(DISTINCT col). Beware: COUNT(col) ignores NULL, COUNT(*) does not.
Q6 · L3

Sort with NULLS FIRST or LAST

ORDER BY col NULLS LAST. Postgres and Snowflake support directly. MySQL needs ORDER BY col IS NULL, col.
Q7 · L4

Deduplicate keeping latest per user

ROW_NUMBER PARTITION BY user_id ORDER BY ts DESC, filter rn = 1. Better than DISTINCT for keeping all columns.
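A minimal sketch of the ROW_NUMBER dedup, again on SQLite; the events table, its columns, and its values are illustrative assumptions.

```python
import sqlite3

# Hypothetical events table with an older and a newer row for u1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", 1, "old"), ("u1", 2, "new"), ("u2", 5, "only")],
)

# rn = 1 picks the latest row per user while keeping every column,
# which DISTINCT cannot do.
rows = conn.execute("""
    SELECT user_id, payload FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY ts DESC
        ) AS rn
        FROM events
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 'new'), ('u2', 'only')]
```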
Q8 · L4

Month-over-month growth percentage

DATE_TRUNC, SUM, LAG. (current - previous) / NULLIF(previous, 0) * 100. Volunteer NULLIF for first month.
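The LAG-plus-NULLIF pattern, sketched on SQLite with months pre-aggregated to one row each (so DATE_TRUNC is already done); the monthly_revenue table is a hypothetical stand-in.

```python
import sqlite3

# Hypothetical pre-aggregated monthly revenue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, rev REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 150.0), ("2024-03", 120.0)],
)

# NULLIF guards both the first month (LAG is NULL) and any zero-revenue month.
rows = conn.execute("""
    SELECT month,
           (rev - LAG(rev) OVER (ORDER BY month))
             / NULLIF(LAG(rev) OVER (ORDER BY month), 0) * 100 AS growth_pct
    FROM monthly_revenue
    ORDER BY month
""").fetchall()
print(rows)
```

The first month comes back NULL rather than erroring, which is the edge case worth volunteering.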
Q9 · L4

Users active 3+ consecutive days

Gap-and-island. ROW_NUMBER per user minus date. Same diff = same streak. GROUP BY user, streak_key, HAVING COUNT >= 3.
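A gap-and-island sketch on SQLite. To keep the date arithmetic dialect-neutral, day is an integer offset rather than a real date; the activity table and its values are hypothetical.

```python
import sqlite3

# u1 is active days 1-3 (a 3-day streak) then day 7; u2 has no streak.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [("u1", 1), ("u1", 2), ("u1", 3), ("u1", 7), ("u2", 4), ("u2", 6)],
)

# day - ROW_NUMBER() is constant within a run of consecutive days,
# so grouping on it isolates each streak.
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS streak_len FROM (
        SELECT user_id, day,
               day - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY day
               ) AS streak_key
        FROM activity
    )
    GROUP BY user_id, streak_key
    HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [('u1', 3)]
```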
Q10 · L4

Top N per group with ties

DENSE_RANK PARTITION BY, filter rk <= N. Explain DENSE_RANK over RANK over ROW_NUMBER.
Q11 · L4

7-day rolling average

AVG OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Volunteer partial-window edge case.
Q12 · L4

Self join: same-manager pairs

JOIN e1, e2 ON e1.mgr = e2.mgr AND e1.id < e2.id. Inequality eliminates self-pairs and reverse duplicates.
Q13 · L4

Pivot rows to columns

SUM(CASE WHEN type = X THEN val) AS x_total. Discuss vs PIVOT operator for portability.
Q14 · L4

EXISTS vs IN performance

EXISTS short-circuits. IN materializes. EXISTS for large or correlated; IN for small literals.
Q15 · L4

COALESCE vs CASE WHEN NULL

COALESCE returns first non-null. Cleaner than CASE for default-value substitution. Some dialects (Spark) prefer NVL or IFNULL.
Q16 · L4

DATE_TRUNC vs EXTRACT vs DATE_PART

TRUNC zeros lower units (returns date). EXTRACT pulls a number (year, month). PART aliases vary by dialect. State the dialect before answering.
Q17 · L4

UNION vs UNION ALL

UNION dedupes (expensive sort). UNION ALL keeps duplicates (no sort). Use ALL unless dedup is required.
Q18 · L4

ANTI JOIN with NOT EXISTS

SELECT * FROM left WHERE NOT EXISTS (SELECT 1 FROM right WHERE join). Faster than LEFT JOIN + WHERE right.col IS NULL on most engines.
Q19 · L4

Find rows with duplicate composite key

GROUP BY composite key, HAVING COUNT > 1. To return the duplicate rows themselves: WHERE (key1, key2) IN (SELECT ... HAVING COUNT > 1).
Q20 · L4

Conditional aggregation by year

SUM(CASE WHEN year = 2024 THEN rev END). Cleaner than separate WHERE-filtered subqueries when you need multiple year totals.
Q21 · L5

Recursive CTE for org chart

Base case: WHERE id = root. Recursive: JOIN cte ON cte.id = emp.mgr. Add depth column with WHERE depth < 20 to prevent cycles.
Q22 · L5

Sessionization with 30-min gap

LAG to get previous event ts. CASE WHEN gap > 30 min OR user changed THEN 1 ELSE 0 AS new_session. SUM new_session OVER PARTITION BY user gives session_id.
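The same flag-then-running-sum idea, sketched end to end on SQLite. Timestamps are epoch seconds so the 30-minute gap is 1800; the events table and its values are illustrative assumptions.

```python
import sqlite3

# u1 has two events 10 minutes apart, then one 40 minutes later (new session).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 0), ("u1", 600), ("u1", 3000), ("u2", 100)],
)

# new_session flags the first event of each session; a running SUM turns the
# flags into a per-user session_id. LAG is NULL on a user's first event, so
# COALESCE to a sentinel forces that event to open session 1.
rows = conn.execute("""
    SELECT user_id, ts,
           SUM(new_session) OVER (
               PARTITION BY user_id ORDER BY ts
           ) AS session_id
    FROM (
        SELECT user_id, ts,
               CASE WHEN ts - COALESCE(LAG(ts) OVER (
                        PARTITION BY user_id ORDER BY ts
                    ), -100000) > 1800
                    THEN 1 ELSE 0 END AS new_session
        FROM events
    )
    ORDER BY user_id, ts
""").fetchall()
print(rows)  # [('u1', 0, 1), ('u1', 600, 1), ('u1', 3000, 2), ('u2', 100, 1)]
```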
Q23 · L5

Median with PERCENTILE_CONT

PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val). Discuss PERCENTILE_DISC vs CONT. For huge data, mention APPROX_PERCENTILE.
Q24 · L5

Funnel: A then B within 7 days

Self-join on user, WHERE event_a.ts < event_b.ts AND event_b.ts <= event_a.ts + 7 days. Discuss vs window-function approach.
Q25 · L5

Forward-fill NULL per user

LAST_VALUE(val IGNORE NULLS) OVER (PARTITION BY user ORDER BY ts ROWS UNBOUNDED PRECEDING). Some dialects lack IGNORE NULLS and need a workaround.
Q26 · L5

Detect change-points

LAG to compare current to previous. CASE WHEN current != previous THEN 1 AS is_change. SUM is_change OVER ORDER BY ts gives change-id.
Q27 · L5

EXPLAIN plan reading

Identify full table scan vs index seek. Function in WHERE prevents pushdown. Cover partition pruning, broadcast vs shuffle joins.
Q28 · L5

Skew handling in JOINs

Identify hot keys with COUNT GROUP BY. Salt with mod-N suffix on both sides. Aggregate, then unsalt. Trade-off: extra shuffle cost.
Q29 · L5

ROWS vs RANGE in window frames

ROWS: physical rows. RANGE: groups ties. Different result on duplicate timestamps. Always specify; defaults vary.
Q30 · L5

QUALIFY for window-function filtering

Snowflake/BigQuery shortcut. SELECT ... QUALIFY ROW_NUMBER() OVER (...) = 1. Replaces the CTE-and-filter pattern. Mention dialect support.
Q31 · L5

MERGE / UPSERT for slowly-changing data

MERGE target USING source ON key WHEN MATCHED AND data differs THEN UPDATE WHEN NOT MATCHED THEN INSERT. Postgres uses INSERT ON CONFLICT.
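SQLite shares Postgres's INSERT ... ON CONFLICT syntax (SQLite 3.24+), so it can stand in for the Postgres branch of this answer. The dim_user table and values are hypothetical.

```python
import sqlite3

# Hypothetical target dimension with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_user (user_id TEXT PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO dim_user VALUES ('u1', 'old@example.com')")

# Upsert: update on key collision, insert otherwise. 'excluded' refers to
# the row that would have been inserted.
conn.executemany("""
    INSERT INTO dim_user (user_id, email) VALUES (?, ?)
    ON CONFLICT (user_id) DO UPDATE SET email = excluded.email
""", [("u1", "new@example.com"), ("u2", "u2@example.com")])

rows = conn.execute("SELECT * FROM dim_user ORDER BY user_id").fetchall()
print(rows)  # [('u1', 'new@example.com'), ('u2', 'u2@example.com')]
```

A full MERGE additionally supports the WHEN MATCHED AND data differs guard, which skips no-op updates; ON CONFLICT can approximate it with a WHERE clause on the DO UPDATE.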
Q32 · L5

Pivot with dynamic columns

Most engines need static column list. For truly dynamic: build SQL string, EXECUTE IMMEDIATE. Snowflake PIVOT supports ANY ORDER BY for dynamic.
Q33 · L5

Lateral / CROSS APPLY for row-correlated subquery

LATERAL JOIN evaluates subquery per row. Useful for top-N-per-group when window functions are awkward. CROSS APPLY in MSSQL.
Q34 · L5

Date dimension generation

Generate from a CTE: SELECT DATE '2020-01-01' + n FROM generate_series(0, 3650) AS t(n). For Snowflake: GENERATOR + ROW_NUMBER. Used to fill date gaps in reports.
Q35 · L5

Approximate count distinct (HLL)

APPROX_COUNT_DISTINCT (Snowflake), HLL_COUNT (BigQuery). Constant memory regardless of cardinality. ~2% error. Use when exact count is not required.
Q36 · L5

JSON parsing in SQL

Postgres ->, ->>, jsonb. BigQuery JSON_EXTRACT. Snowflake :, get_path. State dialect first. Mention shredding strategy for analytics on JSON-heavy tables.
Q37 · L5

Array operations: UNNEST and ARRAY_AGG

UNNEST explodes array to rows. ARRAY_AGG aggregates rows to array. Common in BigQuery and Postgres for one-to-many flattening.
Q38 · L6

Materialized view vs result cache vs incremental table

MV: precomputed, scheduled refresh. Cache: query-string match, free. Incremental: append-only with merge. Trade-off: freshness vs cost vs flexibility.
Q39 · L6

Time travel and zero-copy clones (Snowflake)

Time travel: query historical state via AT(TIMESTAMP => ...). Zero-copy clone: instant snapshot with copy-on-write storage. Used for non-prod testing without duplicating storage cost.
Q40 · L6

Iceberg vs Delta vs Hudi

All ACID over object storage. Iceberg: open spec, multi-engine. Delta: Databricks-native, best Spark integration. Hudi: best CDC support. Most companies converge on Iceberg in 2026.

Python: 25 More Questions

Beyond the top 50 Python questions, drill these for L5+ depth on data wrangling, generators, and pandas patterns.

Q41 · L4

Group records by key

defaultdict(list). Iterate, append by key. O(n) time and space. Beats sort-and-iterate for unsorted input.
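A minimal sketch of the defaultdict approach; the record shape and key name are illustrative assumptions.

```python
from collections import defaultdict

def group_by_key(records, key):
    """Group a list of dicts by one field in O(n) time and space."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]
grouped = group_by_key(rows, "user")
print(grouped["a"])  # [{'user': 'a', 'amount': 10}, {'user': 'a', 'amount': 7}]
```

One pass, no sort, and insertion order within each group is preserved.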
Q42 · L4

CSV reading with DictReader

with open(path) as f: csv.DictReader(f). Iterate rows as dicts. Mention encoding, delimiter, quotechar overrides.
Q43 · L4

Flatten nested JSON

Recurse on dict values. Concatenate keys with separator. Handle list values: explode or serialize, document the choice.
Q44 · L4

Dedup by composite key, latest

Dict keyed on tuple. Update when newer ts. O(n) time and space.
Q45 · L4

Generator for chunked CSV

yield chunks of N rows. Memory stays constant. Mention pandas read_csv chunksize.
Q46 · L4

Inner join two lists of dicts

Build dict index on smaller list. Iterate larger, lookup by key. O(n+m) time.
Q47 · L4

Sessionize with 30-min gap

Sort by user, ts. Walk list. Increment session_id when gap > threshold or user changes.
Q48 · L4

Counter for top-N frequencies

from collections import Counter. Counter(items).most_common(N). One-liner; mention itertools.groupby alternative.
Q49 · L4

Itertools.groupby for run-length encoding

groupby groups consecutive elements only, which is exactly what run-length encoding needs, so do not sort: [(k, len(list(g))) for k, g in groupby(seq)]. Sort first only when you want total frequencies rather than runs.
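A one-function sketch of the pattern:

```python
from itertools import groupby

def rle(seq):
    """Run-length encode: groupby groups adjacent equal items only,
    so repeated values separated by a gap form separate runs."""
    return [(k, len(list(g))) for k, g in groupby(seq)]

print(rle("aaabbaa"))  # [('a', 3), ('b', 2), ('a', 2)]
```

Note the two separate 'a' runs: sorting first would merge them, giving frequencies instead of runs.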
Q50 · L4

Functools.reduce for accumulation

reduce(lambda acc, x: acc + x, seq, 0). Used for fold operations. Often clearer to use sum / explicit loop.
Q51 · L5

LRU cache from scratch

OrderedDict. On get: move_to_end. On put: insert, popitem(last=False) if over capacity. Or dict + doubly linked list for O(1).
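A minimal OrderedDict sketch of the cache; class and method names are the conventional ones, not from any specific interview.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache: OrderedDict tracks recency order, giving O(1) get/put."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark key as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touching 'a' makes 'b' the eviction candidate
cache.put("c", 3)    # capacity exceeded: evicts 'b'
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

The dict-plus-doubly-linked-list variant does the same thing by hand; mentioning that OrderedDict is itself backed by a linked list is a nice follow-up answer.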
Q52 · L5

Parse log line with regex, handle malformed

Compile pattern outside loop. Named groups. try/except, dead-letter list for malformed.
Q53 · L5

Stream-merge sorted iterators

heapq.merge yields lazily. O(n log k). Beats materialize-then-sort for k iterators.
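A short sketch; the three sorted inputs are arbitrary examples.

```python
import heapq

# heapq.merge consumes the k sorted inputs lazily: O(n log k) total work,
# O(k) memory, versus materializing everything and sorting at O(n log n).
a = iter([1, 4, 9])
b = iter([2, 3, 10])
c = iter([0, 5])

merged = list(heapq.merge(a, b, c))
print(merged)  # [0, 1, 2, 3, 4, 5, 9, 10]
```

In a pipeline context the inputs would be file or network iterators, and you would consume merge's output lazily instead of calling list.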
Q54 · L5

Concurrent fetch with rate limit

asyncio + Semaphore(N). Bounds in-flight requests. Discuss why rate limiting matters for downstream services.
Q55 · L5

Pandas SCD Type 2 merge

Identify rows where source differs from target. Expire current, insert new. Use pd.merge with indicator.
Q56 · L5

Pandas pivot_table with aggfunc and fill_value

pd.pivot_table(df, index, columns, values, aggfunc='sum', fill_value=0). Cleaner than .groupby().unstack().
Q57 · L5

Pandas window operations: rolling and expanding

df.rolling(window=7).mean(). df.expanding().sum(). Equivalent to SQL window functions.
Q58 · L5

Pandas merge_asof for time-aligned join

pd.merge_asof for nearest-key join (e.g., feature ts <= label ts). Used in ML feature engineering for point-in-time correctness.
Q59 · L5

Pandas chunked groupby for large data

Iterate chunks, partial agg per chunk, combine at end. For groupby that exceeds memory. Or: dask / polars / spark.
Q60 · L5

Type hints with TypedDict and dataclasses

TypedDict for record schemas. dataclass for value objects. Both improve IDE support and static checking; add runtime validation via Pydantic if needed.
Q61 · L5

Context manager with __enter__ and __exit__

Class with __enter__ (returns resource) and __exit__ (cleanup). Or @contextmanager decorator with yield.
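Both shapes of the answer, sketched side by side. The "resource" here is a plain event list standing in for a connection or file handle (a deliberate simplification).

```python
from contextlib import contextmanager

class ManagedResource:
    """Class-based context manager: acquire in __enter__, release in __exit__."""
    def __enter__(self):
        self.events = ["open"]
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("close")  # cleanup runs even if the body raised
        return False  # returning False re-raises any exception

@contextmanager
def managed_resource():
    """Generator-based equivalent: code before yield is __enter__,
    the finally block is __exit__."""
    events = ["open"]
    try:
        yield events
    finally:
        events.append("close")

with ManagedResource() as r:
    r.events.append("work")
print(r.events)  # ['open', 'work', 'close']

with managed_resource() as ev:
    ev.append("work")
print(ev)  # ['open', 'work', 'close']
```

The decorator form is shorter; the class form is clearer when the manager needs state or is reused across methods.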
Q62 · L5

Custom exception with chained context

raise CustomError('msg') from original_error. Preserves traceback. Common in ETL error handling for actionable failure messages.
Q63 · L6

Multiprocessing vs threading for I/O vs CPU

I/O bound: threading or asyncio. CPU bound: multiprocessing (GIL constraint). Discuss process pool sizing, shared memory, pickling overhead.
Q64 · L6

Cython, numba, or polars for performance

Cython: C-level perf with Python syntax. Numba: JIT for numeric code. Polars: pandas-replacement, Rust-backed. Pick based on workload type.
Q65 · L6

Property-based testing with hypothesis

@given with strategy objects (e.g. @given(st.integers())) generates inputs. Catches edge cases unit tests miss. Used for critical data-quality functions where the input space is large.

Data Modeling: 20 Questions

Schema design, SCD, conformed dimensions, and modern lakehouse patterns. Practice drawing on a whiteboard while narrating the grain first.

Q66 · L4

Star schema for e-commerce

Grain: one row per order line item. Fact: line totals, FKs. Dims: customer (split Type 1 + Type 2), product, date.
Q67 · L4

Define grain of fact table

One sentence: 'one row per X per Y'. State before drawing. Determines if rest of model is right.
Q68 · L4

Surrogate vs natural keys

Surrogate when natural is unstable, when SCD Type 2 needed, or when natural is composite. Natural for stable, simple, human-readable joins.
Q69 · L4

Fact vs dimension classification

Facts: events, measures, tall and skinny. Dimensions: descriptive context, short and wide. Trick: 'dim_event' is usually a fact poorly named.
Q70 · L4

Star vs snowflake schema

Star: dims denormalized. Snowflake: dims normalized. Star wins for analytics in 90% of cases. Snowflake when a dimension is huge and its normalized sub-dimensions are rarely queried together.
Q71 · L4

Conformed dimension across marts

Same dim_customer across all marts. Single source of truth. Critical for cross-mart joins to work cleanly.
Q72 · L5

SCD Type 1 vs Type 2 vs Type 3

Type 1: overwrite. Type 2: history via valid_from/valid_to. Type 3: previous-value column. Pick by query pattern: do you need history? do you need point-in-time?
Q73 · L5

SCD Type 2 implementation

Surrogate, natural, valid_from, valid_to, is_current. Expire-and-insert pattern. Facts join on surrogate for point-in-time correctness.
Q74 · L5

Slowly changing facts (corrections)

Append-only with version column (audit), or in-place with audit log table. Trade-off: query simplicity vs storage vs auditability.
Q75 · L5

Bridge table for many-to-many

Two FKs, one to each dim. Optional weighting for fractional attribution. Common in healthcare, retail, ad tech.
Q76 · L5

Late-arriving dimensions

Insert placeholder dim row, mark is_late=true. Update when real arrives. Or backfill facts. Volunteering this is a senior signal.
Q77 · L5

Late-arriving facts

Re-aggregate downstream rollups when late fact lands. Or: process via separate late-data pipeline. Trade-off: complexity vs SLA.
Q78 · L5

Medallion architecture trade-offs

Bronze: raw, immutable. Silver: cleaned, conformed. Gold: business-ready. Trade-off: storage 3x vs query perf vs auditability vs reprocessing.
Q79 · L5

Iceberg vs Delta time-travel for SCD

Both support time travel. Often replaces SCD Type 2 for slowly-changing-dim use case. Trade-off: query simplicity vs file overhead vs metadata size.
Q80 · L5

Partitioning strategy for fact tables

Date partition for time-series facts. Composite (date + region) for multi-tenant. Avoid high-cardinality partitions (>10K). Mention partition pruning impact.
Q81 · L5

Clustering keys (Snowflake) and Z-ordering (Delta)

Both reduce scanned data on filter queries. Clustering keys: physical sort within micro-partitions. Z-order: multi-column locality. Pick top 2-3 most-filtered columns.
Q82 · L5

Schema evolution: adding nullable column

Always backward compatible. Producers add field, consumers ignore. Trickier: changing types, removing fields. Use Avro/Protobuf with schema registry.
Q83 · L5

Wide table vs star schema for analytics

Wide: pre-joined, denormalized. Star: facts + dims, joined per query. Wide wins for ML training and dashboard performance. Star wins for ad-hoc analysis.
Q84 · L6

Data Vault 2.0 vs Kimball

Vault: hubs, links, satellites. Audit-trail-friendly. Kimball: star schema, simpler queries. Pick Vault for regulated industries; Kimball for everything else.
Q85 · L6

Multi-region data model with conflict resolution

Region-local writes, async cross-region replication. Last-writer-wins for most cases; CRDTs for counters. Cost: 2x storage, complex consistency.

System Design: 10 Architectures

Use the 4-step framework: clarify, draw, narrate, fail. 60 minutes per architecture in practice.

Q86 · L4

Daily ETL Postgres -> Snowflake

Debezium CDC -> Kafka -> S3 raw -> Spark ETL with run_id -> Snowflake MERGE. Idempotent. Backfill via DAG params.
Q87 · L5

Real-time clickstream at 200K events/sec

Kafka 100 partitions key=user_id -> Flink stateful exactly-once sessionize -> S3 + Materialize. Hourly Spark to Snowflake. Cover 3 failure modes.
Q88 · L5

Online + offline ML feature store

Real-time: Flink -> Redis (10ms reads). Batch: Spark -> S3 feature parquet. Training: as_of_join with feature_ts <= label_ts to prevent leakage.
Q89 · L5

Daily reconciliation for payments

Debezium -> Kafka -> S3 raw immutable -> idempotent Spark with run_id -> Snowflake MERGE on (txn_id, run_id). Audit by source_event_id.
Q90 · L5

A/B test analysis pipeline

Event ingest -> exposure log -> outcome log -> daily aggregation by experiment_id and variant. Statistical significance computed in serving layer (Materialize or downstream service).
Q91 · L5

Recommendation feature pipeline

User events -> Kafka -> Flink (real-time features) + Spark daily (batch features) -> dual-write to Redis and S3. Feast catalog for discovery. Point-in-time correct training data.
Q92 · L5

Search index pipeline

Source documents -> Kafka -> Flink (extract, enrich) -> dual-write to Elasticsearch + S3. Full reindex via Spark batch when schema changes. Cover lag, hot keys, version cutover.
Q93 · L5

Multi-tenant data warehouse with row-level security

Single warehouse with tenant_id partition. Row-level security via masking policy. Cost-allocation via tenant-tagged jobs. Trade-off: shared infra cost vs blast radius.
Q94 · L6

Multi-region active-active warehouse

Region-local writes, async cross-region replication. Conflict resolution: last-writer-wins or CRDT. Tiered SLAs. Cost: 2x storage minimum, complex consistency model.
Q95 · L6

Cost-optimized lakehouse with tiered storage

Hot (1 day, S3 standard, fully indexed). Warm (1-30 days, S3 IA). Cold (30+ days, Glacier, queryable via Athena with SLA). Compaction job nightly. Trade-off: cost vs query latency.

Behavioral: 5 STAR-D Stories

Prepare 5 stories covering the 5 evergreen themes. Specific numbers required. End each with a decision postmortem.

Q96 · L4-L5

Project with measurable impact

STAR-D: specific numbers (latency, dollars, hours, consumers), then what you would do differently. The postmortem is the L5 signal.
Q97 · L4-L5

Disagreement with stakeholder

How you held position with data, listened to counter, changed mind when warranted. Wrong: you've never been wrong. Right: a specific resolution and lesson.
Q98 · L5-L6

Real failure with consequences

Real failure with real cost. Root cause, process change, what you'd tell someone facing same setup today. Faux failures are instant downgrade.
Q99 · L5

Project with ambiguous requirements

How you framed decisions, gathered inputs, committed when commitment mattered more than certainty. Avoid 'we did agile' framing; describe the actual decision.
Q100 · L5-L6

Leading without authority or mentoring

Specific person you mentored or specific decision you championed. Who pushed back. How you brought them along. Outcome 6+ months later.

How to Use the 100

Drill in domain order: SQL (16h), Python (10h), modeling (8h), design (10h), behavioral (3h). Total: 47 hours of focused practice. At 2 hours per day, that's 4 weeks.

Pair with the round deep guides for context: how to pass the SQL round, how to pass the Python round, how to pass the data modeling round, how to pass the system design round, how to pass the behavioral round.

Targeting FAANG specifically? After drilling these 100, open FAANG Data Engineer interview questions and answers for FAANG-tagged variants.

Data Engineer Interview Prep FAQ

How is this list different from the top 50?
The top 50 is the highest-leverage subset for time-pressed prep. This list is the same 50 plus 50 more for additional depth. The 100 covers more dialect-specific tricks, more L5+ patterns, and more system-design variants.
Are all 100 questions answered in full?
Yes, in this on-page version and in the downloadable PDF. Each question has a worked answer with reasoning, edge cases, and the typical follow-up.
How long should I take to drill all 100?
47 hours of focused practice spread over 4 to 6 weeks. Don't cram. Spaced repetition over weeks beats marathon sessions.
Can I skip the behavioral section if I'm focused on technical?
No. 47% of L5 rejections cite behavioral as the deciding factor even when technical was strong. The 5 behavioral questions take 3 hours; skipping them is a high-leverage mistake.
What if I see a question on this list in my interview?
Treat it like any other. Restate, plan out loud, code while narrating. Do not signal you've seen it; that can be a downgrade in some loops.
Does this cover analytics engineer questions?
About 60% overlaps. For analytics-engineer-specific prep, also see our analytics engineer interview guide for dbt, semantic layer, and BI workflow questions.

Practice the 100 Questions in the Browser

Reading the answers is step one. Run SQL and Python in our sandbox to build the muscle memory that wins the offer.

Start Practicing Now

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
