Curated Question Bank

50 Data Engineer Interview Questions

The 50 most frequently asked data engineer interview questions in 2026, with worked answers. Selection criterion: each question appears in at least eight reported interview loops in our dataset of 1,042 reports collected from 2024 to 2026. The list spans SQL (20), Python (12), data modeling (10), system design (5), and behavioral (3). Pair this with the round-specific deep guides in our data engineer interview prep hub.

The Short Answer
The 50 questions below are the highest-leverage practice set if you have 4 weeks or less to prep. Drill all 50 with timed answers spoken out loud. The SQL section takes 8 hours, Python 5, modeling 4, design 4, behavioral 2. Total: 23 focused practice hours produces interview-ready performance for L4 candidates and a strong foundation for L5. For deeper prep, see the definitive top 100 Data Engineer interview questions for double the coverage.
Updated April 2026 · By The DataDriven Team

20 SQL Questions

SQL appears in 95% of Data Engineer loops. Drill these until medium problems take under 12 minutes and hard problems under 20.

Q1 · L3

Find duplicate rows in a table

GROUP BY all columns, HAVING COUNT(*) > 1. Edge case: NULL values do not match in equality, so duplicate-with-NULL rows can hide. Use COALESCE if NULL means duplicate.
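A runnable sketch of the pattern using Python's bundled sqlite3 (table and column names are illustrative):

```python
import sqlite3

# In-memory table with one duplicated row (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    ("ana", "a@x.com"), ("ana", "a@x.com"), ("bo", "b@x.com"),
])

# GROUP BY every column; HAVING keeps groups that occur more than once.
dupes = conn.execute("""
    SELECT name, email, COUNT(*) AS n
    FROM users
    GROUP BY name, email
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('ana', 'a@x.com', 2)]
```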
Q2 · L3

Find the second highest salary

Use DENSE_RANK to handle ties. SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employees) WHERE rk = 2. LIMIT 1 OFFSET 1 fails on ties.
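A minimal demonstration via sqlite3 (assumes a SQLite build with window functions, 3.25+; data is illustrative). With tied top salaries, DENSE_RANK still lands on the true second value where LIMIT 1 OFFSET 1 would not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 300), ("b", 300), ("c", 200), ("d", 100)])

# The tied 300s share rank 1, so rank 2 is the true second-highest, 200.
row = conn.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
        FROM employees
    ) WHERE rk = 2
""").fetchone()
print(row[0])  # 200
```

LIMIT 1 OFFSET 1 on this data would return 300 again, which is the trap the question checks for.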
Q3 · L4

Deduplicate keeping most recent per user

ROW_NUMBER OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn in CTE, filter rn = 1. Tie-break with secondary sort key. Better than DISTINCT for retaining all columns.
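A sketch of the CTE-plus-filter pattern in sqlite3 (window functions assumed; schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INT, val TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", 1, "old"), ("u1", 2, "new"), ("u2", 5, "only"),
])

# Rank rows per user, newest first; keep rank 1. Every column survives,
# which plain DISTINCT on user_id could not give you.
rows = conn.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    SELECT user_id, ts, val FROM ranked WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 2, 'new'), ('u2', 5, 'only')]
```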
Q4 · L4

Calculate month-over-month revenue growth %

DATE_TRUNC to month, SUM, then LAG. (current - previous) / NULLIF(previous, 0) * 100. NULLIF prevents division-by-zero on first month.
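The same logic in sqlite3, which has no DATE_TRUNC, so strftime('%Y-%m', ...) stands in for it (data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, rev REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2026-01-05", 100.0), ("2026-01-20", 100.0), ("2026-02-10", 300.0),
])

# NULLIF(prev, 0) turns a zero denominator into NULL instead of an error;
# the first month has no LAG value, so its growth is NULL.
rows = conn.execute("""
    WITH monthly AS (
        SELECT strftime('%Y-%m', order_date) AS month, SUM(rev) AS rev
        FROM orders GROUP BY 1
    )
    SELECT month,
           (rev - LAG(rev) OVER (ORDER BY month)) * 100.0
               / NULLIF(LAG(rev) OVER (ORDER BY month), 0) AS growth_pct
    FROM monthly
    ORDER BY month
""").fetchall()
print(rows)  # [('2026-01', None), ('2026-02', 50.0)]
```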
Q5 · L4

Users active for 3+ consecutive days

Gap-and-island: ROW_NUMBER per user, subtract from date. Same difference = same streak. GROUP BY user, streak_key, HAVING COUNT >= 3.
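A gap-and-island sketch in sqlite3, using julianday to do the date-minus-row-number arithmetic (schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id TEXT, day TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    ("u1", "2026-03-01"), ("u1", "2026-03-02"), ("u1", "2026-03-03"),
    ("u1", "2026-03-07"),  # breaks the streak
    ("u2", "2026-03-01"), ("u2", "2026-03-03"),
])

# Consecutive days share the same (day - row_number) value, so that
# difference is the streak key ("island").
rows = conn.execute("""
    WITH numbered AS (
        SELECT user_id,
               julianday(day) - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY day) AS streak_key
        FROM logins
    )
    SELECT user_id
    FROM numbered
    GROUP BY user_id, streak_key
    HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [('u1',)]
```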
Q6 · L4

Top N per group with ties

DENSE_RANK PARTITION BY group ORDER BY metric DESC, filter rk <= N. Explain why DENSE_RANK over RANK or ROW_NUMBER.
Q7 · L4

7-day rolling average of daily revenue

AVG(rev) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Mention partial-window edge case for first 6 rows before being asked.
Q8 · L4

Self-join: pairs of employees same manager

JOIN employees e1, e2 ON e1.mgr = e2.mgr AND e1.id < e2.id. Inequality condition prevents self-pairs and reverse duplicates.
Q9 · L4

Pivot rows to columns with conditional aggregation

SUM(CASE WHEN type = 'X' THEN val END) AS x_total. Discuss vs PIVOT operator (Snowflake, MSSQL) for portability.
Q10 · L4

EXISTS vs IN performance

EXISTS short-circuits on first match. IN materializes the subquery. EXISTS preferred for large or correlated subqueries; IN preferred for small literal lists.
Q11 · L5

Recursive CTE for org chart

Base case: WHERE id = root. Recursive: JOIN cte ON cte.id = emp.mgr. Add a depth column with WHERE depth < 20 to bound the recursion in case the data contains a cycle.
Q12 · L5

Sessionization with 30-min gap

LAG to get previous event. CASE WHEN gap > 30 min OR user changed THEN 1 ELSE 0 AS new_session. SUM new_session OVER (PARTITION BY user ORDER BY ts) AS session_id.
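A sessionization sketch in sqlite3 with timestamps as epoch seconds (a 30-minute gap is 1800s; schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INT)")  # ts in seconds
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("u1", 0), ("u1", 600), ("u1", 4000),  # 4000 - 600 exceeds the 1800s gap
    ("u2", 100),
])

# Flag a new session when the gap exceeds 30 min; a running SUM of the
# flags then numbers the sessions per user.
rows = conn.execute("""
    WITH flagged AS (
        SELECT user_id, ts,
               CASE WHEN ts - LAG(ts) OVER (
                        PARTITION BY user_id ORDER BY ts) > 1800
                    THEN 1 ELSE 0 END AS new_session
        FROM events
    )
    SELECT user_id, ts,
           SUM(new_session) OVER (
               PARTITION BY user_id ORDER BY ts) AS session_id
    FROM flagged
    ORDER BY user_id, ts
""").fetchall()
print(rows)
# [('u1', 0, 0), ('u1', 600, 0), ('u1', 4000, 1), ('u2', 100, 0)]
```

The PARTITION BY in both windows handles the user-changed case: the running sum restarts at 0 for each user.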
Q13 · L5

Median with PERCENTILE_CONT

PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val). Discuss vs PERCENTILE_DISC. Most dialects support it; some engines (Spark, Hive) steer you toward approximate percentiles, which are faster at scale.
Q14 · L5

Find users who did A then B within 7 days

Self-join on user_id, WHERE event_a.ts < event_b.ts AND event_b.ts <= event_a.ts + INTERVAL '7 days'. Discuss vs window-function approach.
Q15 · L5

Fill forward NULL values per user

LAST_VALUE(val IGNORE NULLS) OVER (PARTITION BY user ORDER BY ts ROWS UNBOUNDED PRECEDING). Some dialects need workaround for IGNORE NULLS.
Q16 · L5

Detect change-points in a time series

LAG to compare current to previous. CASE WHEN current != previous THEN 1 ELSE 0 END AS is_change. SUM(is_change) OVER (ORDER BY ts) gives a change-segment id.
Q17 · L5

EXPLAIN plan reading and predicate pushdown

Identify full table scan vs index seek. Explain why a function in WHERE prevents pushdown. Cover partition pruning and column-store implications.
Q18 · L5

Skew handling in JOINs

Identify hot keys with COUNT GROUP BY join key. Salt with mod-N suffix on both sides. Aggregate then unsalt. Trade-off: extra shuffle cost vs balanced workload.
Q19 · L5

ROWS vs RANGE in window frames

ROWS counts physical rows. RANGE groups ties (rows with equal ORDER BY value). Different result on duplicate timestamps. Always specify; defaults vary by dialect.
Q20 · L6

Materialized view vs result cache vs incremental table

MV: precomputed, refreshed on schedule. Result cache: query-string match, free. Incremental table: append-only with merge logic. Trade-off: freshness vs cost vs flexibility.

12 Python Questions

Vanilla Python preferred. Pandas only when allowed. Drill these without autocomplete to build muscle memory.

Q21 · L3

Group records by a key

defaultdict(list). Iterate, append by key. O(n) time and space. Beats sort-and-iterate for unsorted input.
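A minimal sketch (function name and sample records are illustrative):

```python
from collections import defaultdict

def group_by_key(records, key):
    """Group dicts by records[key] in one O(n) pass."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

rows = [{"city": "NY", "v": 1}, {"city": "SF", "v": 2}, {"city": "NY", "v": 3}]
print(group_by_key(rows, "city"))
# {'NY': [{'city': 'NY', 'v': 1}, {'city': 'NY', 'v': 3}],
#  'SF': [{'city': 'SF', 'v': 2}]}
```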
Q22 · L3

Read CSV with csv.DictReader

with open(path, encoding='utf-8') as f: reader = csv.DictReader(f). Iterate rows as dicts. Mention encoding (open() defaults to the locale encoding, so pass utf-8 explicitly) and the delimiter override.
Q23 · L4

Flatten nested JSON

Recurse on dict values. Concatenate keys with separator. Handle list values: explode or serialize, document the choice.
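A recursive sketch that serializes lists as-is (the separator and the list-handling choice are the assumptions to state out loud):

```python
def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with sep.
    Lists are left as values here; exploding them is the other option."""
    out = {}
    for k, v in obj.items():
        key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            out.update(flatten(v, key, sep))  # recurse and merge
        else:
            out[key] = v
    return out

print(flatten({"a": 1, "b": {"c": 2, "d": {"e": 3}}}))
# {'a': 1, 'b.c': 2, 'b.d.e': 3}
```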
Q24 · L4

Dedup by composite key, keep latest

Dict keyed on tuple. Update when newer ts. O(n) time, O(n) space. Better than sort-then-iterate for clarity.
Q25 · L4

Generator for chunked CSV reading

yield chunks of N rows. Memory stays constant regardless of file size. Mention pandas read_csv chunksize as production equivalent.
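A generator sketch using islice over a DictReader (the in-memory StringIO stands in for a real file):

```python
import csv
import io
from itertools import islice

def read_chunks(file_obj, size):
    """Yield lists of up to `size` parsed rows; memory stays O(size)."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

data = io.StringIO("id,val\n1,a\n2,b\n3,c\n")
chunks = list(read_chunks(data, 2))
print([len(c) for c in chunks])  # [2, 1]
```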
Q26 · L4

Inner join two lists of dicts on key

Build dict index on smaller list. Iterate larger list, look up by key. O(n+m) time, O(min(n,m)) space.
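A hash-join sketch (function name and sample data are illustrative; on key collision the probe-side record wins the dict merge, which is worth saying out loud):

```python
def inner_join(left, right, key):
    """Hash join: index the smaller side, probe with the larger."""
    small, big = (left, right) if len(left) <= len(right) else (right, left)
    index = {}
    for rec in small:
        index.setdefault(rec[key], []).append(rec)  # handle duplicate keys
    out = []
    for rec in big:
        for match in index.get(rec[key], []):
            out.append({**match, **rec})  # probe-side fields win collisions
    return out

users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bo"}]
orders = [{"id": 1, "total": 9}, {"id": 1, "total": 5}, {"id": 3, "total": 7}]
print(inner_join(users, orders, "id"))
# [{'id': 1, 'name': 'ana', 'total': 9}, {'id': 1, 'name': 'ana', 'total': 5}]
```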
Q27 · L4

Sessionize events with 30-min gap

Sort by user, ts. Walk list. Increment session_id when gap exceeds threshold or user changes. Edge case: events with same ts.
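A sketch of the walk (field names and the 1800s threshold are illustrative; the sort is stable, so same-timestamp events keep input order and share a session):

```python
def sessionize(events, gap_sec=1800):
    """Assign session ids; new session when user changes or gap > threshold."""
    events = sorted(events, key=lambda e: (e["user"], e["ts"]))
    session_id = -1
    prev = None
    out = []
    for e in events:
        if prev is None or e["user"] != prev["user"] \
                or e["ts"] - prev["ts"] > gap_sec:
            session_id += 1  # user boundary or gap: start a new session
        out.append({**e, "session_id": session_id})
        prev = e
    return out

evts = [{"user": "u1", "ts": 0}, {"user": "u1", "ts": 4000},
        {"user": "u2", "ts": 10}]
print([e["session_id"] for e in sessionize(evts)])  # [0, 1, 2]
```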
Q28 · L5

LRU cache from scratch

OrderedDict. On get: move_to_end. On put: insert, then popitem(last=False) if over capacity. Or implement with dict + doubly linked list for O(1) operations.
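The OrderedDict version as a sketch (the dict-plus-linked-list variant is the follow-up; class shape is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache on OrderedDict: most recently used sits at the right end."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)          # overwrite counts as a use
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a" so "b" becomes the eviction candidate
cache.put("c", 3)  # evicts "b"
print(cache.get("b"), cache.get("a"))  # None 1
```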
Q29 · L5

Parse log line with regex, handle malformed

Compile pattern once outside loop. Use named groups. Wrap in try/except, send malformed lines to dead-letter list, return both clean and malformed.
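A sketch with an Apache-ish pattern (the regex and log format are illustrative, not a production parser):

```python
import re

# Compiled once at module level, not per line; named groups document fields.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

def parse_lines(lines):
    """Return (parsed, dead_letter); malformed lines are kept, not dropped."""
    parsed, dead = [], []
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            parsed.append(m.groupdict())
        else:
            dead.append(line)  # dead-letter for inspection / replay
    return parsed, dead

good = '1.2.3.4 - - [01/Apr/2026:10:00:00] "GET / HTTP/1.1"'
ok, bad = parse_lines([good, "garbage line"])
print(len(ok), len(bad))  # 1 1
```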
Q30 · L5

Stream-merge sorted iterators

heapq.merge yields lazily. O(n log k) where k is number of iterators. Beats materializing all and sorting.
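A one-liner demonstration (inputs are illustrative):

```python
import heapq

a = iter([1, 4, 7])
b = iter([2, 5, 8])
c = iter([3, 6, 9])

# heapq.merge pulls one element at a time from each sorted iterator, so
# nothing is materialized: O(n log k) time, O(k) memory for k iterators.
merged = list(heapq.merge(a, b, c))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```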
Q31 · L5

Concurrent fetch with rate limit

asyncio + Semaphore(N). Bounds in-flight requests. Mention threading.Semaphore for sync code. Discuss why rate limiting matters for downstream services.
Q32 · L5

Pandas: SCD Type 2 merge logic

Identify rows where source differs from target. For changed rows: expire current (set valid_to, is_current=False), insert new (valid_from=now, is_current=True). Use merge with indicator.

10 Data Modeling Questions

Schema design and trade-off defense. Practice drawing on a whiteboard, narrating the grain first.

Q33 · L4

Star schema for e-commerce

Grain: one row per order line item. Fact: order_item_id, FKs, quantity, unit_price, total. Dims: customer (Type 1 + Type 2), product, date. Volunteer 3 trade-offs.
Q34 · L4

Define grain of a fact table

One sentence: 'one row per X per Y'. Determines whether the rest of the model is right. State before drawing.
Q35 · L4

Surrogate vs natural keys

Surrogate when natural key is unstable, when SCD Type 2 is needed, or when natural key is composite. Natural when stable, simple, human-readable for joins.
Q36 · L5

SCD Type 2 implementation

Surrogate, natural, valid_from, valid_to, is_current. Expire-and-insert pattern on change. Facts join on surrogate for point-in-time correctness.
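A sketch of the expire-and-insert step in plain Python (column names, the in-memory list of dicts, and the helper are all illustrative; a warehouse would do this with MERGE):

```python
from datetime import datetime, timezone

def scd2_upsert(dim_rows, incoming, natural_key, tracked_cols, now=None):
    """Expire-and-insert for SCD Type 2 on an in-memory dimension."""
    now = now or datetime.now(timezone.utc).isoformat()
    current = {r[natural_key]: r for r in dim_rows if r["is_current"]}
    for rec in incoming:
        cur = current.get(rec[natural_key])
        if cur and all(cur[c] == rec[c] for c in tracked_cols):
            continue                      # unchanged: nothing to do
        if cur:                           # changed: expire the current row
            cur["valid_to"] = now
            cur["is_current"] = False
        dim_rows.append({natural_key: rec[natural_key],
                         **{c: rec[c] for c in tracked_cols},
                         "valid_from": now, "valid_to": None,
                         "is_current": True})
    return dim_rows

dim = [{"cust_id": 1, "tier": "bronze", "valid_from": "2025-01-01",
        "valid_to": None, "is_current": True}]
dim = scd2_upsert(dim, [{"cust_id": 1, "tier": "gold"}],
                  "cust_id", ["tier"], now="2026-04-01")
print(len(dim), dim[0]["is_current"], dim[1]["tier"])  # 2 False gold
```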
Q37 · L5

Conformed dimensions across data marts

Same dim_customer schema across marts. Single source of truth. Avoids analyst confusion when joining across marts. Stating this without prompting is a senior signal.
Q38 · L5

Slowly changing fact (corrections)

Two patterns: append-only with version column (audit trail), or in-place update with audit log table. Trade-off: query simplicity vs storage cost vs auditability.
Q39 · L5

Bridge table for many-to-many

Two FKs, one to each dim. Optional weighting factor for fractional attribution. Common in healthcare (patient-diagnosis), retail (product-category), ad tech (impression-conversion).
Q40 · L5

Late-arriving dimensions

Insert placeholder dim row with surrogate key, mark is_late=true. Update when real row arrives. Or: backfill facts after dim arrives. Most candidates skip; raising it is a senior signal.
Q41 · L5

Medallion architecture trade-offs

Bronze: raw, immutable. Silver: cleaned, conformed. Gold: business-ready aggregates. Trade-off: storage cost (3x) vs query performance vs auditability vs reprocessing flexibility.
Q42 · L6

Data Vault 2.0 vs Kimball

Vault: hubs, links, satellites. Better for regulated industries needing audit trail. Kimball: star schema, simpler queries. Most analytics workloads pick Kimball; financial / pharma pick Vault.

5 System Design Questions

60-minute design rounds. Use the 4-step framework: clarify requirements, draw the architecture, narrate the data flow, walk the failure modes.

Q43 · L4

Daily ETL from Postgres to Snowflake

Debezium CDC -> Kafka -> S3 raw -> Spark daily ETL with run_id -> Snowflake MERGE. Idempotent, deterministic. Backfill via DAG params.
Q44 · L5

Real-time clickstream pipeline at 200K events/sec

Kafka 100 partitions, key=user_id -> Flink stateful, exactly-once, sessionize 30-min gap -> S3 + Materialize. Hourly Spark to Snowflake. Cover 3 failure modes.
Q45 · L5

Online + offline ML feature store

Real-time: Flink -> Redis (10ms reads). Batch: Spark -> S3 feature parquet -> Feast catalog. Training: as_of_join with feature_ts <= label_ts to prevent leakage.
Q46 · L5

Daily reconciliation pipeline for payments

Postgres -> Debezium -> Kafka -> S3 raw immutable -> idempotent Spark with run_id -> Snowflake MERGE on (txn_id, run_id). Audit by source_event_id.
Q47 · L6

Multi-region active-active data warehouse

Region-local writes, async cross-region replication via CDC stream. Conflict resolution: last-writer-wins or CRDT for counters. SLA tiers: real-time vs eventually consistent. Cost: 2x storage minimum.

3 Behavioral Questions

STAR-D format. Specific numbers required. End with a decision postmortem.

Q48 · L4-L5

Tell me about a project with measurable impact

Situation, Task, Action, Result with specific numbers (latency, dollars, hours, downstream consumers), then Decision postmortem: what you would do differently and why. The postmortem is the L5 signal.
Q49 · L4-L5

Tell me about a disagreement with a stakeholder

How you held a position with data, listened to the counter, changed your mind when warranted. Wrong answer: you have never been wrong. Right answer: a specific resolution and what it taught you.
Q50 · L5-L6

Tell me about a real failure

Real failure with real consequence. Root cause, what you changed in your process, and what you would tell someone facing the same setup today. Faux failures ('I worked too hard') are an instant downgrade.

How to Use the 50 Questions

Drill all 50 over 4 weeks. Speak the answers out loud. Time yourself: SQL medium under 12 min, hard under 20. Python medium under 15, hard under 25. Modeling under 10 minutes per schema. Design under 60 min per architecture.

Pair the questions with the round-specific deep guides: window functions and SQL patterns interviewers test, vanilla Python patterns interviewers test, star schema and SCD round prep, system design framework for data engineers, behavioral interview prep for Data Engineer. The deep guides explain the framework; this list gives you the practice volume.

Targeting a specific company? After drilling these 50, open the matching company guide: Stripe data engineering interview prep, Airbnb data engineering interview prep, Netflix data engineering interview prep, etc.

Data Engineer Interview Prep FAQ

Are these the only 50 questions I should prep?
If you have 4 weeks or less, yes. If you have more time, use this as the core set and expand to the top 100 question bank for additional coverage. Companies do ask questions outside any curated list, but the 50 here are the highest-leverage practice set.
Why only 5 system design questions?
System design is depth over breadth. Five well-prepared architectures (clickstream, daily ETL, ML feature store, reconciliation, multi-region) give you the patterns to handle most variants. Drilling 20 design problems shallowly is worse than drilling 5 deeply.
How do I know if my answer to a question is good enough?
Time it (medium SQL under 12 min). State the edge case unprompted. Volunteer one trade-off. Add a 'what changes at 100x scale' sentence. If your answer hits all four, it's interview-ready.
Should I memorize the SQL syntax or write from scratch each time?
Write from scratch each time. Memorized solutions break under follow-up questions. Practice the patterns until you can derive the SQL on the spot.
What if I see a question on this list in my interview?
Treat it like any other: restate, plan out loud, code while narrating, test, discuss edge cases. Do not signal that you've seen it before; that can be a downgrade signal in some loops.
How does this list compare to LeetCode or DataLemur?
LeetCode is algorithmic; this list is data engineering. DataLemur overlaps on SQL questions. The advantage of this list is the level tags and the focus on data-engineering patterns (sessionization, deduplication, idempotency) instead of algorithm puzzles.

Run the 50 Questions in the Browser

Reading the answers is step one. Run SQL and Python against real schemas in our sandbox to build the muscle memory that gets you the offer.

Start Practicing Now

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
