Curated Question Bank

50 Data Engineer Interview Questions

The 50 most frequently asked data engineer interview questions in 2026, with worked answers. Selection criterion: each question appears in at least eight reported interview loops in our dataset of 1,042 reports collected from 2024 to 2026. The list spans SQL (20), Python (12), data modeling (10), system design (5), and behavioral (3). Pair this with the round-specific deep guides in our data engineer interview prep hub.

The Short Answer
The 50 questions below are the highest-leverage practice set if you have 4 weeks or less to prep. Drill all 50 with timed answers spoken out loud. The SQL section takes 8 hours, Python 5, modeling 4, design 4, behavioral 2. Total: 23 focused practice hours produces interview-ready performance for L4 candidates and a strong foundation for L5. For deeper prep, see the definitive top 100 Data Engineer interview questions for double the coverage.
Updated April 2026 · By The DataDriven Team

20 SQL Questions

SQL appears in 95% of Data Engineer loops. Drill these until medium problems take under 12 minutes and hard problems under 20.

Q1 · L3

Find duplicate rows in a table

GROUP BY all columns, HAVING COUNT(*) > 1. Edge case: NULL values do not match in equality, so duplicate-with-NULL rows can hide. Use COALESCE if NULL means duplicate.
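A runnable sketch of the pattern using Python's bundled sqlite3 (table and column names are illustrative):

```python
import sqlite3

# In-memory table with one duplicated row (schema is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    ("ana", "a@x.com"), ("ana", "a@x.com"), ("bo", "b@x.com"),
])

# GROUP BY every column; HAVING keeps groups that occur more than once.
dupes = conn.execute("""
    SELECT name, email, COUNT(*) AS n
    FROM users
    GROUP BY name, email
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('ana', 'a@x.com', 2)]
```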
Q2 · L3

Find the second highest salary

Use DENSE_RANK to handle ties. SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employees) WHERE rk = 2. LIMIT 1 OFFSET 1 fails on ties.
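A minimal demonstration via sqlite3 (assumes a SQLite build with window functions, 3.25+; data is illustrative). With tied top salaries, DENSE_RANK still lands on the true second value where LIMIT 1 OFFSET 1 would not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 300), ("b", 300), ("c", 200), ("d", 100)])

# The tied 300s share rank 1, so rank 2 is the true second-highest, 200.
row = conn.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
        FROM employees
    ) WHERE rk = 2
""").fetchone()
print(row[0])  # 200
```

LIMIT 1 OFFSET 1 on this data would return 300 again, which is the trap the question checks for.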
Q3 · L4

Deduplicate keeping most recent per user

ROW_NUMBER OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn in CTE, filter rn = 1. Tie-break with secondary sort key. Better than DISTINCT for retaining all columns.
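A sketch of the CTE-plus-filter pattern in sqlite3 (window functions assumed; schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INT, val TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", 1, "old"), ("u1", 2, "new"), ("u2", 5, "only"),
])

# Rank rows per user, newest first; keep rank 1. Every column survives,
# which plain DISTINCT on user_id could not give you.
rows = conn.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    SELECT user_id, ts, val FROM ranked WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 2, 'new'), ('u2', 5, 'only')]
```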
Q4 · L4

Calculate month-over-month revenue growth %

DATE_TRUNC to month, SUM, then LAG. (current - previous) / NULLIF(previous, 0) * 100. NULLIF prevents division-by-zero on first month.
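The same logic in sqlite3, which has no DATE_TRUNC, so strftime('%Y-%m', ...) stands in for it (data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, rev REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2026-01-05", 100.0), ("2026-01-20", 100.0), ("2026-02-10", 300.0),
])

# NULLIF(prev, 0) turns a zero denominator into NULL instead of an error;
# the first month has no LAG value, so its growth is NULL.
rows = conn.execute("""
    WITH monthly AS (
        SELECT strftime('%Y-%m', order_date) AS month, SUM(rev) AS rev
        FROM orders GROUP BY 1
    )
    SELECT month,
           (rev - LAG(rev) OVER (ORDER BY month)) * 100.0
               / NULLIF(LAG(rev) OVER (ORDER BY month), 0) AS growth_pct
    FROM monthly
    ORDER BY month
""").fetchall()
print(rows)  # [('2026-01', None), ('2026-02', 50.0)]
```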
Q5 · L4

Users active for 3+ consecutive days

Gap-and-island: ROW_NUMBER per user, subtract from date. Same difference = same streak. GROUP BY user, streak_key, HAVING COUNT >= 3.
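A gap-and-island sketch in sqlite3, using julianday to do the date-minus-row-number arithmetic (schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id TEXT, day TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    ("u1", "2026-03-01"), ("u1", "2026-03-02"), ("u1", "2026-03-03"),
    ("u1", "2026-03-07"),  # breaks the streak
    ("u2", "2026-03-01"), ("u2", "2026-03-03"),
])

# Consecutive days share the same (day - row_number) value, so that
# difference is the streak key ("island").
rows = conn.execute("""
    WITH numbered AS (
        SELECT user_id,
               julianday(day) - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY day) AS streak_key
        FROM logins
    )
    SELECT user_id
    FROM numbered
    GROUP BY user_id, streak_key
    HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [('u1',)]
```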
Q6 · L4

Top N per group with ties

DENSE_RANK PARTITION BY group ORDER BY metric DESC, filter rk <= N. Explain why DENSE_RANK over RANK or ROW_NUMBER.
Q7 · L4

7-day rolling average of daily revenue

AVG(rev) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Mention partial-window edge case for first 6 rows before being asked.
Q8 · L4

Self-join: pairs of employees same manager

JOIN employees e1, e2 ON e1.mgr = e2.mgr AND e1.id < e2.id. Inequality condition prevents self-pairs and reverse duplicates.
Q9 · L4

Pivot rows to columns with conditional aggregation

SUM(CASE WHEN type = 'X' THEN val END) AS x_total. Discuss vs PIVOT operator (Snowflake, MSSQL) for portability.
Q10 · L4

EXISTS vs IN performance

EXISTS short-circuits on first match. IN materializes the subquery. EXISTS preferred for large or correlated subqueries; IN preferred for small literal lists.
Q11 · L5

Recursive CTE for org chart

Base case: WHERE id = root. Recursive: JOIN cte ON cte.id = emp.mgr. Add a depth column with WHERE depth < 20 to bound the recursion in case the data contains a cycle.
Q12 · L5

Sessionization with 30-min gap

LAG to get previous event. CASE WHEN gap > 30 min OR user changed THEN 1 ELSE 0 AS new_session. SUM new_session OVER (PARTITION BY user ORDER BY ts) AS session_id.
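A sessionization sketch in sqlite3 with timestamps as epoch seconds (a 30-minute gap is 1800s; schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INT)")  # ts in seconds
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("u1", 0), ("u1", 600), ("u1", 4000),  # 4000 - 600 exceeds the 1800s gap
    ("u2", 100),
])

# Flag a new session when the gap exceeds 30 min; a running SUM of the
# flags then numbers the sessions per user.
rows = conn.execute("""
    WITH flagged AS (
        SELECT user_id, ts,
               CASE WHEN ts - LAG(ts) OVER (
                        PARTITION BY user_id ORDER BY ts) > 1800
                    THEN 1 ELSE 0 END AS new_session
        FROM events
    )
    SELECT user_id, ts,
           SUM(new_session) OVER (
               PARTITION BY user_id ORDER BY ts) AS session_id
    FROM flagged
    ORDER BY user_id, ts
""").fetchall()
print(rows)
# [('u1', 0, 0), ('u1', 600, 0), ('u1', 4000, 1), ('u2', 100, 0)]
```

The PARTITION BY in both windows handles the user-changed case: the running sum restarts at 0 for each user.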
Q13 · L5

Median with PERCENTILE_CONT

PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val). Discuss vs PERCENTILE_DISC. Most dialects support it; some engines (Spark, Hive) steer you toward approximate percentiles, which are faster at scale.
Q14 · L5

Find users who did A then B within 7 days

Self-join on user_id, WHERE event_a.ts < event_b.ts AND event_b.ts <= event_a.ts + INTERVAL '7 days'. Discuss vs window-function approach.
Q15 · L5

Fill forward NULL values per user

LAST_VALUE(val IGNORE NULLS) OVER (PARTITION BY user ORDER BY ts ROWS UNBOUNDED PRECEDING). Some dialects need workaround for IGNORE NULLS.
Q16 · L5

Detect change-points in a time series

LAG to compare current to previous. CASE WHEN current != previous THEN 1 ELSE 0 END AS is_change. SUM(is_change) OVER (ORDER BY ts) gives a change-segment id.
Q17 · L5

EXPLAIN plan reading and predicate pushdown

Identify full table scan vs index seek. Explain why a function in WHERE prevents pushdown. Cover partition pruning and column-store implications.
Q18 · L5

Skew handling in JOINs

Identify hot keys with COUNT GROUP BY join key. Salt with mod-N suffix on both sides. Aggregate then unsalt. Trade-off: extra shuffle cost vs balanced workload.
Q19 · L5

ROWS vs RANGE in window frames

ROWS counts physical rows. RANGE groups ties (rows with equal ORDER BY value). Different result on duplicate timestamps. Always specify; defaults vary by dialect.
Q20 · L6

Materialized view vs result cache vs incremental table

MV: precomputed, refreshed on schedule. Result cache: query-string match, free. Incremental table: append-only with merge logic. Trade-off: freshness vs cost vs flexibility.

12 Python Questions

Vanilla Python preferred. Pandas only when allowed. Drill these without autocomplete to build muscle memory.

Q21 · L3

Group records by a key

defaultdict(list). Iterate, append by key. O(n) time and space. Beats sort-and-iterate for unsorted input.
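A minimal sketch (function name and sample records are illustrative):

```python
from collections import defaultdict

def group_by_key(records, key):
    """Group dicts by records[key] in one O(n) pass."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

rows = [{"city": "NY", "v": 1}, {"city": "SF", "v": 2}, {"city": "NY", "v": 3}]
print(group_by_key(rows, "city"))
# {'NY': [{'city': 'NY', 'v': 1}, {'city': 'NY', 'v': 3}],
#  'SF': [{'city': 'SF', 'v': 2}]}
```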
Q22 · L3

Read CSV with csv.DictReader

with open(path, encoding='utf-8') as f: reader = csv.DictReader(f). Iterate rows as dicts. Mention encoding (open() defaults to the locale encoding, so pass utf-8 explicitly) and the delimiter override.
Q23 · L4

Flatten nested JSON

Recurse on dict values. Concatenate keys with separator. Handle list values: explode or serialize, document the choice.
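A recursive sketch that serializes lists as-is (the separator and the list-handling choice are the assumptions to state out loud):

```python
def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with sep.
    Lists are left as values here; exploding them is the other option."""
    out = {}
    for k, v in obj.items():
        key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            out.update(flatten(v, key, sep))  # recurse and merge
        else:
            out[key] = v
    return out

print(flatten({"a": 1, "b": {"c": 2, "d": {"e": 3}}}))
# {'a': 1, 'b.c': 2, 'b.d.e': 3}
```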
Q24 · L4

Dedup by composite key, keep latest

Dict keyed on tuple. Update when newer ts. O(n) time, O(n) space. Better than sort-then-iterate for clarity.
Q25 · L4

Generator for chunked CSV reading

yield chunks of N rows. Memory stays constant regardless of file size. Mention pandas read_csv chunksize as production equivalent.
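A generator sketch using islice over a DictReader (the in-memory StringIO stands in for a real file):

```python
import csv
import io
from itertools import islice

def read_chunks(file_obj, size):
    """Yield lists of up to `size` parsed rows; memory stays O(size)."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

data = io.StringIO("id,val\n1,a\n2,b\n3,c\n")
chunks = list(read_chunks(data, 2))
print([len(c) for c in chunks])  # [2, 1]
```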
Q26 · L4

Inner join two lists of dicts on key

Build dict index on smaller list. Iterate larger list, look up by key. O(n+m) time, O(min(n,m)) space.
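A hash-join sketch (function name and sample data are illustrative; on key collision the probe-side record wins the dict merge, which is worth saying out loud):

```python
def inner_join(left, right, key):
    """Hash join: index the smaller side, probe with the larger."""
    small, big = (left, right) if len(left) <= len(right) else (right, left)
    index = {}
    for rec in small:
        index.setdefault(rec[key], []).append(rec)  # handle duplicate keys
    out = []
    for rec in big:
        for match in index.get(rec[key], []):
            out.append({**match, **rec})  # probe-side fields win collisions
    return out

users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bo"}]
orders = [{"id": 1, "total": 9}, {"id": 1, "total": 5}, {"id": 3, "total": 7}]
print(inner_join(users, orders, "id"))
# [{'id': 1, 'name': 'ana', 'total': 9}, {'id': 1, 'name': 'ana', 'total': 5}]
```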
Q27 · L4

Sessionize events with 30-min gap

Sort by user, ts. Walk list. Increment session_id when gap exceeds threshold or user changes. Edge case: events with same ts.
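A sketch of the walk (field names and the 1800s threshold are illustrative; the sort is stable, so same-timestamp events keep input order and share a session):

```python
def sessionize(events, gap_sec=1800):
    """Assign session ids; new session when user changes or gap > threshold."""
    events = sorted(events, key=lambda e: (e["user"], e["ts"]))
    session_id = -1
    prev = None
    out = []
    for e in events:
        if prev is None or e["user"] != prev["user"] \
                or e["ts"] - prev["ts"] > gap_sec:
            session_id += 1  # user boundary or gap: start a new session
        out.append({**e, "session_id": session_id})
        prev = e
    return out

evts = [{"user": "u1", "ts": 0}, {"user": "u1", "ts": 4000},
        {"user": "u2", "ts": 10}]
print([e["session_id"] for e in sessionize(evts)])  # [0, 1, 2]
```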
Q28 · L5

LRU cache from scratch

OrderedDict. On get: move_to_end. On put: insert, then popitem(last=False) if over capacity. Or implement with dict + doubly linked list for O(1) operations.
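The OrderedDict version as a sketch (the dict-plus-linked-list variant is the follow-up; class shape is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache on OrderedDict: most recently used sits at the right end."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)          # overwrite counts as a use
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a" so "b" becomes the eviction candidate
cache.put("c", 3)  # evicts "b"
print(cache.get("b"), cache.get("a"))  # None 1
```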
Q29 · L5

Parse log line with regex, handle malformed

Compile pattern once outside loop. Use named groups. Wrap in try/except, send malformed lines to dead-letter list, return both clean and malformed.
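A sketch with an Apache-ish pattern (the regex and log format are illustrative, not a production parser):

```python
import re

# Compiled once at module level, not per line; named groups document fields.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

def parse_lines(lines):
    """Return (parsed, dead_letter); malformed lines are kept, not dropped."""
    parsed, dead = [], []
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            parsed.append(m.groupdict())
        else:
            dead.append(line)  # dead-letter for inspection / replay
    return parsed, dead

good = '1.2.3.4 - - [01/Apr/2026:10:00:00] "GET / HTTP/1.1"'
ok, bad = parse_lines([good, "garbage line"])
print(len(ok), len(bad))  # 1 1
```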
Q30 · L5

Stream-merge sorted iterators

heapq.merge yields lazily. O(n log k) where k is number of iterators. Beats materializing all and sorting.
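A one-liner demonstration (inputs are illustrative):

```python
import heapq

a = iter([1, 4, 7])
b = iter([2, 5, 8])
c = iter([3, 6, 9])

# heapq.merge pulls one element at a time from each sorted iterator, so
# nothing is materialized: O(n log k) time, O(k) memory for k iterators.
merged = list(heapq.merge(a, b, c))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```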
Q31 · L5

Concurrent fetch with rate limit

asyncio + Semaphore(N). Bounds in-flight requests. Mention threading.Semaphore for sync code. Discuss why rate limiting matters for downstream services.
Q32 · L5

Pandas: SCD Type 2 merge logic

Identify rows where source differs from target. For changed rows: expire current (set valid_to, is_current=False), insert new (valid_from=now, is_current=True). Use merge with indicator.

10 Data Modeling Questions

Schema design and trade-off defense. Practice drawing on a whiteboard, narrating the grain first.

Q33 · L4

Star schema for e-commerce

Grain: one row per order line item. Fact: order_item_id, FKs, quantity, unit_price, total. Dims: customer (Type 1 + Type 2), product, date. Volunteer 3 trade-offs.
Q34 · L4

Define grain of a fact table

One sentence: 'one row per X per Y'. Determines whether the rest of the model is right. State before drawing.
Q35 · L4

Surrogate vs natural keys

Surrogate when natural key is unstable, when SCD Type 2 is needed, or when natural key is composite. Natural when stable, simple, human-readable for joins.
Q36 · L5

SCD Type 2 implementation

Surrogate, natural, valid_from, valid_to, is_current. Expire-and-insert pattern on change. Facts join on surrogate for point-in-time correctness.
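A sketch of the expire-and-insert step in plain Python (column names, the in-memory list of dicts, and the helper are all illustrative; a warehouse would do this with MERGE):

```python
from datetime import datetime, timezone

def scd2_upsert(dim_rows, incoming, natural_key, tracked_cols, now=None):
    """Expire-and-insert for SCD Type 2 on an in-memory dimension."""
    now = now or datetime.now(timezone.utc).isoformat()
    current = {r[natural_key]: r for r in dim_rows if r["is_current"]}
    for rec in incoming:
        cur = current.get(rec[natural_key])
        if cur and all(cur[c] == rec[c] for c in tracked_cols):
            continue                      # unchanged: nothing to do
        if cur:                           # changed: expire the current row
            cur["valid_to"] = now
            cur["is_current"] = False
        dim_rows.append({natural_key: rec[natural_key],
                         **{c: rec[c] for c in tracked_cols},
                         "valid_from": now, "valid_to": None,
                         "is_current": True})
    return dim_rows

dim = [{"cust_id": 1, "tier": "bronze", "valid_from": "2025-01-01",
        "valid_to": None, "is_current": True}]
dim = scd2_upsert(dim, [{"cust_id": 1, "tier": "gold"}],
                  "cust_id", ["tier"], now="2026-04-01")
print(len(dim), dim[0]["is_current"], dim[1]["tier"])  # 2 False gold
```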
Q37 · L5

Conformed dimensions across data marts

Same dim_customer schema across marts. Single source of truth. Avoids analyst confusion when joining across marts. Stating this without prompting is a senior signal.
Q38 · L5

Slowly changing fact (corrections)

Two patterns: append-only with version column (audit trail), or in-place update with audit log table. Trade-off: query simplicity vs storage cost vs auditability.
Q39 · L5

Bridge table for many-to-many

Two FKs, one to each dim. Optional weighting factor for fractional attribution. Common in healthcare (patient-diagnosis), retail (product-category), ad tech (impression-conversion).
Q40 · L5

Late-arriving dimensions

Insert placeholder dim row with surrogate key, mark is_late=true. Update when real row arrives. Or: backfill facts after dim arrives. Most candidates skip; raising it is a senior signal.
Q41 · L5

Medallion architecture trade-offs

Bronze: raw, immutable. Silver: cleaned, conformed. Gold: business-ready aggregates. Trade-off: storage cost (3x) vs query performance vs auditability vs reprocessing flexibility.
Q42 · L6

Data Vault 2.0 vs Kimball

Vault: hubs, links, satellites. Better for regulated industries needing audit trail. Kimball: star schema, simpler queries. Most analytics workloads pick Kimball; financial / pharma pick Vault.

5 System Design Questions

60-minute design rounds. Use the 4-step framework: clarify requirements, draw the architecture, narrate the data flow, walk the failure modes.

Q43 · L4

Daily ETL from Postgres to Snowflake

Debezium CDC -> Kafka -> S3 raw -> Spark daily ETL with run_id -> Snowflake MERGE. Idempotent, deterministic. Backfill via DAG params.
Q44 · L5

Real-time clickstream pipeline at 200K events/sec

Kafka 100 partitions, key=user_id -> Flink stateful, exactly-once, sessionize 30-min gap -> S3 + Materialize. Hourly Spark to Snowflake. Cover 3 failure modes.
Q45 · L5

Online + offline ML feature store

Real-time: Flink -> Redis (10ms reads). Batch: Spark -> S3 feature parquet -> Feast catalog. Training: as_of_join with feature_ts <= label_ts to prevent leakage.
Q46 · L5

Daily reconciliation pipeline for payments

Postgres -> Debezium -> Kafka -> S3 raw immutable -> idempotent Spark with run_id -> Snowflake MERGE on (txn_id, run_id). Audit by source_event_id.
Q47 · L6

Multi-region active-active data warehouse

Region-local writes, async cross-region replication via CDC stream. Conflict resolution: last-writer-wins or CRDT for counters. SLA tiers: real-time vs eventually consistent. Cost: 2x storage minimum.

3 Behavioral Questions

STAR-D format. Specific numbers required. End with a decision postmortem.

Q48 · L4-L5

Tell me about a project with measurable impact

Situation, Task, Action, Result with specific numbers (latency, dollars, hours, downstream consumers), then Decision postmortem: what you would do differently and why. The postmortem is the L5 signal.
Q49 · L4-L5

Tell me about a disagreement with a stakeholder

How you held a position with data, listened to the counter, changed your mind when warranted. Wrong answer: you have never been wrong. Right answer: a specific resolution and what it taught you.
Q50 · L5-L6

Tell me about a real failure

Real failure with real consequence. Root cause, what you changed in your process, and what you would tell someone facing the same setup today. Faux failures ('I worked too hard') are an instant downgrade.

How to Use the 50 Questions

Drill all 50 over 4 weeks. Speak the answers out loud. Time yourself: SQL medium under 12 min, hard under 20. Python medium under 15, hard under 25. Modeling under 10 minutes per schema. Design under 60 min per architecture.

Pair the questions with the round-specific deep guides: window functions and SQL patterns interviewers test, vanilla Python patterns interviewers test, star schema and SCD round prep, system design framework for data engineers, behavioral interview prep for Data Engineer. The deep guides explain the framework; this list gives you the practice volume.

Targeting a specific company? After drilling these 50, open the matching company guide: Stripe data engineering interview prep, Airbnb data engineering interview prep, Netflix data engineering interview prep, etc.

Data Engineer Interview Prep FAQ

Are these the only 50 questions I should prep?
If you have 4 weeks or less, yes. If you have more time, use this as the core set and expand to the top 100 question bank for additional coverage. Companies do ask questions outside any curated list, but the 50 here are the highest-leverage practice set.
Why only 5 system design questions?
System design is depth over breadth. Five well-prepared architectures (clickstream, daily ETL, ML feature store, reconciliation, multi-region) give you the patterns to handle most variants. Drilling 20 design problems shallowly is worse than drilling 5 deeply.
How do I know if my answer to a question is good enough?
Time it (medium SQL under 12 min). State the edge case unprompted. Volunteer one trade-off. Add a 'what changes at 100x scale' sentence. If your answer hits all four, it's interview-ready.
Should I memorize the SQL syntax or write from scratch each time?
Write from scratch each time. Memorized solutions break under follow-up questions. Practice the patterns until you can derive the SQL on the spot.
What if I see a question on this list in my interview?
Treat it like any other: restate, plan out loud, code while narrating, test, discuss edge cases. Do not signal that you've seen it before; that can be a downgrade signal in some loops.
How does this list compare to LeetCode or DataLemur?
LeetCode is algorithmic; this list is data engineering. DataLemur overlaps on SQL questions. The advantage of this list is the level tags and the focus on data-engineering patterns (sessionization, deduplication, idempotency) instead of algorithm puzzles.

Run the 50 Questions in the Browser

Reading the answers is step one. Run SQL and Python against real schemas in our sandbox to build the muscle memory that gets you the offer.

Start Practicing Now

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
