Top 100 Data Engineer Interview Questions
The hundred questions most likely to come up in a 2026 data engineering loop, each with a worked answer tight enough to fit in the few sentences you'd actually use in the interview. Pulled from verified interview reports and weighted by how often each one shows up. Pair with the complete data engineer interview preparation framework.
How the 100 are distributed
The split mirrors how often each domain shows up in a real loop. SQL gets the most questions because it appears in roughly nine of every ten data engineering loops, often in two separate rounds.
| Domain | Questions in this set | Presence in a typical loop |
|---|---|---|
| SQL | 40 | Two rounds at most companies |
| Python | 25 | One round, occasionally PySpark |
| Data Modeling | 20 | One round, the loop-decider |
| System Design | 10 | One pipeline-design round |
| Behavioral | 5 | One round, weighted at senior |
SQL: The First 40
SQL is the single most-tested domain. These 40 cover joins, aggregation, window functions, CTEs, recursive queries, optimization, and dialect-specific tricks.
INNER vs LEFT vs FULL OUTER JOIN
GROUP BY with HAVING vs WHERE
Find duplicate rows
Second highest salary
Count rows per group
Sort with NULLS FIRST or LAST
Deduplicate keeping latest per user
Month-over-month growth percentage
Users active 3+ consecutive days
Top N per group with ties
7-day rolling average
Self join: same-manager pairs
Pivot rows to columns
EXISTS vs IN performance
COALESCE vs CASE WHEN NULL
DATE_TRUNC vs EXTRACT vs DATE_PART
UNION vs UNION ALL
ANTI JOIN with NOT EXISTS
Find rows with duplicate composite key
Conditional aggregation by year
Recursive CTE for org chart
Sessionization with 30-min gap
Median with PERCENTILE_CONT
Funnel: A then B within 7 days
Forward-fill NULL per user
Detect change-points
EXPLAIN plan reading
Skew handling in JOINs
ROWS vs RANGE in window frames
QUALIFY for window-function filtering
MERGE / UPSERT for slowly-changing data
Pivot with dynamic columns
Lateral / CROSS APPLY for row-correlated subquery
Date dimension generation
Approximate count distinct (HLL)
JSON parsing in SQL
Array operations: UNNEST and ARRAY_AGG
Materialized view vs result cache vs incremental table
Time travel and zero-copy clones (Snowflake)
Iceberg vs Delta vs Hudi
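Several of the questions above, "Deduplicate keeping latest per user" among them, come down to the same window-function pattern: rank rows inside each partition, then keep rank 1. The following is a minimal sketch of that pattern, not a reference answer. It runs the SQL through Python's built-in `sqlite3` module so it can be executed as written; the table `user_events`, its columns, and the function name are hypothetical, and SQLite 3.25 or later is assumed for window-function support.

```python
import sqlite3


def latest_row_per_user(rows):
    """Return one (user_id, email, updated_at) row per user: the most recent."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this illustration.
    conn.execute(
        "CREATE TABLE user_events (user_id INTEGER, email TEXT, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO user_events VALUES (?, ?, ?)", rows)
    query = """
        WITH ranked AS (
            SELECT user_id, email, updated_at,
                   ROW_NUMBER() OVER (
                       PARTITION BY user_id
                       ORDER BY updated_at DESC
                   ) AS rn
            FROM user_events
        )
        SELECT user_id, email, updated_at
        FROM ranked
        WHERE rn = 1
        ORDER BY user_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [
    (1, "a@old.example", "2026-01-01"),
    (1, "a@new.example", "2026-02-01"),
    (2, "b@example.com", "2026-01-15"),
]
print(latest_row_per_user(sample))
# [(1, 'a@new.example', '2026-02-01'), (2, 'b@example.com', '2026-01-15')]
```

If two rows for the same user share an `updated_at` value, `ROW_NUMBER()` picks one arbitrarily. In an interview it is worth saying so and adding a tiebreaker column to the `ORDER BY`. On warehouses that support `QUALIFY`, the CTE can be collapsed into a single `SELECT`.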
Python: 25 More Questions
Beyond the top 50 Python questions, drill these for L5+ depth on data wrangling, generators, and pandas patterns.
Group records by key
CSV reading with DictReader
Flatten nested JSON
Dedup by composite key, latest
Generator for chunked CSV
Inner join two lists of dicts
Sessionize with 30-min gap
Counter for top-N frequencies
itertools.groupby for run-length encoding
functools.reduce for accumulation
LRU cache from scratch
Parse log line with regex, handle malformed
Stream-merge sorted iterators
Concurrent fetch with rate limit
Pandas SCD Type 2 merge
Pandas pivot_table with aggfunc and fill_value
Pandas window operations: rolling and expanding
Pandas merge_asof for time-aligned join
Pandas chunked groupby for large data
Type hints with TypedDict and dataclasses
Context manager with __enter__ and __exit__
Custom exception with chained context
Multiprocessing vs threading for I/O vs CPU
Cython, Numba, or Polars for performance
Property-based testing with Hypothesis
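As one worked instance from the list above, "Sessionize with 30-min gap" can be sketched in vanilla Python. This is an illustrative sketch under stated assumptions: events arrive as `(user_id, timestamp)` tuples, a gap of exactly 30 minutes stays in the same session, and the function name `sessionize` is hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts when the time since the user's previous event
    is strictly greater than `gap`. Returns {user_id: [[ts, ...], ...]}.
    """
    sessions = {}
    # Sorting by (user_id, timestamp) puts each user's events in time order.
    for user_id, ts in sorted(events):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continue the open session
        else:
            user_sessions.append([ts])  # first event, or gap exceeded
    return sessions


events = [
    ("u1", datetime(2026, 1, 1, 9, 0)),
    ("u1", datetime(2026, 1, 1, 9, 20)),
    ("u1", datetime(2026, 1, 1, 10, 0)),
    ("u2", datetime(2026, 1, 1, 9, 5)),
]
result = sessionize(events)
print({user: len(user_sessions) for user, user_sessions in result.items()})
# {'u1': 2, 'u2': 1}
```

The sort makes this O(n log n) and holds everything in memory. If the interviewer says the input is already ordered per user, or too large to sort, the same comparison against the previous timestamp works in a streaming loop.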
Data Modeling: 20 Questions
Schema design, SCD, conformed dimensions, and modern lakehouse patterns. Practice drawing on a whiteboard while narrating, and state the grain first.
Star schema for e-commerce
Define grain of fact table
Surrogate vs natural keys
Fact vs dimension classification
Star vs snowflake schema
Conformed dimension across marts
SCD Type 1 vs Type 2 vs Type 3
SCD Type 2 implementation
Slowly changing facts (corrections)
Bridge table for many-to-many
Late-arriving dimensions
Late-arriving facts
Medallion architecture trade-offs
Iceberg vs Delta time-travel for SCD
Partitioning strategy for fact tables
Clustering keys (Snowflake) and Z-ordering (Delta)
Schema evolution: adding nullable column
Wide table vs star schema for analytics
Data Vault 2.0 vs Kimball
Multi-region data model with conflict resolution
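For "SCD Type 2 implementation" in the list above, the core mechanic is to close the current row and append a new version instead of overwriting. A minimal sketch in plain Python follows. The column names (`customer_id`, `city`, `valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are hypothetical, and a real warehouse would typically express this as a `MERGE` with a surrogate key.

```python
from datetime import date


def apply_scd2(dimension, incoming, effective):
    """Apply an SCD Type 2 update to a list of dimension rows, in place.

    dimension: list of dicts with keys customer_id, city, valid_from,
               valid_to, is_current (hypothetical column names).
    incoming:  dict mapping customer_id -> city as currently seen in the source.
    effective: date on which the new values take effect.
    """
    current = {r["customer_id"]: r for r in dimension if r["is_current"]}
    for customer_id, city in incoming.items():
        existing = current.get(customer_id)
        if existing is not None and existing["city"] == city:
            continue  # attribute unchanged: keep the open version as-is
        if existing is not None:
            # Close the old version rather than overwriting it.
            existing["valid_to"] = effective
            existing["is_current"] = False
        dimension.append({
            "customer_id": customer_id,
            "city": city,
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        })
    return dimension


dim = [{"customer_id": 1, "city": "Austin", "valid_from": date(2025, 1, 1),
        "valid_to": None, "is_current": True}]
dim = apply_scd2(dim, {1: "Denver", 2: "Boston"}, date(2026, 3, 1))
for row in dim:
    print(row["customer_id"], row["city"], row["valid_to"], row["is_current"])
# 1 Austin 2026-03-01 False
# 1 Denver None True
# 2 Boston None True
```

Re-running the function with the same `incoming` values adds no rows, which is the idempotency property interviewers often probe. Whether `valid_to` is inclusive or exclusive is a design choice worth stating out loud; this sketch treats it as exclusive.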
System Design: 10 Architectures
Use the 4-step framework: clarify, draw, narrate, fail. Budget 60 minutes per architecture when practicing.
Daily ETL Postgres -> Snowflake
Real-time clickstream at 200K events/sec
Online + offline ML feature store
Daily reconciliation for payments
A/B test analysis pipeline
Recommendation feature pipeline
Search index pipeline
Multi-tenant data warehouse with row-level security
Multi-region active-active warehouse
Cost-optimized lakehouse with tiered storage
Behavioral: 5 stories you should already have ready
Five stories covering the five evergreen themes. Use STAR (situation, task, action, result). Specific numbers required. The senior signal is ending with what you'd do differently.
Project with measurable impact
Disagreement with stakeholder
Real failure with consequences
Project with ambiguous requirements
Leading without authority or mentoring
How to use this list
Drill in the order the loop runs them. SQL first because it's the round you'll definitely face. Modeling second because it's the round that decides most loops and rewards spaced practice. Python third because it's the most compressible if you're a working engineer. System design fourth, behavioral fifth. Three to six weeks of two-hour sessions is the realistic timeline; longer if you're starting cold, shorter if you've interviewed in the last year.
Pair each section with the round-level walkthrough for context: how to pass the SQL round, how to pass the Python round, how to pass the data modeling round, how to pass the system design round, how to pass the behavioral round. Targeting FAANG specifically? After these 100, open FAANG Data Engineer interview questions and answers for FAANG-tagged variants.
Know the patterns before the interviewer asks them.
Data engineer interview prep FAQ
How is this list different from the top 50?
Are all 100 questions answered in full?
How long should I take to work through all 100?
Can I skip the behavioral section if I am focused on technical?
What if I see a question on this list in my interview?
Does this cover analytics engineer questions?
Run the 100 in the practice harness
- 01: Active recall beats re-reading by 50%
  Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02: 76% of hiring managers reject on the coding task, not the resume
  From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03: Five problem shapes cover 80% of data engineer loops
  Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
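Of the five shapes named above, top-N-per-group is the one where ties most often trip candidates up. A minimal sketch follows, again run through Python's built-in `sqlite3` module (SQLite 3.25 or later assumed). The `salaries` table and the function name are hypothetical. `DENSE_RANK()` is used so that tied values share a rank and all of them are returned.

```python
import sqlite3


def top_n_per_group(rows, n):
    """Return rows whose salary is among the top `n` distinct salaries per dept."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this illustration.
    conn.execute("CREATE TABLE salaries (dept TEXT, employee TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", rows)
    query = """
        SELECT dept, employee, salary
        FROM (
            SELECT dept, employee, salary,
                   DENSE_RANK() OVER (
                       PARTITION BY dept
                       ORDER BY salary DESC
                   ) AS rnk
            FROM salaries
        )
        WHERE rnk <= ?
        ORDER BY dept, salary DESC, employee
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result


rows = [
    ("eng", "ana", 200), ("eng", "bo", 200), ("eng", "cy", 150), ("eng", "di", 100),
    ("ops", "ed", 90), ("ops", "fay", 80),
]
print(top_n_per_group(rows, 2))
# [('eng', 'ana', 200), ('eng', 'bo', 200), ('eng', 'cy', 150),
#  ('ops', 'ed', 90), ('ops', 'fay', 80)]
```

Swapping `DENSE_RANK()` for `ROW_NUMBER()` would return exactly `n` rows per group and break ties arbitrarily, while `RANK()` would leave gaps after ties. Naming which behavior the question wants is usually part of the answer.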