Top 100 Data Engineer Interview Questions
Question Distribution Across the 100
Distribution mirrors interview-loop frequency. SQL dominates at 40%, reflecting its presence in 95% of loops.
| Domain | Count | Drill Time |
|---|---|---|
| SQL | 40 | 16 hours |
| Python | 25 | 10 hours |
| Data Modeling | 20 | 8 hours |
| System Design | 10 | 10 hours |
| Behavioral | 5 | 3 hours |
SQL: The First 40
SQL is the single most-tested domain. These 40 cover joins, aggregation, window functions, CTEs, recursive queries, optimization, and dialect-specific tricks.
INNER vs LEFT vs FULL OUTER JOIN
GROUP BY with HAVING vs WHERE
Find duplicate rows
Second highest salary
Count rows per group
Sort with NULLS FIRST or LAST
Deduplicate keeping latest per user
Month-over-month growth percentage
Users active 3+ consecutive days
Top N per group with ties
7-day rolling average
Self join: same-manager pairs
Pivot rows to columns
EXISTS vs IN performance
COALESCE vs CASE WHEN NULL
DATE_TRUNC vs EXTRACT vs DATE_PART
UNION vs UNION ALL
ANTI JOIN with NOT EXISTS
Find rows with duplicate composite key
Conditional aggregation by year
Recursive CTE for org chart
Sessionization with 30-min gap
Median with PERCENTILE_CONT
Funnel: A then B within 7 days
Forward-fill NULL per user
Detect change-points
EXPLAIN plan reading
Skew handling in JOINs
ROWS vs RANGE in window frames
QUALIFY for window-function filtering
MERGE / UPSERT for slowly-changing data
Pivot with dynamic columns
Lateral / CROSS APPLY for row-correlated subquery
Date dimension generation
Approximate count distinct (HLL)
JSON parsing in SQL
Array operations: UNNEST and ARRAY_AGG
Materialized view vs result cache vs incremental table
Time travel and zero-copy clones (Snowflake)
Iceberg vs Delta vs Hudi
Python: 25 More Questions
Beyond the top 50 Python questions, drill these for L5+ depth on data wrangling, generators, and pandas patterns.
Group records by key
CSV reading with DictReader
Flatten nested JSON
Dedup by composite key, latest
Generator for chunked CSV
Inner join two lists of dicts
Sessionize with 30-min gap
Counter for top-N frequencies
Itertools.groupby for run-length encoding
Functools.reduce for accumulation
LRU cache from scratch
Parse log line with regex, handle malformed
Stream-merge sorted iterators
Concurrent fetch with rate limit
Pandas SCD Type 2 merge
Pandas pivot_table with aggfunc and fill_value
Pandas window operations: rolling and expanding
Pandas merge_asof for time-aligned join
Pandas chunked groupby for large data
Type hints with TypedDict and dataclasses
Context manager with __enter__ and __exit__
Custom exception with chained context
Multiprocessing vs threading for I/O vs CPU
Cython, numba, or polars for performance
Property-based testing with hypothesis
Data Modeling: 20 Questions
Schema design, SCD, conformed dimensions, and modern lakehouse patterns. Practice drawing on a whiteboard while narrating the grain first.
Star schema for e-commerce
Define grain of fact table
Surrogate vs natural keys
Fact vs dimension classification
Star vs snowflake schema
Conformed dimension across marts
SCD Type 1 vs Type 2 vs Type 3
SCD Type 2 implementation
Slowly changing facts (corrections)
Bridge table for many-to-many
Late-arriving dimensions
Late-arriving facts
Medallion architecture trade-offs
Iceberg vs Delta time-travel for SCD
Partitioning strategy for fact tables
Clustering keys (Snowflake) and Z-ordering (Delta)
Schema evolution: adding nullable column
Wide table vs star schema for analytics
Data Vault 2.0 vs Kimball
Multi-region data model with conflict resolution
System Design: 10 Architectures
Use the 4-step framework: clarify, draw, narrate, fail. 60 minutes per architecture in practice.
Daily ETL Postgres -> Snowflake
Real-time clickstream at 200K events/sec
Online + offline ML feature store
Daily reconciliation for payments
A/B test analysis pipeline
Recommendation feature pipeline
Search index pipeline
Multi-tenant data warehouse with row-level security
Multi-region active-active warehouse
Cost-optimized lakehouse with tiered storage
Behavioral: 5 STAR-D Stories
Prepare 5 stories covering the 5 evergreen themes. Specific numbers required. End each with a decision postmortem.
Project with measurable impact
Disagreement with stakeholder
Real failure with consequences
Project with ambiguous requirements
Leading without authority or mentoring
How to Use the 100
Drill in domain order: SQL (16h), Python (10h), modeling (8h), design (10h), behavioral (3h). Total: 47 hours of focused practice. At 2 hours per day, that's 4 weeks.
Pair with the round deep guides for context: how to pass the SQL round, how to pass the Python round, how to pass the data modeling round, how to pass the system design round, how to pass the behavioral round.
Targeting FAANG specifically? After drilling these 100, open FAANG Data Engineer interview questions and answers for FAANG-tagged variants.
Data engineer interview prep FAQ
How is this list different from the top 50?+
Are all 100 questions answered in full?+
How long should I take to drill all 100?+
Can I skip the behavioral section if I'm focused on technical?+
What if I see a question on this list in my interview?+
Does this cover analytics engineer questions?+
Practice the 100 Questions in the Browser
Reading the answers is step one. Run SQL and Python in our sandbox to build the muscle memory that wins the offer.
Adjacent Data Engineer Interview Prep Reading
More data engineer interview prep guides
Free downloadable PDF of 100+ data engineer interview questions and answers, updated 2026.
The 50 most frequently asked data engineer interview questions, with worked answers.
Real questions from Meta, Amazon, Apple, Netflix, and Google Data Engineer loops, with answers.
Real take-home prompts from Stripe, Airbnb, Databricks, with annotated example solutions.
Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.