The Definitive Bank

Top 100 Data Engineer Interview Questions

The 100 highest-frequency data engineer interview questions in 2026, with worked answers and the frequency tag (how often each question appears in our dataset). Selected from 1,042 verified interview reports plus 2,400 mock interviews. Each question links to the deeper round-specific guide for context. This is the bank to drill if you have 6+ weeks to prep, paired with the complete data engineer interview preparation framework.

The Short Answer
The full 100 questions are organized by domain and difficulty below. Drill order matters: SQL first (it gates the rest of the loop), then Python (the second-most-common live coding format), then modeling (the L4-to-L5 differentiator), then design (the L5+ ceiling). If you only have time for 50, start with the curated 50 Data Engineer interview questions bank, which is the top 50 of these 100. The full PDF version is at the downloadable Data Engineer interview questions PDF.
Updated April 2026 · By The DataDriven Team

Question Distribution Across the 100

Distribution mirrors interview-loop frequency. SQL dominates at 40%, reflecting its presence in 95% of loops.

Domain        | Count | Drill Time
SQL           | 40    | 16 hours
Python        | 25    | 10 hours
Data Modeling | 20    | 8 hours
System Design | 10    | 10 hours
Behavioral    | 5     | 3 hours

SQL: The First 40

SQL is the single most-tested domain. These 40 cover joins, aggregation, window functions, CTEs, recursive queries, optimization, and dialect-specific tricks.

Q1 · L3

INNER vs LEFT vs FULL OUTER JOIN

INNER returns matched rows. LEFT keeps all left rows, NULL on no match. FULL keeps all rows on both sides. Most candidates know this; fluent candidates volunteer the row-count expectations.
Q2 · L3

GROUP BY with HAVING vs WHERE

WHERE filters rows before aggregation. HAVING filters groups after. WHERE supports indexes; HAVING does not. Move conditions to WHERE when they don't reference aggregates.
Q3 · L3

Find duplicate rows

GROUP BY all columns, HAVING COUNT(*) > 1. NULL caveat: GROUP BY treats NULLs as equal, but equality-based dedup (self-join, WHERE a.col = b.col) does not, since NULL = NULL is unknown, so duplicates containing NULL hide. COALESCE when NULL should mean "same value".
Q4 · L3

Second highest salary

DENSE_RANK to handle ties. SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM employees) ranked WHERE rk = 2.
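A runnable sketch of this pattern using SQLite's window functions (available in Python's bundled sqlite3 since SQLite 3.25). The employees table and its values are hypothetical, chosen to show why DENSE_RANK handles ties where ROW_NUMBER would not.

```python
import sqlite3

# Hypothetical employees table: two people tie at the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("a", 300), ("b", 500), ("c", 500), ("d", 400)],
)

# DENSE_RANK collapses ties, so the two 500s share rank 1 and 400 is rank 2.
row = conn.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
        FROM employees
    ) ranked
    WHERE rk = 2
    LIMIT 1
""").fetchone()
print(row[0])  # 400
```

With ROW_NUMBER instead, rank 2 would land on the second of the tied 500s, returning the wrong answer.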
Q5 · L3

Count rows per group

GROUP BY group_col, SELECT COUNT(*). For unique counts: COUNT(DISTINCT col). Beware: COUNT(col) ignores NULL, COUNT(*) does not.
Q6 · L3

Sort with NULLS FIRST or LAST

ORDER BY col NULLS LAST. Postgres and Snowflake support directly. MySQL needs ORDER BY col IS NULL, col.
Q7 · L4

Deduplicate keeping latest per user

ROW_NUMBER PARTITION BY user_id ORDER BY ts DESC, filter rn = 1. Better than DISTINCT for keeping all columns.
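A minimal sketch of the ROW_NUMBER dedup, again on SQLite; the events table, its columns, and its values are illustrative assumptions.

```python
import sqlite3

# Hypothetical events table with an older and a newer row for u1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", 1, "old"), ("u1", 2, "new"), ("u2", 5, "only")],
)

# rn = 1 picks the latest row per user while keeping every column,
# which DISTINCT cannot do.
rows = conn.execute("""
    SELECT user_id, payload FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY ts DESC
        ) AS rn
        FROM events
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 'new'), ('u2', 'only')]
```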
Q8 · L4

Month-over-month growth percentage

DATE_TRUNC, SUM, LAG. (current - previous) / NULLIF(previous, 0) * 100. Volunteer NULLIF for first month.
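The LAG-plus-NULLIF pattern, sketched on SQLite with months pre-aggregated to one row each (so DATE_TRUNC is already done); the monthly_revenue table is a hypothetical stand-in.

```python
import sqlite3

# Hypothetical pre-aggregated monthly revenue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, rev REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 150.0), ("2024-03", 120.0)],
)

# NULLIF guards both the first month (LAG is NULL) and any zero-revenue month.
rows = conn.execute("""
    SELECT month,
           (rev - LAG(rev) OVER (ORDER BY month))
             / NULLIF(LAG(rev) OVER (ORDER BY month), 0) * 100 AS growth_pct
    FROM monthly_revenue
    ORDER BY month
""").fetchall()
print(rows)
```

The first month comes back NULL rather than erroring, which is the edge case worth volunteering.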
Q9 · L4

Users active 3+ consecutive days

Gap-and-island. ROW_NUMBER per user minus date. Same diff = same streak. GROUP BY user, streak_key, HAVING COUNT >= 3.
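A gap-and-island sketch on SQLite. To keep the date arithmetic dialect-neutral, day is an integer offset rather than a real date; the activity table and its values are hypothetical.

```python
import sqlite3

# u1 is active days 1-3 (a 3-day streak) then day 7; u2 has no streak.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [("u1", 1), ("u1", 2), ("u1", 3), ("u1", 7), ("u2", 4), ("u2", 6)],
)

# day - ROW_NUMBER() is constant within a run of consecutive days,
# so grouping on it isolates each streak.
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS streak_len FROM (
        SELECT user_id, day,
               day - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY day
               ) AS streak_key
        FROM activity
    )
    GROUP BY user_id, streak_key
    HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [('u1', 3)]
```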
Q10 · L4

Top N per group with ties

DENSE_RANK PARTITION BY, filter rk <= N. Explain DENSE_RANK over RANK over ROW_NUMBER.
Q11 · L4

7-day rolling average

AVG OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Volunteer partial-window edge case.
Q12 · L4

Self join: same-manager pairs

JOIN e1, e2 ON e1.mgr = e2.mgr AND e1.id < e2.id. Inequality eliminates self-pairs and reverse duplicates.
Q13 · L4

Pivot rows to columns

SUM(CASE WHEN type = X THEN val) AS x_total. Discuss vs PIVOT operator for portability.
Q14 · L4

EXISTS vs IN performance

EXISTS short-circuits. IN materializes. EXISTS for large or correlated; IN for small literals.
Q15 · L4

COALESCE vs CASE WHEN NULL

COALESCE returns first non-null. Cleaner than CASE for default-value substitution. Some dialects (Spark) prefer NVL or IFNULL.
Q16 · L4

DATE_TRUNC vs EXTRACT vs DATE_PART

TRUNC zeros lower units (returns date). EXTRACT pulls a number (year, month). PART aliases vary by dialect. State the dialect before answering.
Q17 · L4

UNION vs UNION ALL

UNION dedupes (expensive sort). UNION ALL keeps duplicates (no sort). Use ALL unless dedup is required.
Q18 · L4

ANTI JOIN with NOT EXISTS

SELECT * FROM left WHERE NOT EXISTS (SELECT 1 FROM right WHERE join). Faster than LEFT JOIN + WHERE right.col IS NULL on most engines.
Q19 · L4

Find rows with duplicate composite key

GROUP BY composite key, HAVING COUNT > 1. To return the duplicate rows themselves: WHERE (key1, key2) IN (SELECT ... HAVING COUNT > 1).
Q20 · L4

Conditional aggregation by year

SUM(CASE WHEN year = 2024 THEN rev END). Cleaner than separate WHERE-filtered subqueries when you need multiple year totals.
Q21 · L5

Recursive CTE for org chart

Base case: WHERE id = root. Recursive: JOIN cte ON cte.id = emp.mgr. Add depth column with WHERE depth < 20 to prevent cycles.
Q22 · L5

Sessionization with 30-min gap

LAG to get previous event ts. CASE WHEN gap > 30 min OR user changed THEN 1 ELSE 0 AS new_session. SUM new_session OVER PARTITION BY user gives session_id.
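The same flag-then-running-sum idea, sketched end to end on SQLite. Timestamps are epoch seconds so the 30-minute gap is 1800; the events table and its values are illustrative assumptions.

```python
import sqlite3

# u1 has two events 10 minutes apart, then one 40 minutes later (new session).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 0), ("u1", 600), ("u1", 3000), ("u2", 100)],
)

# new_session flags the first event of each session; a running SUM turns the
# flags into a per-user session_id. LAG is NULL on a user's first event, so
# COALESCE to a sentinel forces that event to open session 1.
rows = conn.execute("""
    SELECT user_id, ts,
           SUM(new_session) OVER (
               PARTITION BY user_id ORDER BY ts
           ) AS session_id
    FROM (
        SELECT user_id, ts,
               CASE WHEN ts - COALESCE(LAG(ts) OVER (
                        PARTITION BY user_id ORDER BY ts
                    ), -100000) > 1800
                    THEN 1 ELSE 0 END AS new_session
        FROM events
    )
    ORDER BY user_id, ts
""").fetchall()
print(rows)  # [('u1', 0, 1), ('u1', 600, 1), ('u1', 3000, 2), ('u2', 100, 1)]
```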
Q23 · L5

Median with PERCENTILE_CONT

PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val). Discuss PERCENTILE_DISC vs CONT. For huge data, mention APPROX_PERCENTILE.
Q24 · L5

Funnel: A then B within 7 days

Self-join on user, WHERE event_a.ts < event_b.ts AND event_b.ts <= event_a.ts + 7 days. Discuss vs window-function approach.
Q25 · L5

Forward-fill NULL per user

LAST_VALUE(val IGNORE NULLS) OVER (PARTITION BY user ORDER BY ts ROWS UNBOUNDED PRECEDING). Some dialects lack IGNORE NULLS and need a workaround.
Q26 · L5

Detect change-points

LAG to compare current to previous. CASE WHEN current != previous THEN 1 AS is_change. SUM is_change OVER ORDER BY ts gives change-id.
Q27 · L5

EXPLAIN plan reading

Identify full table scan vs index seek. Function in WHERE prevents pushdown. Cover partition pruning, broadcast vs shuffle joins.
Q28 · L5

Skew handling in JOINs

Identify hot keys with COUNT GROUP BY. Salt with mod-N suffix on both sides. Aggregate, then unsalt. Trade-off: extra shuffle cost.
Q29 · L5

ROWS vs RANGE in window frames

ROWS: physical rows. RANGE: groups ties. Different result on duplicate timestamps. Always specify; defaults vary.
Q30 · L5

QUALIFY for window-function filtering

Snowflake/BigQuery shortcut. SELECT ... QUALIFY ROW_NUMBER() OVER (...) = 1. Replaces the CTE-and-filter pattern. Mention dialect support.
Q31 · L5

MERGE / UPSERT for slowly-changing data

MERGE target USING source ON key WHEN MATCHED AND data differs THEN UPDATE WHEN NOT MATCHED THEN INSERT. Postgres uses INSERT ON CONFLICT.
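SQLite shares Postgres's INSERT ... ON CONFLICT syntax (SQLite 3.24+), so it can stand in for the Postgres branch of this answer. The dim_user table and values are hypothetical.

```python
import sqlite3

# Hypothetical target dimension with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_user (user_id TEXT PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO dim_user VALUES ('u1', 'old@example.com')")

# Upsert: update on key collision, insert otherwise. 'excluded' refers to
# the row that would have been inserted.
conn.executemany("""
    INSERT INTO dim_user (user_id, email) VALUES (?, ?)
    ON CONFLICT (user_id) DO UPDATE SET email = excluded.email
""", [("u1", "new@example.com"), ("u2", "u2@example.com")])

rows = conn.execute("SELECT * FROM dim_user ORDER BY user_id").fetchall()
print(rows)  # [('u1', 'new@example.com'), ('u2', 'u2@example.com')]
```

A full MERGE additionally supports the WHEN MATCHED AND data differs guard, which skips no-op updates; ON CONFLICT can approximate it with a WHERE clause on the DO UPDATE.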
Q32 · L5

Pivot with dynamic columns

Most engines need static column list. For truly dynamic: build SQL string, EXECUTE IMMEDIATE. Snowflake PIVOT supports ANY ORDER BY for dynamic.
Q33 · L5

Lateral / CROSS APPLY for row-correlated subquery

LATERAL JOIN evaluates subquery per row. Useful for top-N-per-group when window functions are awkward. CROSS APPLY in MSSQL.
Q34 · L5

Date dimension generation

Generate from a CTE: SELECT DATE '2020-01-01' + n FROM generate_series(0, 3650) AS t(n). For Snowflake: GENERATOR + ROW_NUMBER. Used to fill date gaps in reports.
Q35 · L5

Approximate count distinct (HLL)

APPROX_COUNT_DISTINCT (Snowflake), HLL_COUNT (BigQuery). Constant memory regardless of cardinality. ~2% error. Use when exact count is not required.
Q36 · L5

JSON parsing in SQL

Postgres ->, ->>, jsonb. BigQuery JSON_EXTRACT. Snowflake :, get_path. State dialect first. Mention shredding strategy for analytics on JSON-heavy tables.
Q37 · L5

Array operations: UNNEST and ARRAY_AGG

UNNEST explodes array to rows. ARRAY_AGG aggregates rows to array. Common in BigQuery and Postgres for one-to-many flattening.
Q38 · L6

Materialized view vs result cache vs incremental table

MV: precomputed, scheduled refresh. Cache: query-string match, free. Incremental: append-only with merge. Trade-off: freshness vs cost vs flexibility.
Q39 · L6

Time travel and zero-copy clones (Snowflake)

Time travel: query historical state via AT(TIMESTAMP => ...). Zero-copy clone: instant snapshot with copy-on-write storage. Used for non-prod testing without duplicating storage cost.
Q40 · L6

Iceberg vs Delta vs Hudi

All ACID over object storage. Iceberg: open spec, multi-engine. Delta: Databricks-native, best Spark integration. Hudi: best CDC support. Most companies converge on Iceberg in 2026.

Python: 25 More Questions

Beyond the top 50 Python questions, drill these for L5+ depth on data wrangling, generators, and pandas patterns.

Q41 · L4

Group records by key

defaultdict(list). Iterate, append by key. O(n) time and space. Beats sort-and-iterate for unsorted input.
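A minimal sketch of the defaultdict approach; the record shape and key name are illustrative assumptions.

```python
from collections import defaultdict

def group_by_key(records, key):
    """Group a list of dicts by one field in O(n) time and space."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]
grouped = group_by_key(rows, "user")
print(grouped["a"])  # [{'user': 'a', 'amount': 10}, {'user': 'a', 'amount': 7}]
```

One pass, no sort, and insertion order within each group is preserved.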
Q42 · L4

CSV reading with DictReader

with open(path) as f: csv.DictReader(f). Iterate rows as dicts. Mention encoding, delimiter, quotechar overrides.
Q43 · L4

Flatten nested JSON

Recurse on dict values. Concatenate keys with separator. Handle list values: explode or serialize, document the choice.
Q44 · L4

Dedup by composite key, latest

Dict keyed on tuple. Update when newer ts. O(n) time and space.
Q45 · L4

Generator for chunked CSV

yield chunks of N rows. Memory stays constant. Mention pandas read_csv chunksize.
Q46 · L4

Inner join two lists of dicts

Build dict index on smaller list. Iterate larger, lookup by key. O(n+m) time.
Q47 · L4

Sessionize with 30-min gap

Sort by user, ts. Walk list. Increment session_id when gap > threshold or user changes.
Q48 · L4

Counter for top-N frequencies

from collections import Counter. Counter(items).most_common(N). One-liner; mention itertools.groupby alternative.
Q49 · L4

Itertools.groupby for run-length encoding

groupby groups consecutive elements only, which is exactly what run-length encoding needs, so do not sort: [(k, len(list(g))) for k, g in groupby(seq)]. Sort first only when you want total frequencies rather than runs.
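A one-function sketch of the pattern:

```python
from itertools import groupby

def rle(seq):
    """Run-length encode: groupby groups adjacent equal items only,
    so repeated values separated by a gap form separate runs."""
    return [(k, len(list(g))) for k, g in groupby(seq)]

print(rle("aaabbaa"))  # [('a', 3), ('b', 2), ('a', 2)]
```

Note the two separate 'a' runs: sorting first would merge them, giving frequencies instead of runs.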
Q50 · L4

Functools.reduce for accumulation

reduce(lambda acc, x: acc + x, seq, 0). Used for fold operations. Often clearer to use sum / explicit loop.
Q51 · L5

LRU cache from scratch

OrderedDict. On get: move_to_end. On put: insert, popitem(last=False) if over capacity. Or dict + doubly linked list for O(1).
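A minimal OrderedDict sketch of the cache; class and method names are the conventional ones, not from any specific interview.

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache: OrderedDict tracks recency order, giving O(1) get/put."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark key as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touching 'a' makes 'b' the eviction candidate
cache.put("c", 3)    # capacity exceeded: evicts 'b'
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

The dict-plus-doubly-linked-list variant does the same thing by hand; mentioning that OrderedDict is itself backed by a linked list is a nice follow-up answer.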
Q52 · L5

Parse log line with regex, handle malformed

Compile pattern outside loop. Named groups. try/except, dead-letter list for malformed.
Q53 · L5

Stream-merge sorted iterators

heapq.merge yields lazily. O(n log k). Beats materialize-then-sort for k iterators.
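A short sketch; the three sorted inputs are arbitrary examples.

```python
import heapq

# heapq.merge consumes the k sorted inputs lazily: O(n log k) total work,
# O(k) memory, versus materializing everything and sorting at O(n log n).
a = iter([1, 4, 9])
b = iter([2, 3, 10])
c = iter([0, 5])

merged = list(heapq.merge(a, b, c))
print(merged)  # [0, 1, 2, 3, 4, 5, 9, 10]
```

In a pipeline context the inputs would be file or network iterators, and you would consume merge's output lazily instead of calling list.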
Q54 · L5

Concurrent fetch with rate limit

asyncio + Semaphore(N). Bounds in-flight requests. Discuss why rate limiting matters for downstream services.
Q55 · L5

Pandas SCD Type 2 merge

Identify rows where source differs from target. Expire current, insert new. Use pd.merge with indicator.
Q56 · L5

Pandas pivot_table with aggfunc and fill_value

pd.pivot_table(df, index, columns, values, aggfunc='sum', fill_value=0). Cleaner than .groupby().unstack().
Q57 · L5

Pandas window operations: rolling and expanding

df.rolling(window=7).mean(). df.expanding().sum(). Equivalent to SQL window functions.
Q58 · L5

Pandas merge_asof for time-aligned join

pd.merge_asof for nearest-key join (e.g., feature ts <= label ts). Used in ML feature engineering for point-in-time correctness.
Q59 · L5

Pandas chunked groupby for large data

Iterate chunks, partial agg per chunk, combine at end. For groupby that exceeds memory. Or: dask / polars / spark.
Q60 · L5

Type hints with TypedDict and dataclasses

TypedDict for record schemas. dataclass for value objects. Both improve IDE support and static checking; add runtime validation via Pydantic if needed.
Q61 · L5

Context manager with __enter__ and __exit__

Class with __enter__ (returns resource) and __exit__ (cleanup). Or @contextmanager decorator with yield.
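Both shapes of the answer, sketched side by side. The "resource" here is a plain event list standing in for a connection or file handle (a deliberate simplification).

```python
from contextlib import contextmanager

class ManagedResource:
    """Class-based context manager: acquire in __enter__, release in __exit__."""
    def __enter__(self):
        self.events = ["open"]
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("close")  # cleanup runs even if the body raised
        return False  # returning False re-raises any exception

@contextmanager
def managed_resource():
    """Generator-based equivalent: code before yield is __enter__,
    the finally block is __exit__."""
    events = ["open"]
    try:
        yield events
    finally:
        events.append("close")

with ManagedResource() as r:
    r.events.append("work")
print(r.events)  # ['open', 'work', 'close']

with managed_resource() as ev:
    ev.append("work")
print(ev)  # ['open', 'work', 'close']
```

The decorator form is shorter; the class form is clearer when the manager needs state or is reused across methods.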
Q62 · L5

Custom exception with chained context

raise CustomError('msg') from original_error. Preserves traceback. Common in ETL error handling for actionable failure messages.
Q63 · L6

Multiprocessing vs threading for I/O vs CPU

I/O bound: threading or asyncio. CPU bound: multiprocessing (GIL constraint). Discuss process pool sizing, shared memory, pickling overhead.
Q64 · L6

Cython, numba, or polars for performance

Cython: C-level perf with Python syntax. Numba: JIT for numeric code. Polars: pandas-replacement, Rust-backed. Pick based on workload type.
Q65 · L6

Property-based testing with hypothesis

@given with strategy objects (e.g. @given(st.integers())) generates inputs. Catches edge cases unit tests miss. Used for critical data-quality functions where the input space is large.

Data Modeling: 20 Questions

Schema design, SCD, conformed dimensions, and modern lakehouse patterns. Practice drawing on a whiteboard while narrating the grain first.

Q66 · L4

Star schema for e-commerce

Grain: one row per order line item. Fact: line totals, FKs. Dims: customer (split Type 1 + Type 2), product, date.
Q67 · L4

Define grain of fact table

One sentence: 'one row per X per Y'. State before drawing. Determines if rest of model is right.
Q68 · L4

Surrogate vs natural keys

Surrogate when natural is unstable, when SCD Type 2 needed, or when natural is composite. Natural for stable, simple, human-readable joins.
Q69 · L4

Fact vs dimension classification

Facts: events, measures, tall and skinny. Dimensions: descriptive context, short and wide. Trick: 'dim_event' is usually a fact poorly named.
Q70 · L4

Star vs snowflake schema

Star: dims denormalized. Snowflake: dims normalized. Star wins for analytics in 90% of cases. Snowflake when a dimension is huge and its normalized sub-dimensions are rarely queried together.
Q71 · L4

Conformed dimension across marts

Same dim_customer across all marts. Single source of truth. Critical for cross-mart joins to work cleanly.
Q72 · L5

SCD Type 1 vs Type 2 vs Type 3

Type 1: overwrite. Type 2: history via valid_from/valid_to. Type 3: previous-value column. Pick by query pattern: do you need history? do you need point-in-time?
Q73 · L5

SCD Type 2 implementation

Surrogate, natural, valid_from, valid_to, is_current. Expire-and-insert pattern. Facts join on surrogate for point-in-time correctness.
Q74 · L5

Slowly changing facts (corrections)

Append-only with version column (audit), or in-place with audit log table. Trade-off: query simplicity vs storage vs auditability.
Q75 · L5

Bridge table for many-to-many

Two FKs, one to each dim. Optional weighting for fractional attribution. Common in healthcare, retail, ad tech.
Q76 · L5

Late-arriving dimensions

Insert placeholder dim row, mark is_late=true. Update when real arrives. Or backfill facts. Volunteering this is a senior signal.
Q77 · L5

Late-arriving facts

Re-aggregate downstream rollups when late fact lands. Or: process via separate late-data pipeline. Trade-off: complexity vs SLA.
Q78 · L5

Medallion architecture trade-offs

Bronze: raw, immutable. Silver: cleaned, conformed. Gold: business-ready. Trade-off: storage 3x vs query perf vs auditability vs reprocessing.
Q79 · L5

Iceberg vs Delta time-travel for SCD

Both support time travel. Often replaces SCD Type 2 for slowly-changing-dim use case. Trade-off: query simplicity vs file overhead vs metadata size.
Q80 · L5

Partitioning strategy for fact tables

Date partition for time-series facts. Composite (date + region) for multi-tenant. Avoid high-cardinality partitions (>10K). Mention partition pruning impact.
Q81 · L5

Clustering keys (Snowflake) and Z-ordering (Delta)

Both reduce scanned data on filter queries. Clustering keys: physical sort within micro-partitions. Z-order: multi-column locality. Pick top 2-3 most-filtered columns.
Q82 · L5

Schema evolution: adding nullable column

Always backward compatible. Producers add field, consumers ignore. Trickier: changing types, removing fields. Use Avro/Protobuf with schema registry.
Q83 · L5

Wide table vs star schema for analytics

Wide: pre-joined, denormalized. Star: facts + dims, joined per query. Wide wins for ML training and dashboard performance. Star wins for ad-hoc analysis.
Q84 · L6

Data Vault 2.0 vs Kimball

Vault: hubs, links, satellites. Audit-trail-friendly. Kimball: star schema, simpler queries. Pick Vault for regulated industries; Kimball for everything else.
Q85 · L6

Multi-region data model with conflict resolution

Region-local writes, async cross-region replication. Last-writer-wins for most cases; CRDTs for counters. Cost: 2x storage, complex consistency.

System Design: 10 Architectures

Use the 4-step framework: clarify, draw, narrate, fail. 60 minutes per architecture in practice.

Q86 · L4

Daily ETL Postgres -> Snowflake

Debezium CDC -> Kafka -> S3 raw -> Spark ETL with run_id -> Snowflake MERGE. Idempotent. Backfill via DAG params.
Q87 · L5

Real-time clickstream at 200K events/sec

Kafka 100 partitions key=user_id -> Flink stateful exactly-once sessionize -> S3 + Materialize. Hourly Spark to Snowflake. Cover 3 failure modes.
Q88 · L5

Online + offline ML feature store

Real-time: Flink -> Redis (10ms reads). Batch: Spark -> S3 feature parquet. Training: as_of_join with feature_ts <= label_ts to prevent leakage.
Q89 · L5

Daily reconciliation for payments

Debezium -> Kafka -> S3 raw immutable -> idempotent Spark with run_id -> Snowflake MERGE on (txn_id, run_id). Audit by source_event_id.
Q90 · L5

A/B test analysis pipeline

Event ingest -> exposure log -> outcome log -> daily aggregation by experiment_id and variant. Statistical significance computed in serving layer (Materialize or downstream service).
Q91 · L5

Recommendation feature pipeline

User events -> Kafka -> Flink (real-time features) + Spark daily (batch features) -> dual-write to Redis and S3. Feast catalog for discovery. Point-in-time correct training data.
Q92 · L5

Search index pipeline

Source documents -> Kafka -> Flink (extract, enrich) -> dual-write to Elasticsearch + S3. Full reindex via Spark batch when schema changes. Cover lag, hot keys, version cutover.
Q93 · L5

Multi-tenant data warehouse with row-level security

Single warehouse with tenant_id partition. Row-level security via masking policy. Cost-allocation via tenant-tagged jobs. Trade-off: shared infra cost vs blast radius.
Q94 · L6

Multi-region active-active warehouse

Region-local writes, async cross-region replication. Conflict resolution: last-writer-wins or CRDT. Tiered SLAs. Cost: 2x storage minimum, complex consistency model.
Q95 · L6

Cost-optimized lakehouse with tiered storage

Hot (1 day, S3 standard, fully indexed). Warm (1-30 days, S3 IA). Cold (30+ days, Glacier, queryable via Athena with SLA). Compaction job nightly. Trade-off: cost vs query latency.

Behavioral: 5 STAR-D Stories

Prepare 5 stories covering the 5 evergreen themes. Specific numbers required. End each with a decision postmortem.

Q96 · L4-L5

Project with measurable impact

STAR-D: specific numbers (latency, dollars, hours, consumers), then what you would do differently. The postmortem is the L5 signal.
Q97 · L4-L5

Disagreement with stakeholder

How you held position with data, listened to counter, changed mind when warranted. Wrong: you've never been wrong. Right: a specific resolution and lesson.
Q98 · L5-L6

Real failure with consequences

Real failure with real cost. Root cause, process change, what you'd tell someone facing same setup today. Faux failures are instant downgrade.
Q99 · L5

Project with ambiguous requirements

How you framed decisions, gathered inputs, committed when commitment mattered more than certainty. Avoid 'we did agile' framing; describe the actual decision.
Q100 · L5-L6

Leading without authority or mentoring

Specific person you mentored or specific decision you championed. Who pushed back. How you brought them along. Outcome 6+ months later.

How to Use the 100

Drill in domain order: SQL (16h), Python (10h), modeling (8h), design (10h), behavioral (3h). Total: 47 hours of focused practice. At 2 hours per day, that's 4 weeks.

Pair with the round deep guides for context: how to pass the SQL round, how to pass the Python round, how to pass the data modeling round, how to pass the system design round, how to pass the behavioral round.

Targeting FAANG specifically? After drilling these 100, open FAANG Data Engineer interview questions and answers for FAANG-tagged variants.

Data Engineer Interview Prep FAQ

How is this list different from the top 50?
The top 50 is the highest-leverage subset for time-pressed prep. This list is the same 50 plus 50 more for additional depth. The 100 covers more dialect-specific tricks, more L5+ patterns, and more system-design variants.
Are all 100 questions answered in full?
Yes, in this on-page version and in the downloadable PDF. Each question has a worked answer with reasoning, edge cases, and the typical follow-up.
How long should I take to drill all 100?
47 hours of focused practice spread over 4 to 6 weeks. Don't cram. Spaced repetition over weeks beats marathon sessions.
Can I skip the behavioral section if I'm focused on technical?
No. 47% of L5 rejections cite behavioral as the deciding factor even when technical was strong. The 5 behavioral questions take 3 hours; skipping them is a high-leverage mistake.
What if I see a question on this list in my interview?
Treat it like any other. Restate, plan out loud, code while narrating. Do not signal you've seen it; that can be a downgrade in some loops.
Does this cover analytics engineer questions?
About 60% overlaps. For analytics-engineer-specific prep, also see our analytics engineer interview guide for dbt, semantic layer, and BI workflow questions.

Practice the 100 Questions in the Browser

Reading the answers is step one. Run SQL and Python in our sandbox to build the muscle memory that wins the offer.

Start Practicing Now

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
