Every question below comes from verified interview rounds at real companies. We include the difficulty level, which companies ask similar questions, what the interviewer is actually evaluating, and a concise approach hint. These 50 questions cover all 5 data engineering domains: SQL, Python, Data Modeling, Pipeline Architecture, and Spark. DataDriven has 1,000+ more with full solutions, real code execution, and AI grading.
Each question below includes four pieces of information that most question lists skip. First, the companies field tells you which companies have asked similar questions in real interviews. Second, the difficulty tag (Easy, Medium, Hard) matches the grading scale used by FAANG interviewers. Third, the "what the interviewer is really testing" field reveals the hidden evaluation criteria behind the question. Most candidates answer the surface-level question without addressing the deeper skill the interviewer is probing. Fourth, the approach hint gives you a starting direction without spoiling the full solution.
Don't read these passively. For each question, spend 5 minutes thinking about your approach before reading the hint. Then open DataDriven and solve it with real code execution. The AI grader will tell you what you missed.
The distribution across domains mirrors real interview frequency. SQL gets the most questions (41% of real DE interviews are SQL). Spark gets a dedicated section because senior and Spark-specific roles test it heavily, even though it appears in a smaller percentage of overall interview loops.
10 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.
What the interviewer is really testing
Whether you can think in sets rather than loops. The interviewer wants to see if you reach for HAVING COUNT(DISTINCT month) = 12 or if you try to write 12 separate conditions.
Approach hint
Extract the month from order_date with DATE_TRUNC, COUNT DISTINCT months per customer, filter with HAVING. A CTE keeps it readable.
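A minimal sketch of this pattern, run against SQLite to keep it executable (table and column names are assumptions, and SQLite uses strftime where other databases have DATE_TRUNC):

```python
import sqlite3

# Hypothetical orders table: customer 1 ordered in all 12 months of 2024,
# customer 2 in only two of them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT)")
rows = [(1, f"2024-{m:02d}-15") for m in range(1, 13)]
rows += [(2, "2024-01-10"), (2, "2024-03-22")]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

query = """
WITH monthly AS (
    SELECT customer_id, strftime('%m', order_date) AS month
    FROM orders
    WHERE strftime('%Y', order_date) = '2024'
)
SELECT customer_id
FROM monthly
GROUP BY customer_id
HAVING COUNT(DISTINCT month) = 12
"""
result = conn.execute(query).fetchall()
print(result)  # [(1,)]
```

The CTE isolates the month extraction so the HAVING clause reads as plain business logic.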
What the interviewer is really testing
Your awareness of percentile functions and performance. Many candidates try to sort all rows and pick the middle one, which is O(n log n). The interviewer wants to see PERCENTILE_CONT or a ROW_NUMBER approach.
Approach hint
Use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration) if the database supports it. Otherwise, ROW_NUMBER with a total count and filter for the middle row(s). Mention that exact median on billions of rows may require approximate methods.
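The ROW_NUMBER fallback can be sketched in SQLite, which (like several databases) has no PERCENTILE_CONT; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (duration INT)")
conn.executemany("INSERT INTO rides VALUES (?)", [(5,), (7,), (9,), (11,)])

# Number the rows once, count them once, then pick the middle row(s).
# For an even count this averages the two middle values.
query = """
WITH ordered AS (
    SELECT duration,
           ROW_NUMBER() OVER (ORDER BY duration) AS rn,
           COUNT(*) OVER () AS cnt
    FROM rides
)
SELECT AVG(duration) FROM ordered
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
"""
median = conn.execute(query).fetchone()[0]
print(median)  # 8.0 (average of 7 and 9)
```

The rn IN (...) trick handles odd and even counts with one expression, since both divisions land on the same row when the count is odd.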
What the interviewer is really testing
ROW_NUMBER partitioning. This is the single most common SQL pattern in DE interviews. The interviewer is checking if you can write it from memory in under 3 minutes.
Approach hint
ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY updated_at DESC) in a CTE. Filter WHERE rn = 1. Mention that in production you would use this inside a MERGE or DELETE statement.
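A runnable sketch of the dedup pattern, using SQLite and an assumed user_products table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_products (user_id INT, product_id INT, price REAL, updated_at TEXT)"
)
conn.executemany("INSERT INTO user_products VALUES (?, ?, ?, ?)", [
    (1, 10, 9.99, "2024-01-01"),
    (1, 10, 12.49, "2024-02-01"),   # newer duplicate: this row should win
    (2, 20, 5.00, "2024-01-15"),
])

query = """
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, product_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM user_products
)
SELECT user_id, product_id, price
FROM ranked
WHERE rn = 1
ORDER BY user_id
"""
deduped = conn.execute(query).fetchall()
print(deduped)  # [(1, 10, 12.49), (2, 20, 5.0)]
```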
What the interviewer is really testing
Window function framing with PARTITION BY. The reset-per-month requirement catches candidates who forget to partition by the month.
Approach hint
SUM(revenue) OVER (PARTITION BY DATE_TRUNC('month', day) ORDER BY day ROWS UNBOUNDED PRECEDING). Walk through why ROWS matters here instead of RANGE.
What the interviewer is really testing
LAG with aggregation. You need to aggregate first (weekly counts), then apply the window function. Many candidates try to do both in a single query and get confused.
Approach hint
CTE 1: aggregate to weekly login counts per user. CTE 2: LAG(login_count) OVER (PARTITION BY user_id ORDER BY week). Final query: filter where current < 0.5 * previous.
What the interviewer is really testing
Recursive CTEs with accumulation. The quantity multiplication across levels separates strong candidates from average ones.
Approach hint
Base case: direct components of the root product with their quantities. Recursive step: join to the BOM table and multiply parent quantity by child quantity. Sum at the end to get total quantities per leaf component.
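The quantity multiplication can be sketched with a tiny hypothetical BOM (a bike with two wheels, each with 32 spokes), again using SQLite for a runnable recursive CTE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bom (parent TEXT, child TEXT, qty INT)")
conn.executemany("INSERT INTO bom VALUES (?, ?, ?)", [
    ("bike", "wheel", 2),
    ("wheel", "spoke", 32),
    ("bike", "frame", 1),
])

# Base case: direct children of the root. Recursive step: multiply the
# accumulated quantity by each child's per-parent quantity.
query = """
WITH RECURSIVE expanded(component, total_qty) AS (
    SELECT child, qty FROM bom WHERE parent = 'bike'
    UNION ALL
    SELECT b.child, e.total_qty * b.qty
    FROM expanded e
    JOIN bom b ON b.parent = e.component
)
SELECT component, SUM(total_qty) AS total
FROM expanded
GROUP BY component
ORDER BY component
"""
result = conn.execute(query).fetchall()
print(result)  # [('frame', 1), ('spoke', 64), ('wheel', 2)]
```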
What the interviewer is really testing
JOIN logic and how you handle ties. The interviewer wants to see if you address the ambiguity: do you pick one product arbitrarily (ROW_NUMBER) or return all products (RANK)?
Approach hint
Ask the interviewer how to handle ties before writing code. Then use RANK or ROW_NUMBER accordingly, partitioned by customer_id and ordered by purchase_date.
What the interviewer is really testing
LEAD/LAG to compare adjacent rows. This is a classic gap detection problem that shows up in financial data engineering interviews.
Approach hint
Order by invoice_number, use LEAD(invoice_number) OVER (ORDER BY invoice_number). Where LEAD minus current is greater than 1, you have a gap. Output the start and end of each gap.
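A sketch of the gap detection, with a hypothetical invoices table in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_number INT)")
conn.executemany("INSERT INTO invoices VALUES (?)", [(1,), (2,), (5,), (6,), (9,)])

# Pair each invoice with the next one; a difference > 1 marks a gap,
# and the gap spans (current + 1) through (next - 1).
query = """
WITH nxt AS (
    SELECT invoice_number AS cur,
           LEAD(invoice_number) OVER (ORDER BY invoice_number) AS nxt
    FROM invoices
)
SELECT cur + 1 AS gap_start, nxt - 1 AS gap_end
FROM nxt
WHERE nxt - cur > 1
"""
gaps_found = conn.execute(query).fetchall()
print(gaps_found)  # [(3, 4), (7, 8)]
```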
What the interviewer is really testing
Sessionization is one of the most practical DE SQL problems. The interviewer is checking whether you can use LAG to detect gaps, then SUM a boolean flag to create session IDs.
Approach hint
LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp). Flag rows where the gap exceeds 30 minutes. SUM the flags as a running total to create session IDs.
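The two-step LAG-then-SUM pattern can be sketched like this (timestamps are Unix seconds, so the 30-minute threshold is 1800; the events table is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, ts INT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, 0), (1, 600), (1, 4000),   # 4000 - 600 > 1800, so a new session starts
    (2, 100),
])

# Step 1: flag rows whose gap from the previous event exceeds 30 minutes.
# Step 2: a running SUM of the flags becomes the session id.
query = """
WITH gaps AS (
    SELECT user_id, ts,
           CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800
                THEN 1 ELSE 0 END AS new_session
    FROM events
)
SELECT user_id, ts,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM gaps
ORDER BY user_id, ts
"""
sessions = conn.execute(query).fetchall()
print(sessions)  # [(1, 0, 0), (1, 600, 0), (1, 4000, 1), (2, 100, 0)]
```

Note that the first row of each user has a NULL gap, which the CASE expression maps to 0, so every user's first session id is 0.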
What the interviewer is really testing
Conditional aggregation. If you reach for PIVOT syntax, the interviewer may ask you to do it with CASE WHEN instead, since not all databases support PIVOT.
Approach hint
MAX(CASE WHEN subject = 'Math' THEN score END) AS math_score. Group by student_id. Mention that this approach works across all SQL dialects.
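A minimal sketch of the CASE WHEN pivot (the scores table and subject names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student_id INT, subject TEXT, score INT)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", [
    (1, "Math", 90), (1, "Science", 85), (2, "Math", 75),
])

# One CASE expression per output column; MAX collapses each group to one row.
query = """
SELECT student_id,
       MAX(CASE WHEN subject = 'Math'    THEN score END) AS math_score,
       MAX(CASE WHEN subject = 'Science' THEN score END) AS science_score
FROM scores
GROUP BY student_id
ORDER BY student_id
"""
pivoted = conn.execute(query).fetchall()
print(pivoted)  # [(1, 90, 85), (2, 75, None)]
```

Missing subject/student combinations come out as NULL, which is usually the behavior you want from a pivot.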
10 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.
What the interviewer is really testing
Memory-efficient processing with generators or heapq. The interviewer is filtering for candidates who understand that pandas.read_csv on 50GB will crash.
Approach hint
Use a min-heap of size 10 (heapq.nsmallest or nlargest). Read the file line by line or in chunks with csv.reader. Never call pd.read_csv() without chunksize on a 50GB file.
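A sketch of the bounded-heap approach (here finding the 10 smallest values; the single-column CSV layout is an assumption for illustration):

```python
import csv
import heapq
import io

def top_n_smallest(lines, n=10):
    """Stream rows and keep only the n smallest values seen so far."""
    reader = csv.reader(lines)
    heap = []  # max-heap of size n, simulated by storing negated values
    for row in reader:
        value = float(row[0])
        if len(heap) < n:
            heapq.heappush(heap, -value)
        elif -heap[0] > value:          # the largest kept value loses to the new one
            heapq.heapreplace(heap, -value)
    return sorted(-v for v in heap)

# Simulate a file of 100 values in descending order; memory stays O(n).
data = io.StringIO("\n".join(str(v) for v in range(100, 0, -1)))
top10 = top_n_smallest(data)
print(top10)  # [1.0, 2.0, ..., 10.0]
```

The same shape works for the n largest by flipping the comparison (or just using heapq.nlargest on the streamed values).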
What the interviewer is really testing
Error handling patterns in data pipelines. The interviewer wants to see try/except with specific exceptions, not a bare except that swallows everything.
Approach hint
Define expected types and required fields. Iterate records, validate each field, collect errors per record. Valid records go to output, invalid records go to a DLQ list with the error message attached. Return both lists.
What the interviewer is really testing
Decorator syntax and production patterns. Jitter is the key detail that separates textbook answers from real-world experience.
Approach hint
Inner function catches exceptions, sleeps for (2 ** attempt) + random.uniform(0, 1) seconds, retries. After max retries, re-raise the last exception. Mention functools.wraps to preserve the original function metadata.
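A sketch of the decorator under those assumptions (the injectable sleep parameter is added here so the demo runs instantly; a production version would just call time.sleep):

```python
import functools
import random
import time

def retry(max_attempts=3, base_delay=1.0, sleep=time.sleep):
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of retries: surface the last exception
                    # Exponential backoff plus jitter to avoid thundering herds.
                    sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
        return wrapper
    return decorator

calls = []

@retry(max_attempts=4, sleep=lambda d: None)  # no real sleeping in the demo
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"

result = flaky()
print(result, len(calls))  # ok 3
```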
What the interviewer is really testing
Recursion and string manipulation. This is one of the most common Python DE questions. The interviewer may add follow-up constraints: handle lists, handle None values, limit depth.
Approach hint
Recursive function that takes a dictionary and a prefix. For each key, if the value is a dict, recurse with prefix + key + '.'. Otherwise, add to the result. Handle edge cases: empty dicts, None values.
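A sketch of the recursive flattener with dot-separated keys; the handling of empty dicts and None values here is one reasonable choice, not the only one:

```python
def flatten(d, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} form."""
    out = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:   # non-empty dict: recurse deeper
            out.update(flatten(value, path + "."))
        else:                                   # leaf: includes None and empty dicts
            out[path] = value
    return out

nested = {"a": 1, "b": {"c": 2, "d": {"e": None}}, "f": {}}
flat = flatten(nested)
print(flat)
# {'a': 1, 'b.c': 2, 'b.d.e': None, 'f': {}}
```

For the follow-up constraints, list handling usually means emitting indexed keys like 'b.0', and depth limiting means threading a counter through the recursion.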
What the interviewer is really testing
Concurrency fundamentals. The interviewer wants to see proper use of Queue (thread-safe), a poison pill or sentinel to signal shutdown, and clean thread cleanup.
Approach hint
Producer thread reads data and calls queue.put(). Consumer thread loops on queue.get() until it receives a sentinel value. Use threading.Thread with daemon=True. Put the sentinel after all data is produced.
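A minimal sketch of the sentinel pattern (the doubling "work" in the consumer is just a placeholder):

```python
import queue
import threading

SENTINEL = object()  # poison pill: unambiguous end-of-stream marker
q = queue.Queue(maxsize=10)
results = []

def producer(items):
    for item in items:
        q.put(item)
    q.put(SENTINEL)  # signal the consumer that no more data is coming

def consumer():
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * 2)  # placeholder for real processing

t_cons = threading.Thread(target=consumer)
t_prod = threading.Thread(target=producer, args=(range(5),))
t_cons.start()
t_prod.start()
t_prod.join()
t_cons.join()
print(results)  # [0, 2, 4, 6, 8]
```

With multiple consumers you would put one sentinel per consumer, since each get() removes it from the queue.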
What the interviewer is really testing
Data integrity and change detection. This is a practical pipeline problem, not an algorithm puzzle. The interviewer wants to see that you think about column ordering, NULL handling, and type consistency.
Approach hint
Sort columns alphabetically, convert each row to a canonical string representation, hash with hashlib.sha256. Combine row hashes into a single checksum. Mention that in production, you might hash a sample for performance.
What the interviewer is really testing
File processing at scale and the collections module. The interviewer expects you to use Counter or defaultdict, not a hand-rolled counting loop.
Approach hint
Read line by line. Parse with a regex or split(). Use collections.defaultdict(Counter) keyed by status code. After processing, extract top 100 from each status code's Counter with most_common(100).
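A sketch of the nested-counter structure; the "METHOD /path STATUS" line format is a simplifying assumption (real access logs need a regex):

```python
import collections

lines = [
    "GET /home 200",
    "GET /home 200",
    "GET /missing 404",
    "POST /login 200",
]

# defaultdict(Counter): outer key is the status code, inner Counter tallies URLs.
by_status = collections.defaultdict(collections.Counter)
for line in lines:
    _, url, status = line.split()
    by_status[status][url] += 1

top = {status: counter.most_common(2) for status, counter in by_status.items()}
print(top)
# {'200': [('/home', 2), ('/login', 1)], '404': [('/missing', 1)]}
```

Swap most_common(2) for most_common(100) and iterate the file object directly to match the original problem at scale.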
What the interviewer is really testing
Time-based data structures and the deque module. A fixed window is easy; a sliding window requires tracking individual request timestamps.
Approach hint
Use collections.deque to store timestamps. On each request, remove timestamps older than 1 second from the left. If len(deque) < N, allow the request and append the current timestamp. Otherwise, deny.
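A sketch of the sliding-window limiter; timestamps are passed in explicitly so the demo is deterministic, where production code would call time.monotonic():

```python
import collections

class RateLimiter:
    """Allow at most n requests per sliding window of window_s seconds."""

    def __init__(self, n, window_s=1.0):
        self.n = n
        self.window_s = window_s
        self.timestamps = collections.deque()

    def allow(self, now):
        # Evict timestamps that have slid out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) < self.n:
            self.timestamps.append(now)
            return True
        return False

limiter = RateLimiter(n=2, window_s=1.0)
decisions = [limiter.allow(t) for t in (0.0, 0.1, 0.2, 1.05)]
print(decisions)  # [True, True, False, True] - 0.2 denied, 0.0 expired by 1.05
```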
What the interviewer is really testing
Iterator protocol and lazy evaluation. The constraint against materializing is the key part. The interviewer wants a generator function.
Approach hint
Use heapq.merge, or implement manually: peek at the head of each iterator, yield the smaller one, advance that iterator. Handle exhaustion of one iterator by yielding the rest of the other.
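The manual version can be sketched as a generator (assuming, for simplicity, that neither input contains None, which is used as the exhaustion marker here):

```python
def merge_sorted(a, b):
    """Lazily merge two sorted iterables without materializing either."""
    ia, ib = iter(a), iter(b)
    va = next(ia, None)
    vb = next(ib, None)
    while va is not None and vb is not None:
        if va <= vb:
            yield va
            va = next(ia, None)
        else:
            yield vb
            vb = next(ib, None)
    # One side is exhausted; drain the other, still lazily.
    while va is not None:
        yield va
        va = next(ia, None)
    while vb is not None:
        yield vb
        vb = next(ib, None)

merged = list(merge_sorted([1, 4, 9], [2, 3, 10]))
print(merged)  # [1, 2, 3, 4, 9, 10]
```

In an interview, mentioning that heapq.merge does this for any number of iterators, with a proper sentinel internally, is a strong closing point.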
What the interviewer is really testing
Time-window joins in Python. This is the Python equivalent of an interval join. The interviewer wants to see efficient lookup (sorted + bisect or a dict of sorted timestamps) rather than O(n^2) nested loops.
Approach hint
Group events by user. For each user, sort views and purchases by timestamp. For each view, binary search purchases within 24 hours. Count attributed conversions. Conversion rate = attributed views / total views.
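A sketch of the binary-search attribution; the per-user dicts of sorted Unix-second timestamps are assumed inputs:

```python
import bisect

views = {"u1": [0, 1000, 90000], "u2": [50]}
purchases = {"u1": [500, 100000], "u2": []}
WINDOW = 24 * 3600  # 24 hours in seconds

attributed = 0
total_views = 0
for user, user_views in views.items():
    bought = purchases.get(user, [])
    for v in user_views:
        total_views += 1
        # Binary search for the first purchase at or after this view.
        i = bisect.bisect_left(bought, v)
        if i < len(bought) and bought[i] - v <= WINDOW:
            attributed += 1

print(attributed, total_views)  # 2 4 -> conversion rate = 2 / 4
```

Each view costs O(log p) instead of O(p), which is what turns the nested-loop O(n^2) answer into an acceptable one.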
8 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.
What the interviewer is really testing
Grain selection and SCD handling. The tricky part is modeling menu item prices that change over time: do you snapshot the price at order time, or reference a slowly changing dimension?
Approach hint
Fact table: order_items at one row per item per order. Store the price at time of order in the fact table (denormalize). Dimensions: dim_restaurant, dim_menu_item, dim_customer, dim_driver, dim_date. Discuss why the price snapshot belongs in the fact table.
What the interviewer is really testing
SCD Type 2 for subscription state transitions. The interviewer wants to see how you track the full lifecycle of a subscription: trial start, conversion, plan changes, churn.
Approach hint
SCD Type 2 on dim_subscription with effective_date, end_date, is_current. Fact table: subscription_events (event_type, old_plan, new_plan, mrr_change). This lets analysts compute MRR, churn rate, and upgrade paths.
What the interviewer is really testing
Whether you can identify the right grain when there are multiple event types (likes, comments, shares, views). One fact table or multiple?
Approach hint
One fact table: engagement_events with event_type as a degenerate dimension. Grain: one row per engagement event. Dimensions: dim_user (who engaged), dim_content (what was engaged with), dim_content_creator, dim_date. Discuss the tradeoff of one wide fact table vs separate fact tables per event type.
What the interviewer is really testing
Graph modeling in a relational database. Lineage is a DAG, and the interviewer wants to see if you can model nodes and edges cleanly.
Approach hint
Tables: datasets (id, name, schema, owner) and lineage_edges (source_dataset_id, target_dataset_id, pipeline_id, last_run_at). Use a recursive CTE to traverse upstream or downstream. Discuss how to handle schema-level lineage (column-to-column) vs table-level.
What the interviewer is really testing
Performance awareness in data modeling. SCD Type 2 on 500M rows requires specific strategies: partitioning, incremental processing, hash-based change detection.
Approach hint
Hash all non-key columns into a checksum. Compare incoming records by hash, not by individual columns. Process only changed records. Partition the dimension by a date range or business key range. Mention that a full table scan comparison on 500M rows is not viable in production.
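The hash comparison can be sketched per record like this (column names are hypothetical; sorting the columns makes the hash independent of dict ordering):

```python
import hashlib

def row_hash(record, key_cols=("customer_id",)):
    """Hash all non-key columns in a fixed (sorted) column order."""
    payload = "|".join(
        f"{col}={record[col]}" for col in sorted(record) if col not in key_cols
    )
    return hashlib.sha256(payload.encode()).hexdigest()

existing = {"customer_id": 1, "city": "Austin", "tier": "gold"}
incoming_same = {"customer_id": 1, "city": "Austin", "tier": "gold"}
incoming_changed = {"customer_id": 1, "city": "Denver", "tier": "gold"}

unchanged = row_hash(existing) == row_hash(incoming_same)
changed = row_hash(existing) != row_hash(incoming_changed)
print(unchanged, changed)  # True True: skip the first, expire-and-insert the second
```

In the warehouse you would store this hash on the dimension row and compare hashes in the MERGE condition, so only changed records touch the SCD logic.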
What the interviewer is really testing
Role-playing dimensions. Both buyers and sellers are people, but they play different roles. The interviewer wants to see how you model this cleanly.
Approach hint
One dim_user table with buyer and seller as foreign keys in the fact table: fact_transactions(buyer_id, seller_id, ...). Or create role-specific views. Discuss the tradeoff: single dimension with role-playing aliases vs duplicated dimensions.
What the interviewer is really testing
Pragmatism vs theory. The interviewer wants to see that you know when denormalization wins: when the dimension cardinality is low, joins are expensive, and the consumer is a BI tool that performs better on flat tables.
Approach hint
OBT works well for small to medium datasets with low-cardinality dimensions, where the primary consumer is a BI tool. Star schema is better when you have high-cardinality dimensions, need to update dimension attributes without rewriting the fact table, or serve multiple use cases from the same model.
What the interviewer is really testing
Many-to-many relationships and bridge tables. A single encounter can have multiple diagnoses, and a single diagnosis can span multiple encounters. The interviewer checks if you know to use a bridge table.
Approach hint
Fact: encounters. Bridge table: encounter_diagnoses (encounter_id, diagnosis_id, is_primary). Dimensions: dim_patient, dim_provider, dim_diagnosis (ICD-10 codes), dim_date. Discuss HIPAA considerations for PII in dimension tables.
8 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.
What the interviewer is really testing
Stream processing architecture and latency requirements. The interviewer wants to hear Kafka plus Flink (or Kafka Streams), not a batch solution with 5-minute intervals.
Approach hint
Kafka ingestion, Flink for real-time feature computation (rolling transaction count, velocity checks), ML model scoring in the stream, output to a fast lookup store (Redis) and alert service. Discuss exactly-once semantics and what happens when the model service is down.
What the interviewer is really testing
Migration planning and dual-write patterns. The interviewer wants a phased approach, not a big-bang cutover.
Approach hint
Phase 1: historical backfill using bulk export/import. Phase 2: CDC (Change Data Capture) to stream ongoing changes from Oracle into Snowflake, keeping both systems in sync during the transition. Phase 3: validation (row counts, checksums, query result comparison). Phase 4: cutover with rollback plan. Discuss how to handle in-flight transactions during cutover.
What the interviewer is really testing
Abstraction and configuration-driven design. The interviewer wants to see a connector framework, not 50 separate scripts.
Approach hint
Abstract connector interface with methods: authenticate(), fetch_schema(), extract_batch(), get_state(). Configuration per source: API endpoint, auth type, rate limit, schema mapping. Orchestrate with Airflow, one DAG per source. State management for incremental extraction. Centralized error handling and monitoring.
What the interviewer is really testing
Systematic performance debugging. The interviewer does not want you to guess. They want a methodology: profile, identify bottlenecks, fix the biggest one first.
Approach hint
Profile each task's duration. Identify the top 3 bottlenecks. Common fixes: parallelize independent tasks, optimize the slowest SQL queries (missing indexes, bad join order, unnecessary full scans), increase compute resources for the bottleneck step, switch from full refresh to incremental loads. Quantify expected improvement for each fix.
What the interviewer is really testing
Proactive data quality thinking. The interviewer wants to see specific checks (not just 'we test the data'), alerting logic, and what happens when a check fails.
Approach hint
Checks: row count within expected range, NULL rate per column below threshold, referential integrity (no orphan foreign keys), freshness (data arrived within SLA), distribution drift (sudden spikes or drops). Run checks after each pipeline stage. On failure: block downstream processing, alert on-call, log to a data quality dashboard.
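The check-then-block shape can be sketched in a few lines (thresholds, column names, and the list-of-dicts batch format are all assumptions; real frameworks like Great Expectations formalize this):

```python
def run_checks(batch, expected_min_rows=1, max_null_rate=0.1):
    """Return a list of failed check names; empty means the batch passes."""
    failures = []
    if len(batch) < expected_min_rows:
        failures.append("row_count_below_minimum")
    for col in (batch[0] if batch else []):
        nulls = sum(1 for row in batch if row[col] is None)
        if nulls / len(batch) > max_null_rate:
            failures.append(f"null_rate_exceeded:{col}")
    return failures  # non-empty: block downstream processing and alert on-call

batch = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
failures = run_checks(batch, max_null_rate=0.4)
print(failures)  # ['null_rate_exceeded:email']
```

The key design point is that checks return structured results rather than raising immediately, so the orchestrator can decide whether to block, warn, or quarantine.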
What the interviewer is really testing
Whether you have real opinions based on experience, not just what you read in a blog post. The interviewer wants tradeoffs, not a feature checklist.
Approach hint
Airflow: mature ecosystem, huge community, but DAG serialization and scheduler performance degrade at scale. Dagster: asset-centric model is great for data-aware orchestration, but smaller community. Prefect: Pythonic API, good for dynamic workflows, but less battle-tested at 200+ DAG scale. Mention what you have actually used and what problems you hit.
What the interviewer is really testing
Late data handling and partition management. The 72-hour late arrival window means you cannot close partitions after one day. The interviewer wants to see watermarking or partition reprocessing strategies.
Approach hint
Partition by event_time, not ingestion_time. Keep partitions open for reprocessing for 72 hours. Use watermarks in the streaming layer. For the batch layer, reprocess the last 3 days on every run (idempotent overwrites). Discuss the storage cost of reprocessing vs the correctness benefit.
What the interviewer is really testing
Dual-serving architecture. The interviewer wants to see an offline store (data lake/warehouse for batch) and an online store (Redis/DynamoDB for real-time) with a shared feature registry.
Approach hint
Batch pipeline computes features and writes to both the offline store (Parquet/Delta) and the online store (Redis). Feature registry tracks feature definitions, versions, and SLAs. Point-in-time correctness for training: join features as of the label timestamp, not the current value. Mention training-serving skew as the biggest risk.
10 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.
What the interviewer is really testing
Systematic Spark debugging, not guesswork. The interviewer wants to see you check the Spark UI, identify the failing stage, and reason about data skew vs insufficient memory.
Approach hint
Check the Spark UI for the failing stage. Look at task durations: if most tasks finish in 10 seconds but a few take 30 minutes, you have skew. If all tasks are slow, you need more memory or fewer partitions. Common fixes: salting skewed keys, increasing spark.sql.shuffle.partitions, using broadcast joins for small tables.
What the interviewer is really testing
Shuffle awareness. coalesce() avoids a full shuffle by combining partitions on existing executors. repartition() triggers a full shuffle. The interviewer wants to know if you understand the performance implications.
Approach hint
Use coalesce(N) when reducing partitions (e.g., before writing to fewer output files). Use repartition(N) when increasing partitions or when you need even distribution (e.g., before a join on a skewed key). Note that coalesce can only decrease the partition count; asking it for a larger number is a silent no-op.
What the interviewer is really testing
Delta Lake MERGE syntax and SCD2 logic. The interviewer wants to see you handle inserts (new records), updates (changed records with history preservation), and unchanged records.
Approach hint
Use Delta Lake's MERGE INTO with a WHEN MATCHED AND changed THEN UPDATE (set end_date, is_current = false), plus a WHEN NOT MATCHED THEN INSERT. After the merge, insert new current rows for the changed records. Mention that this is a two-step operation: expire old rows, then insert new current rows.
What the interviewer is really testing
Join optimization strategies beyond broadcast. The default broadcast threshold is 10MB. The interviewer wants to see bucket joins, pre-filtering the large table, or a different approach entirely.
Approach hint
Options: (1) filter the broadcast table to reduce its size below threshold, (2) use bucket joins if both tables are bucketed on the join key, (3) increase spark.sql.autoBroadcastJoinThreshold (risky with 12GB), (4) pre-aggregate the smaller table. Discuss memory implications of each option.
What the interviewer is really testing
Schema inference and explode/posexplode. The interviewer wants to see you handle arrays within arrays, NULL fields, and schema evolution across files.
Approach hint
Use spark.read.json with schema inference. Use explode() to flatten top-level arrays, then select nested fields with dot notation. For deeply nested arrays, chain multiple explode calls. Discuss schema evolution: use mergeSchema option and handle missing fields with coalesce() or when().
What the interviewer is really testing
Deep understanding of Spark internals. Data skew means one partition has disproportionately more data, causing one task to run for hours while others finish in seconds.
Approach hint
Approach 1: Salt the skewed key (append a random number 0 to N, replicate the other table N times, join, then remove the salt). Approach 2: Adaptive Query Execution (AQE) in Spark 3.x automatically handles skew. Approach 3: Isolate the skewed key, broadcast-join it separately, union with the normal join result.
What the interviewer is really testing
End-to-end streaming architecture. The interviewer wants to see watermarking, output mode selection, checkpoint configuration, and trigger interval choices.
Approach hint
readStream from Kafka, withWatermark on event_time, groupBy(window(event_time, '5 minutes')), aggregate. writeStream with outputMode('append') rather than complete: append emits each window once the watermark finalizes it, while complete re-emits the entire result table on every trigger. Set checkpointLocation for exactly-once. Discuss trigger options: processingTime vs availableNow.
What the interviewer is really testing
Testing discipline. Many Spark developers don't test their transformations. The interviewer wants to see you create a SparkSession in the test, build small input DataFrames, run the function, and assert on the output.
Approach hint
Use pytest with a session-scoped SparkSession fixture. Create input DataFrames with spark.createDataFrame(). Call your transformation function. Assert on output.collect() results. Test edge cases: empty DataFrames, NULLs, single-row inputs. Mention chispa or spark-testing-base libraries for DataFrame comparison.
What the interviewer is really testing
The small file problem. Many small files hurt read performance (too many S3 LIST calls, too many tasks to open files). The interviewer wants practical solutions.
Approach hint
Coalesce before writing to reduce output partitions. Use repartition(N) where N targets files of 128MB to 256MB each. For Delta Lake, run OPTIMIZE to compact small files after writing. Discuss why this matters: downstream jobs reading 10,000 files spend more time on I/O overhead than actual computation.
What the interviewer is really testing
Whether you understand what happens between writing a query and running it. The interviewer wants to hear about logical plan, physical plan, predicate pushdown, and column pruning.
Approach hint
Catalyst stages: parsing (SQL to unresolved logical plan), analysis (resolve names), optimization (predicate pushdown, column pruning, constant folding, join reordering), physical planning (choose join strategy, scan method). Use explain(true) to see all stages. Mention that Tungsten handles the code generation for the final execution.
Most question lists on the internet fall into two traps. Either they list vague questions ("Tell me about a time you worked with data") with no technical depth, or they list hyper-specific questions ("Write a query to find the Nth highest salary") that nobody actually asks in 2026. The questions on this page avoid both traps.
They reflect real interview patterns. We analyzed 1,042 verified data engineering interview rounds across 275 companies to identify the question patterns that appear most frequently. Deduplication with ROW_NUMBER, sessionization with LAG, schema design for a two-sided marketplace: these patterns recur across companies because they test the skills that matter in production.
They include the hidden evaluation criteria. When an interviewer asks you to "deduplicate a table," they're not just testing whether you know ROW_NUMBER. They're testing whether you ask about the composite key, whether you handle ties, whether you mention that this is an idempotent pipeline pattern. The "what the interviewer is really testing" field gives you this meta-knowledge.
They cover all 5 domains. Most question lists focus on SQL only. Data engineering interviews in 2026 test five distinct domains: SQL (41%), Python (35%), Data Modeling (18%), Pipeline Architecture (3%), and Spark (for senior and specialized roles). Skipping any domain leaves a gap that interviewers will find.
They link to executable practice. Reading questions and hints is step one. Solving them in a real coding environment with AI feedback is step two. On DataDriven, every question runs against a real database (for SQL), real Python (for Python), and real PySpark (for Spark). Your code actually runs, and the AI grader checks correctness, flags edge cases you missed, and gives line-by-line feedback on your code.
Your practice time should mirror the frequency of each domain in real interviews. Here is the breakdown from our analysis of 1,042 verified rounds, along with the recommended number of practice questions for an 8-week prep plan.
SQL (41% of rounds). Recommended: 40 to 50 questions
Python (35%). Recommended: 20 to 30 questions
Data Modeling (18%). Recommended: 10 to 15 exercises
Pipeline Architecture (3%). Recommended: 8 to 10 designs
Spark (senior and specialized roles). Recommended: 10 to 15 questions
Practicing individual questions builds skills. Practicing full mock interviews builds stamina, time management, and the ability to perform under pressure. Here is how to transition from question practice to mock interview practice.
Weeks 1 to 4: question practice. Solve individual questions by domain. Focus on understanding the patterns, not memorizing solutions. When you can solve Medium SQL questions in under 15 minutes and Medium Python questions in under 20 minutes, you are ready for the next phase.
Weeks 5 to 6: timed rounds. Simulate single interview rounds. Set a 45-minute timer. Pick 2 to 3 questions from a single domain and solve them sequentially, explaining your thought process out loud. Grade yourself on correctness, speed, and communication quality.
Weeks 7 to 8: full mock interviews. Run complete mock interview loops. One SQL round (45 min), one Python round (45 min), one system design round (45 min), and one behavioral round (30 min). Take 15-minute breaks between rounds, just like a real onsite. DataDriven's mock interview simulator automates this: it selects questions matching your target company and level, times each round, and gives AI feedback at the end.
Every question includes a full solution, real code execution, and AI grading with line-by-line feedback. Practice the way the interview works.