50 Data Engineer Mock Interview Questions
50 mock interview questions sourced from verified data engineering interview rounds. Each entry names the companies that have asked similar prompts, what the interviewer is actually grading on, and an approach hint. Together they cover the 5 domains a real loop tests: SQL, Python, data modeling, pipeline architecture, and Spark.
50 mock interview questions sourced from real data engineering loops. SQL (10), Python (10), data modeling (8), pipeline architecture (8), and Spark (10). Each one names the companies it's been reported at, what the interviewer is actually assessing, and an approach hint. Free, no signup required to read.
How the 50 questions are distributed
Bars show how often each domain comes up in a real loop, weighted from verified interview reports. Spend time roughly in the same shape.
- SQL10 questions~41% of loops
Window functions, CTEs, dedup, sessionization, recursive queries, conditional aggregation.
- Python10 questions~35% of loops
Memory-efficient processing, error handling, decorators, concurrency, event-window joins.
- Data Modeling8 questions~18% of loops
Grain, SCD types, many-to-many bridges, graph models, scale tradeoffs.
- Pipeline Architecture8 questions~12% of loops
Streaming vs batch, migration planning, connector frameworks, SLA optimization, data quality.
- Spark10 questions~8% of loops
Memory debugging, join optimization, SCD2 merges, streaming with exactly-once, Catalyst.
SQL mock interview questions
10 questions. Each names the companies that ask similar prompts, what the interviewer is actually assessing, and an approach hint.
Given an orders table and a customers table, find customers who placed orders in every month of 2024.
What the interviewer is testing: Whether you can think in sets rather than loops. The interviewer wants to see if you reach for HAVING COUNT(DISTINCT month) = 12 or if you try to write 12 separate conditions. Approach hint: Extract the month from order_date with DATE_TRUNC, COUNT DISTINCT months per customer, filter with HAVING. A CTE keeps it readable.
Write a query to find the median session duration from a sessions table with millions of rows.
What the interviewer is testing: Your awareness of percentile functions and performance. Many candidates try to sort all rows and pick the middle one, which is O(n log n). The interviewer wants to see PERCENTILE_CONT or a ROW_NUMBER approach. Approach hint: Use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration) if the database supports it. Otherwise, ROW_NUMBER with a total count and filter for the middle row(s). Mention that exact median on billions of rows may require approximate methods.
Deduplicate a table with no primary key, keeping only the row with the latest updated_at per composite key (user_id, product_id).
What the interviewer is testing: ROW_NUMBER partitioning. This is the single most common SQL pattern in DE interviews. The interviewer is checking if you can write it from memory in under 3 minutes. Approach hint: ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY updated_at DESC) in a CTE. Filter WHERE rn = 1. Mention that in production you would use this inside a MERGE or DELETE statement.
Calculate the running total of daily revenue, resetting at the start of each month.
What the interviewer is testing: Window function framing with PARTITION BY. The reset-per-month requirement catches candidates who forget to partition by the month. Approach hint: SUM(revenue) OVER (PARTITION BY DATE_TRUNC('month', day) ORDER BY day ROWS UNBOUNDED PRECEDING). Walk through why ROWS matters here instead of RANGE.
Given a user_logins table, find users whose login count dropped by more than 50% from one week to the next.
What the interviewer is testing: LAG with aggregation. You need to aggregate first (weekly counts), then apply the window function. Many candidates try to do both in a single query and get confused. Approach hint: CTE 1: aggregate to weekly login counts per user. CTE 2: LAG(login_count) OVER (PARTITION BY user_id ORDER BY week). Final query: filter where current < 0.5 * previous.
Write a recursive CTE to build a bill of materials (BOM) explosion, showing all components of a finished product and their total quantities.
What the interviewer is testing: Recursive CTEs with accumulation. The quantity multiplication across levels separates strong candidates from average ones. Approach hint: Base case: direct components of the root product with their quantities. Recursive step: join to the BOM table and multiply parent quantity by child quantity. Sum at the end to get total quantities per leaf component.
Find the first purchase date for each customer and the product they bought on that date. Handle ties (same customer, same date, multiple products).
What the interviewer is testing: JOIN logic and how you handle ties. The interviewer wants to see if you address the ambiguity: do you pick one product arbitrarily (ROW_NUMBER) or return all products (RANK)? Approach hint: Ask the interviewer how to handle ties before writing code. Then use RANK or ROW_NUMBER accordingly, partitioned by customer_id and ordered by purchase_date.
Write a query to detect gaps in a sequence of invoice numbers for a given vendor.
What the interviewer is testing: LEAD/LAG to compare adjacent rows. This is a classic gap detection problem that shows up in financial data engineering interviews. Approach hint: Order by invoice_number, use LEAD(invoice_number) OVER (ORDER BY invoice_number). Where LEAD minus current is greater than 1, you have a gap. Output the start and end of each gap.
Given a table of page views with user_id and timestamp, assign session IDs using a 30-minute inactivity timeout.
What the interviewer is testing: Sessionization is one of the most practical DE SQL problems. The interviewer is checking whether you can use LAG to detect gaps, then SUM a boolean flag to create session IDs. Approach hint: LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp). Flag rows where the gap exceeds 30 minutes. SUM the flags as a running total to create session IDs.
Pivot a table of (student_id, subject, score) into columns: one column per subject.
What the interviewer is testing: Conditional aggregation. If you reach for PIVOT syntax, the interviewer may ask you to do it with CASE WHEN instead, since not all databases support PIVOT. Approach hint: MAX(CASE WHEN subject = 'Math' THEN score END) AS math_score. Group by student_id. Mention that this approach works across all SQL dialects.
Python mock interview questions
10 questions on memory-efficient processing, error handling, decorators, concurrency, and event-window joins.
Write a function that reads a 50GB CSV file and returns the top 10 rows by a numeric column without loading the entire file into memory.
What the interviewer is testing: Memory-efficient processing with generators or heapq. The interviewer is filtering for candidates who understand that pandas.read_csv on 50GB will crash. Approach hint: Use a min-heap of size 10 (heapq.nsmallest or nlargest). Read the file line by line or in chunks with csv.reader. Never call pd.read_csv() without chunksize on a 50GB file.
Implement a function that validates incoming records against a schema and routes invalid records to a dead letter queue.
What the interviewer is testing: Error handling patterns in data pipelines. The interviewer wants to see try/except with specific exceptions, not a bare except that swallows everything. Approach hint: Define expected types and required fields. Iterate records, validate each field, collect errors per record. Valid records go to output, invalid records go to a DLQ list with the error message attached. Return both lists.
Write a decorator that retries a function up to 3 times with exponential backoff and jitter.
What the interviewer is testing: Decorator syntax and production patterns. Jitter is the key detail that separates textbook answers from real-world experience. Approach hint: Inner function catches exceptions, sleeps for (2 ** attempt) + random.uniform(0, 1) seconds, retries. After max retries, re-raise the last exception. Mention functools.wraps to preserve the original function metadata.
Given a list of dictionaries with nested structures of varying depth, flatten each dictionary into a single-level dictionary with dot-separated keys.
What the interviewer is testing: Recursion and string manipulation. This is one of the most common Python DE questions. The interviewer may add follow-up constraints: handle lists, handle None values, limit depth. Approach hint: Recursive function that takes a dictionary and a prefix. For each key, if the value is a dict, recurse with prefix + key + '.'. Otherwise, add to the result. Handle edge cases: empty dicts, None values.
Implement a producer-consumer pipeline using Python's queue.Queue and threading.
What the interviewer is testing: Concurrency fundamentals. The interviewer wants to see proper use of Queue (thread-safe), a poison pill or sentinel to signal shutdown, and clean thread cleanup. Approach hint: Producer thread reads data and calls queue.put(). Consumer thread loops on queue.get() until it receives a sentinel value. Use threading.Thread with daemon=True. Put the sentinel after all data is produced.
Write a function that computes a hash-based checksum for a DataFrame to detect when source data has changed between pipeline runs.
What the interviewer is testing: Data integrity and change detection. This is a practical pipeline problem, not an algorithm puzzle. The interviewer wants to see that you think about column ordering, NULL handling, and type consistency. Approach hint: Sort columns alphabetically, convert each row to a canonical string representation, hash with hashlib.sha256. Combine row hashes into a single checksum. Mention that in production, you might hash a sample for performance.
Parse a 2GB Apache access log file and output the top 100 URLs by request count, grouped by HTTP status code.
What the interviewer is testing: File processing at scale and the collections module. The interviewer expects you to use Counter or defaultdict, not a hand-rolled counting loop. Approach hint: Read line by line. Parse with a regex or split(). Use collections.defaultdict(Counter) keyed by status code. After processing, extract top 100 from each status code's Counter with most_common(100).
Implement a simple rate limiter class that allows N requests per second using a sliding window.
What the interviewer is testing: Time-based data structures and the deque module. A fixed window is easy; a sliding window requires tracking individual request timestamps. Approach hint: Use collections.deque to store timestamps. On each request, remove timestamps older than 1 second from the left. If len(deque) < N, allow the request and append the current timestamp. Otherwise, deny.
Write a function that merges two sorted iterators into a single sorted iterator without materializing either one into a list.
What the interviewer is testing: Iterator protocol and lazy evaluation. The constraint against materializing is the key part. The interviewer wants a generator function. Approach hint: Use heapq.merge, or implement manually: peek at the head of each iterator, yield the smaller one, advance that iterator. Handle exhaustion of one iterator by yielding the rest of the other.
Given an event stream as a list of (timestamp, event_type, user_id) tuples, compute the conversion rate from 'view' to 'purchase' within a 24-hour attribution window.
What the interviewer is testing: Time-window joins in Python. This is the Python equivalent of an interval join. The interviewer wants to see efficient lookup (sorted + bisect or a dict of sorted timestamps) rather than O(n^2) nested loops. Approach hint: Group events by user. For each user, sort views and purchases by timestamp. For each view, binary search purchases within 24 hours. Count attributed conversions. Conversion rate = attributed views / total views.
Data modeling mock interview questions
8 questions on grain selection, SCD types, many-to-many relationships, graph modeling, and scale considerations.
Design a data model for a food delivery platform (restaurants, menus, orders, deliveries, ratings).
What the interviewer is testing: Grain selection and SCD handling. The tricky part is modeling menu item prices that change over time: do you snapshot the price at order time, or reference a slowly changing dimension? Approach hint: Fact table: order_items at one row per item per order. Store the price at time of order in the fact table (denormalize). Dimensions: dim_restaurant, dim_menu_item, dim_customer, dim_driver, dim_date. Discuss why the price snapshot belongs in the fact table.
Model a subscription SaaS product with free trials, upgrades, downgrades, and cancellations.
What the interviewer is testing: SCD Type 2 for subscription state transitions. The interviewer wants to see how you track the full lifecycle of a subscription: trial start, conversion, plan changes, churn. Approach hint: SCD Type 2 on dim_subscription with effective_date, end_date, is_current. Fact table: subscription_events (event_type, old_plan, new_plan, mrr_change). This lets analysts compute MRR, churn rate, and upgrade paths.
Design a star schema for a social media platform's engagement analytics.
What the interviewer is testing: Whether you can identify the right grain when there are multiple event types (likes, comments, shares, views). One fact table or multiple? Approach hint: One fact table: engagement_events with event_type as a degenerate dimension. Grain: one row per engagement event. Dimensions: dim_user (who engaged), dim_content (what was engaged with), dim_content_creator, dim_date. Discuss the tradeoff of one wide fact table vs separate fact tables per event type.
Model a data pipeline lineage system that tracks which tables feed into which downstream tables.
What the interviewer is testing: Graph modeling in a relational database. Lineage is a DAG, and the interviewer wants to see if you can model nodes and edges cleanly. Approach hint: Tables: datasets (id, name, schema, owner) and lineage_edges (source_dataset_id, target_dataset_id, pipeline_id, last_run_at). Use a recursive CTE to traverse upstream or downstream. Discuss how to handle schema-level lineage (column-to-column) vs table-level.
You have a dimension table with 500 million rows. How do you handle slowly changing dimensions at this scale?
What the interviewer is testing: Performance awareness in data modeling. SCD Type 2 on 500M rows requires specific strategies: partitioning, incremental processing, hash-based change detection. Approach hint: Hash all non-key columns into a checksum. Compare incoming records by hash, not by individual columns. Process only changed records. Partition the dimension by a date range or business key range. Mention that a full table scan comparison on 500M rows is not viable in production.
Design an analytics schema for a two-sided marketplace (buyers and sellers) with transactions, disputes, and payouts.
What the interviewer is testing: Role-playing dimensions. Both buyers and sellers are people, but they play different roles. The interviewer wants to see how you model this cleanly. Approach hint: One dim_user table with buyer and seller as foreign keys in the fact table: fact_transactions(buyer_id, seller_id, ...). Or create role-specific views. Discuss the tradeoff: single dimension with role-playing aliases vs duplicated dimensions.
When would you choose a One Big Table (OBT) over a star schema?
What the interviewer is testing: Pragmatism vs theory. The interviewer wants to see that you know when denormalization wins: when the dimension cardinality is low, joins are expensive, and the consumer is a BI tool that performs better on flat tables. Approach hint: OBT works well for small to medium datasets with low-cardinality dimensions, where the primary consumer is a BI tool. Star schema is better when you have high-cardinality dimensions, need to update dimension attributes without rewriting the fact table, or serve multiple use cases from the same model.
Model a healthcare system with patients, providers, encounters, diagnoses, and prescriptions.
What the interviewer is testing: Many-to-many relationships and bridge tables. A single encounter can have multiple diagnoses, and a single diagnosis can span multiple encounters. The interviewer checks if you know to use a bridge table. Approach hint: Fact: encounters. Bridge table: encounter_diagnoses (encounter_id, diagnosis_id, is_primary). Dimensions: dim_patient, dim_provider, dim_diagnosis (ICD-10 codes), dim_date. Discuss HIPAA considerations for PII in dimension tables.
Pipeline architecture mock interview questions
8 questions on streaming vs batch, migration planning, connector frameworks, SLA optimization, and data quality.
Design a real-time fraud detection pipeline that processes credit card transactions with sub-second latency.
What the interviewer is testing: Stream processing architecture and latency requirements. The interviewer wants to hear Kafka plus Flink (or Kafka Streams), not a batch solution with 5-minute intervals. Approach hint: Kafka ingestion, Flink for real-time feature computation (rolling transaction count, velocity checks), ML model scoring in the stream, output to a fast lookup store (Redis) and alert service. Discuss exactly-once semantics and what happens when the model service is down.
You need to migrate a 20TB data warehouse from on-prem Oracle to Snowflake with zero downtime. Design the migration plan.
What the interviewer is testing: Migration planning and dual-write patterns. The interviewer wants a phased approach, not a big-bang cutover. Approach hint: Phase 1: historical backfill using bulk export/import. Phase 2: CDC (Change Data Capture) to stream ongoing changes to both Oracle and Snowflake. Phase 3: validation (row counts, checksums, query result comparison). Phase 4: cutover with rollback plan. Discuss how to handle in-flight transactions during cutover.
Design a pipeline that ingests data from 50 different SaaS APIs with different rate limits, schemas, and authentication methods.
What the interviewer is testing: Abstraction and configuration-driven design. The interviewer wants to see a connector framework, not 50 separate scripts. Approach hint: Abstract connector interface with methods: authenticate(), fetch_schema(), extract_batch(), get_state(). Configuration per source: API endpoint, auth type, rate limit, schema mapping. Orchestrate with Airflow, one DAG per source. State management for incremental extraction. Centralized error handling and monitoring.
Your daily pipeline takes 6 hours to run but needs to finish in 2 hours to meet the SLA. How do you optimize it?
What the interviewer is testing: Systematic performance debugging. The interviewer does not want you to guess. They want a methodology: profile, identify bottlenecks, fix the biggest one first. Approach hint: Profile each task's duration. Identify the top 3 bottlenecks. Common fixes: parallelize independent tasks, optimize the slowest SQL queries (missing indexes, bad join order, unnecessary full scans), increase compute resources for the bottleneck step, switch from full refresh to incremental loads. Quantify expected improvement for each fix.
Design a data quality monitoring system that catches issues before bad data reaches production dashboards.
What the interviewer is testing: Proactive data quality thinking. The interviewer wants to see specific checks (not just 'we test the data'), alerting logic, and what happens when a check fails. Approach hint: Checks: row count within expected range, NULL rate per column below threshold, referential integrity (no orphan foreign keys), freshness (data arrived within SLA), distribution drift (sudden spikes or drops). Run checks after each pipeline stage. On failure: block downstream processing, alert on-call, log to a data quality dashboard.
Compare Airflow, Dagster, and Prefect for orchestrating a data platform with 200 DAGs.
What the interviewer is testing: Whether you have real opinions based on experience, not just what you read in a blog post. The interviewer wants tradeoffs, not a feature checklist. Approach hint: Airflow: mature ecosystem, huge community, but DAG serialization and scheduler performance degrade at scale. Dagster: asset-centric model is great for data-aware orchestration, but smaller community. Prefect: Pythonic API, good for dynamic workflows, but less battle-tested at 200+ DAG scale. Mention what you have actually used and what problems you hit.
Design a pipeline that processes 1 billion events per day from IoT sensors, with late-arriving data up to 72 hours after the event.
What the interviewer is testing: Late data handling and partition management. The 72-hour late arrival window means you cannot close partitions after one day. The interviewer wants to see watermarking or partition reprocessing strategies. Approach hint: Partition by event_time, not ingestion_time. Keep partitions open for reprocessing for 72 hours. Use watermarks in the streaming layer. For the batch layer, reprocess the last 3 days on every run (idempotent overwrites). Discuss the storage cost of reprocessing vs the correctness benefit.
How would you implement a feature store that serves both batch training and real-time inference?
What the interviewer is testing: Dual-serving architecture. The interviewer wants to see an offline store (data lake/warehouse for batch) and an online store (Redis/DynamoDB for real-time) with a shared feature registry. Approach hint: Batch pipeline computes features and writes to both the offline store (Parquet/Delta) and the online store (Redis). Feature registry tracks feature definitions, versions, and SLAs. Point-in-time correctness for training: join features as of the label timestamp, not the current value. Mention training-serving skew as the biggest risk.
Spark mock interview questions
10 questions on memory debugging, join optimization, SCD2 merges, streaming with exactly-once, and the Catalyst optimizer.
You have a Spark job that runs out of memory on the executor nodes. Walk me through your debugging process.
What the interviewer is testing: Systematic Spark debugging, not guesswork. The interviewer wants to see you check the Spark UI, identify the failing stage, and reason about data skew vs insufficient memory. Approach hint: Check the Spark UI for the failing stage. Look at task durations: if most tasks finish in 10 seconds but a few take 30 minutes, you have skew. If all tasks are slow, you need more memory or fewer partitions. Common fixes: salting skewed keys, increasing spark.sql.shuffle.partitions, using broadcast joins for small tables.
Explain the difference between repartition() and coalesce() and when you would use each.
What the interviewer is testing: Shuffle awareness. coalesce() avoids a full shuffle by combining partitions on existing executors. repartition() triggers a full shuffle. The interviewer wants to know if you understand the performance implications. Approach hint: Use coalesce(N) when reducing partitions (e.g., before writing to fewer output files). Use repartition(N) when increasing partitions or when you need even distribution (e.g., before a join on a skewed key). Never coalesce to a much larger number, it won't work because coalesce can only decrease.
Write a PySpark job that performs a slowly changing dimension Type 2 merge on a Delta Lake table.
What the interviewer is testing: Delta Lake MERGE syntax and SCD2 logic. The interviewer wants to see you handle inserts (new records), updates (changed records with history preservation), and unchanged records. Approach hint: Use Delta Lake's MERGE INTO with a WHEN MATCHED AND changed THEN UPDATE (set end_date, is_current = false), plus a WHEN NOT MATCHED THEN INSERT. After the merge, insert new current rows for the changed records. Mention that this is a two-step operation: expire old rows, then insert new current rows.
A broadcast join is failing because the broadcast table is 12GB. How do you handle this?
What the interviewer is testing: Join optimization strategies beyond broadcast. The default broadcast threshold is 10MB. The interviewer wants to see bucket joins, pre-filtering the large table, or a different approach entirely. Approach hint: Options: (1) filter the broadcast table to reduce its size below threshold, (2) use bucket joins if both tables are bucketed on the join key, (3) increase spark.sql.autoBroadcastJoinThreshold (risky with 12GB), (4) pre-aggregate the smaller table. Discuss memory implications of each option.
Design a PySpark pipeline that processes semi-structured JSON with deeply nested arrays and outputs a flattened, analytics-ready table.
What the interviewer is testing: Schema inference and explode/posexplode. The interviewer wants to see you handle arrays within arrays, NULL fields, and schema evolution across files. Approach hint: Use spark.read.json with schema inference. Use explode() to flatten top-level arrays, then select nested fields with dot notation. For deeply nested arrays, chain multiple explode calls. Discuss schema evolution: use mergeSchema option and handle missing fields with coalesce() or when().
Explain how Spark handles data skew in a shuffle join and what you would do about it.
What the interviewer is testing: Deep understanding of Spark internals. Data skew means one partition has disproportionately more data, causing one task to run for hours while others finish in seconds. Approach hint: Approach 1: Salt the skewed key (append a random number 0 to N, replicate the other table N times, join, then remove the salt). Approach 2: Adaptive Query Execution (AQE) in Spark 3.x automatically handles skew. Approach 3: Isolate the skewed key, broadcast-join it separately, union with the normal join result.
Write a Spark Structured Streaming job that reads from Kafka, applies a tumbling window aggregation, and writes to Delta Lake with exactly-once semantics.
What the interviewer is testing: End-to-end streaming architecture. The interviewer wants to see watermarking, output mode selection, checkpoint configuration, and trigger interval choices. Approach hint: readStream from Kafka, withWatermark on event_time, groupBy(window(event_time, '5 minutes')), aggregate. writeStream with outputMode('append') (not complete, because append drops late data after the watermark). Set checkpointLocation for exactly-once. Discuss trigger options: processingTime vs availableNow.
How would you unit test a PySpark transformation function?
What the interviewer is testing: Testing discipline. Many Spark developers don't test their transformations. The interviewer wants to see you create a SparkSession in the test, build small input DataFrames, run the function, and assert on the output. Approach hint: Use pytest with a session-scoped SparkSession fixture. Create input DataFrames with spark.createDataFrame(). Call your transformation function. Assert on output.collect() results. Test edge cases: empty DataFrames, NULLs, single-row inputs. Mention chispa or spark-testing-base libraries for DataFrame comparison.
Your Spark job writes 10,000 small files to S3. What is the problem, and how do you fix it?
What the interviewer is testing: The small file problem. Many small files hurt read performance (too many S3 LIST calls, too many tasks to open files). The interviewer wants practical solutions. Approach hint: Coalesce before writing to reduce output partitions. Use repartition(N) where N targets files of 128MB to 256MB each. For Delta Lake, run OPTIMIZE to compact small files after writing. Discuss why this matters: downstream jobs reading 10,000 files spend more time on I/O overhead than actual computation.
Explain the Catalyst optimizer and how it affects the execution of a Spark SQL query.
What the interviewer is testing: Whether you understand what happens between writing a query and running it. The interviewer wants to hear about logical plan, physical plan, predicate pushdown, and column pruning. Approach hint: Catalyst stages: parsing (SQL to unresolved logical plan), analysis (resolve names), optimization (predicate pushdown, column pruning, constant folding, join reordering), physical planning (choose join strategy, scan method). Use explain(true) to see all stages. Mention that Tungsten handles the code generation for the final execution.
Common questions
How many mock interview questions should I practice before a real interview?+
What difficulty level should I focus on for mock interview questions?+
Do these match what companies actually ask?+
Should I practice mock interview questions with a timer?+
How do mock interview questions differ from practice problems?+
Run a live mock against the same questions
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition