Mock Interview Question Bank

50 Data Engineer Mock Interview Questions by Domain

Every question below comes from verified interview rounds at real companies. We include the difficulty level, which companies ask similar questions, what the interviewer is actually evaluating, and a concise approach hint. These 50 questions cover all 5 data engineering domains: SQL, Python, Data Modeling, Pipeline Architecture, and Spark. DataDriven has 1,000+ more with full solutions, real code execution, and AI grading.

50 questions on this page · 5 domains covered · 1,000+ total on DataDriven · 275 companies represented

How to Use This Mock Interview Question Bank

Each question below includes four pieces of information that most question lists skip. First, the companies field tells you which companies have asked similar questions in real interviews. Second, the difficulty tag (Easy, Medium, Hard) matches the grading scale used by FAANG interviewers. Third, the what the interviewer is really testing field reveals the hidden evaluation criteria behind the question. Most candidates answer the surface-level question without addressing the deeper skill the interviewer is probing. Fourth, the approach hint gives you a starting direction without spoiling the full solution.

Don't read these passively. For each question, spend 5 minutes thinking about your approach before reading the hint. Then open DataDriven and solve it with real code execution. The AI grader will tell you what you missed.

The distribution across domains mirrors real interview frequency. SQL gets the most questions (41% of real DE interviews are SQL). Spark gets a dedicated section because senior and Spark-specific roles test it heavily, even though it appears in a smaller percentage of overall interview loops.

41% of all DE interview questions

SQL Mock Interview Questions

10 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.

Q1

Given an orders table and a customers table, find customers who placed orders in every month of 2024.

Hard · Meta, Stripe, Airbnb

What the interviewer is really testing

Whether you can think in sets rather than loops. The interviewer wants to see if you reach for HAVING COUNT(DISTINCT month) = 12 or if you try to write 12 separate conditions.

Approach hint

Extract the month from order_date with DATE_TRUNC, COUNT DISTINCT months per customer, filter with HAVING. A CTE keeps it readable.
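The hint above can be sketched end to end. This runs in SQLite, so strftime stands in for DATE_TRUNC; the toy data gives customer 1 an order in every month of 2024 and customer 2 only one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT)")
for m in range(1, 13):  # customer 1 orders in every month of 2024
    conn.execute("INSERT INTO orders VALUES (1, ?)", (f"2024-{m:02d}-15",))
conn.execute("INSERT INTO orders VALUES (2, '2024-01-20')")  # customer 2 does not

query = """
WITH monthly AS (
    SELECT customer_id, strftime('%m', order_date) AS month
    FROM orders
    WHERE strftime('%Y', order_date) = '2024'
)
SELECT customer_id
FROM monthly
GROUP BY customer_id
HAVING COUNT(DISTINCT month) = 12;
"""
every_month = [row[0] for row in conn.execute(query)]
```

Only customer 1 survives the HAVING filter. In an interview, mention that a join back to customers would fetch names once the IDs are found.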

Q2

Write a query to find the median session duration from a sessions table with millions of rows.

Hard · Google, Netflix, Spotify

What the interviewer is really testing

Your awareness of percentile functions and performance. Many candidates try to sort all rows and pick the middle one, which is O(n log n). The interviewer wants to see PERCENTILE_CONT or a ROW_NUMBER approach.

Approach hint

Use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration) if the database supports it. Otherwise, ROW_NUMBER with a total count and filter for the middle row(s). Mention that exact median on billions of rows may require approximate methods.

Q3

Deduplicate a table with no primary key, keeping only the row with the latest updated_at per composite key (user_id, product_id).

Medium · Amazon, Uber, DoorDash

What the interviewer is really testing

ROW_NUMBER partitioning. This is the single most common SQL pattern in DE interviews. The interviewer is checking if you can write it from memory in under 3 minutes.

Approach hint

ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY updated_at DESC) in a CTE. Filter WHERE rn = 1. Mention that in production you would use this inside a MERGE or DELETE statement.
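A minimal runnable version of the pattern, again in SQLite (window functions since 3.25); the status column is an assumption added so the kept row is visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INT, product_id INT, updated_at TEXT, status TEXT);
INSERT INTO events VALUES
    (1, 10, '2024-01-01', 'old'),
    (1, 10, '2024-03-01', 'new'),
    (2, 20, '2024-02-01', 'only');
""")
query = """
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, product_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM events
)
SELECT user_id, product_id, updated_at, status
FROM ranked
WHERE rn = 1;
"""
rows = conn.execute(query).fetchall()
```

Each composite key keeps only its latest row: (1, 10) resolves to the '2024-03-01' record.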

Q4

Calculate the running total of daily revenue, resetting at the start of each month.

Medium · Stripe, Square, Shopify

What the interviewer is really testing

Window function framing with PARTITION BY. The reset-per-month requirement catches candidates who forget to partition by the month.

Approach hint

SUM(revenue) OVER (PARTITION BY DATE_TRUNC('month', day) ORDER BY day ROWS UNBOUNDED PRECEDING). Walk through why ROWS matters here instead of RANGE.
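In SQLite, which lacks DATE_TRUNC, the same framing looks like this (daily_revenue and its columns are assumed names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT, revenue INTEGER);
INSERT INTO daily_revenue VALUES
    ('2024-01-01', 100), ('2024-01-02', 50), ('2024-02-01', 30);
""")
query = """
SELECT day,
       SUM(revenue) OVER (
           PARTITION BY strftime('%Y-%m', day)  -- reset at each month boundary
           ORDER BY day
           ROWS UNBOUNDED PRECEDING
       ) AS running_total
FROM daily_revenue
ORDER BY day;
"""
rows = conn.execute(query).fetchall()
```

The February row starts over at 30 because the month partition changes.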

Q5

Given a user_logins table, find users whose login count dropped by more than 50% from one week to the next.

Medium · Meta, LinkedIn, Pinterest

What the interviewer is really testing

LAG with aggregation. You need to aggregate first (weekly counts), then apply the window function. Many candidates try to do both in a single query and get confused.

Approach hint

CTE 1: aggregate to weekly login counts per user. CTE 2: LAG(login_count) OVER (PARTITION BY user_id ORDER BY week). Final query: filter where current < 0.5 * previous.
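A sketch of that two-CTE structure in SQLite, bucketing weeks with strftime('%Y-%W') (user_logins and login_at are assumed names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logins (user_id INT, login_at TEXT)")
# User 1: 4 logins in the first week of 2024, then only 1 the next week.
for d in ("01", "02", "03", "04"):
    conn.execute("INSERT INTO user_logins VALUES (1, ?)", (f"2024-01-{d}",))
conn.execute("INSERT INTO user_logins VALUES (1, '2024-01-08')")

query = """
WITH weekly AS (
    SELECT user_id, strftime('%Y-%W', login_at) AS week, COUNT(*) AS logins
    FROM user_logins
    GROUP BY user_id, week
),
with_prev AS (
    SELECT *, LAG(logins) OVER (PARTITION BY user_id ORDER BY week) AS prev_logins
    FROM weekly
)
SELECT user_id, week, logins, prev_logins
FROM with_prev
WHERE prev_logins IS NOT NULL AND logins < 0.5 * prev_logins;
"""
rows = conn.execute(query).fetchall()
```

User 1 drops from 4 logins to 1 and is flagged; the IS NOT NULL guard skips each user's first week.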

Q6

Write a recursive CTE to build a bill of materials (BOM) explosion, showing all components of a finished product and their total quantities.

Hard · Apple, Tesla, Amazon

What the interviewer is really testing

Recursive CTEs with accumulation. The quantity multiplication across levels separates strong candidates from average ones.

Approach hint

Base case: direct components of the root product with their quantities. Recursive step: join to the BOM table and multiply parent quantity by child quantity. Sum at the end to get total quantities per leaf component.

Q7

Find the first purchase date for each customer and the product they bought on that date. Handle ties (same customer, same date, multiple products).

Easy · Walmart, Target, Instacart

What the interviewer is really testing

JOIN logic and how you handle ties. The interviewer wants to see if you address the ambiguity: do you pick one product arbitrarily (ROW_NUMBER) or return all products (RANK)?

Approach hint

Ask the interviewer how to handle ties before writing code. Then use RANK or ROW_NUMBER accordingly, partitioned by customer_id and ordered by purchase_date.

Q8

Write a query to detect gaps in a sequence of invoice numbers for a given vendor.

Medium · Intuit, SAP, Oracle

What the interviewer is really testing

LEAD/LAG to compare adjacent rows. This is a classic gap detection problem that shows up in financial data engineering interviews.

Approach hint

Order by invoice_number, use LEAD(invoice_number) OVER (ORDER BY invoice_number). Where LEAD minus current is greater than 1, you have a gap. Output the start and end of each gap.
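A runnable sketch in SQLite (invoices, vendor_id, and invoice_number are assumed names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (vendor_id INT, invoice_number INT);
INSERT INTO invoices VALUES (42, 100), (42, 101), (42, 104), (42, 105), (42, 109);
""")
query = """
WITH with_next AS (
    SELECT invoice_number,
           LEAD(invoice_number) OVER (ORDER BY invoice_number) AS next_number
    FROM invoices
    WHERE vendor_id = 42
)
SELECT invoice_number + 1 AS gap_start,
       next_number - 1 AS gap_end
FROM with_next
WHERE next_number - invoice_number > 1;
"""
rows = conn.execute(query).fetchall()
```

The missing runs 102-103 and 106-108 come back as (start, end) pairs; the last row's NULL LEAD drops out of the comparison automatically.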

Q9

Given a table of page views with user_id and timestamp, assign session IDs using a 30-minute inactivity timeout.

Hard · Google Analytics, Amplitude, Snowflake

What the interviewer is really testing

Sessionization is one of the most practical DE SQL problems. The interviewer is checking whether you can use LAG to detect gaps, then SUM a boolean flag to create session IDs.

Approach hint

LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp). Flag rows where the gap exceeds 30 minutes. SUM the flags as a running total to create session IDs.
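A sketch of the flag-then-running-sum trick in SQLite, using integer epoch seconds for the timestamps to keep the 30-minute arithmetic simple (page_views and its columns are assumed names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views (user_id INT, ts INT);
INSERT INTO page_views VALUES (1, 0), (1, 600), (1, 3000), (2, 100);
""")
query = """
WITH with_gap AS (
    SELECT user_id, ts,
           CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800
                THEN 1 ELSE 0 END AS new_session
    FROM page_views
)
SELECT user_id, ts,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM with_gap
ORDER BY user_id, ts;
"""
rows = conn.execute(query).fetchall()
```

User 1's third view arrives 2,400 seconds after the second, so the flag flips and the running sum starts session 1; each user's first row gets session 0 because LAG is NULL there.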

Q10

Pivot a table of (student_id, subject, score) into columns: one column per subject.

Easy · Microsoft, Salesforce, Databricks

What the interviewer is really testing

Conditional aggregation. If you reach for PIVOT syntax, the interviewer may ask you to do it with CASE WHEN instead, since not all databases support PIVOT.

Approach hint

MAX(CASE WHEN subject = 'Math' THEN score END) AS math_score. Group by student_id. Mention that this approach works across all SQL dialects.
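The conditional-aggregation version, runnable in SQLite (scores and the two subject names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (student_id INT, subject TEXT, score INT);
INSERT INTO scores VALUES (1, 'Math', 90), (1, 'Science', 80), (2, 'Math', 70);
""")
query = """
SELECT student_id,
       MAX(CASE WHEN subject = 'Math'    THEN score END) AS math_score,
       MAX(CASE WHEN subject = 'Science' THEN score END) AS science_score
FROM scores
GROUP BY student_id
ORDER BY student_id;
"""
rows = conn.execute(query).fetchall()
```

Student 2 has no Science row, so that cell comes back NULL rather than erroring, which is exactly the behavior a pivot needs.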

35% of all DE interview questions

Python Mock Interview Questions

10 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.

Q1

Write a function that reads a 50GB CSV file and returns the top 10 rows by a numeric column without loading the entire file into memory.

Medium · Netflix, Spotify, Lyft

What the interviewer is really testing

Memory-efficient processing with generators or heapq. The interviewer is filtering for candidates who understand that pandas.read_csv on 50GB will crash.

Approach hint

Use a min-heap of size 10 (heapq.nsmallest or nlargest). Read the file line by line or in chunks with csv.reader. Never call pd.read_csv() without chunksize on a 50GB file.
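A sketch of the heap approach; io.StringIO stands in for the 50GB file, and the amount column name is an assumption:

```python
import csv
import heapq
import io

def top_n_rows(fileobj, column, n=10):
    """Stream rows and keep only the top n by a numeric column.
    heapq.nlargest holds a heap of size n, so memory stays O(n)."""
    reader = csv.DictReader(fileobj)
    return heapq.nlargest(n, reader, key=lambda row: float(row[column]))

# Toy input standing in for the huge file.
data = io.StringIO("id,amount\n1,5.0\n2,9.5\n3,1.2\n4,7.7\n")
top = top_n_rows(data, "amount", n=2)
```

The reader is consumed lazily, one line at a time; no row outside the current top n is retained.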

Q2

Implement a function that validates incoming records against a schema and routes invalid records to a dead letter queue.

Medium · Uber, DoorDash, Instacart

What the interviewer is really testing

Error handling patterns in data pipelines. The interviewer wants to see try/except with specific exceptions, not a bare except that swallows everything.

Approach hint

Define expected types and required fields. Iterate records, validate each field, collect errors per record. Valid records go to output, invalid records go to a DLQ list with the error message attached. Return both lists.

Q3

Write a decorator that retries a function up to 3 times with exponential backoff and jitter.

Medium · Amazon, Google, Stripe

What the interviewer is really testing

Decorator syntax and production patterns. Jitter is the key detail that separates textbook answers from real-world experience.

Approach hint

Inner function catches exceptions, sleeps for (2 ** attempt) + random.uniform(0, 1) seconds, retries. After max retries, re-raise the last exception. Mention functools.wraps to preserve the original function metadata.
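A sketch of the decorator; the backoff and jitter knobs are parameterized here (an assumption, to keep the demo fast), and production code would usually catch a narrower exception type than Exception:

```python
import functools
import random
import time

def retry(max_attempts=3, backoff=1.0, jitter=1.0):
    """Retry with exponential backoff plus jitter."""
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's metadata
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of retries: re-raise the last exception
                    time.sleep(backoff * 2 ** attempt + random.uniform(0, jitter))
        return wrapper
    return decorator

calls = []

@retry(max_attempts=3, backoff=0.01, jitter=0.01)  # tiny delays for the demo
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("transient failure")
    return "ok"

result = flaky()
```

The function fails twice, succeeds on the third attempt, and thanks to functools.wraps still reports its own name to logging and debugging tools.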

Q4

Given a list of dictionaries with nested structures of varying depth, flatten each dictionary into a single-level dictionary with dot-separated keys.

Easy · Airbnb, Stripe, Plaid

What the interviewer is really testing

Recursion and string manipulation. This is one of the most common Python DE questions. The interviewer may add follow-up constraints: handle lists, handle None values, limit depth.

Approach hint

Recursive function that takes a dictionary and a prefix. For each key, if the value is a dict, recurse with prefix + key + '.'. Otherwise, add to the result. Handle edge cases: empty dicts, None values.
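A recursion sketch covering the edge cases the hint mentions (empty dicts and None values pass through as leaves):

```python
def flatten(d, prefix=""):
    """Flatten nested dicts into a single level with dot-separated keys."""
    out = {}
    for key, value in d.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            out.update(flatten(value, prefix=full_key + "."))
        else:
            out[full_key] = value  # leaves, None, and empty dicts kept as-is
    return out

record = {"user": {"id": 7, "address": {"city": "Oslo"}}, "active": True}
flat = flatten(record)
```

A natural follow-up is handling lists, for example by treating each index as a key segment.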

Q5

Implement a producer-consumer pipeline using Python's queue.Queue and threading.

Hard · LinkedIn, Pinterest, Confluent

What the interviewer is really testing

Concurrency fundamentals. The interviewer wants to see proper use of Queue (thread-safe), a poison pill or sentinel to signal shutdown, and clean thread cleanup.

Approach hint

Producer thread reads data and calls queue.put(). Consumer thread loops on queue.get() until it receives a sentinel value. Use threading.Thread with daemon=True. Put the sentinel after all data is produced.
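A minimal runnable sketch of that shape; the doubling step is a stand-in for real processing:

```python
import queue
import threading

SENTINEL = object()  # poison pill signalling shutdown

def producer(q, items):
    for item in items:
        q.put(item)
    q.put(SENTINEL)  # tell the consumer no more data is coming

def consumer(q, results):
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * 2)  # stand-in for real work

q = queue.Queue(maxsize=4)  # bounded queue applies backpressure to the producer
results = []
t_prod = threading.Thread(target=producer, args=(q, range(5)))
t_cons = threading.Thread(target=consumer, args=(q, results))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
```

Because queue.Queue is thread-safe, no extra locking is needed, and the sentinel gives the consumer a clean exit instead of a daemon-thread kill.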

Q6

Write a function that computes a hash-based checksum for a DataFrame to detect when source data has changed between pipeline runs.

Medium · Databricks, dbt Labs, Fivetran

What the interviewer is really testing

Data integrity and change detection. This is a practical pipeline problem, not an algorithm puzzle. The interviewer wants to see that you think about column ordering, NULL handling, and type consistency.

Approach hint

Sort columns alphabetically, convert each row to a canonical string representation, hash with hashlib.sha256. Combine row hashes into a single checksum. Mention that in production, you might hash a sample for performance.

Q7

Parse a 2GB Apache access log file and output the top 100 URLs by request count, grouped by HTTP status code.

Medium · Cloudflare, Datadog, Splunk

What the interviewer is really testing

File processing at scale and the collections module. The interviewer expects you to use Counter or defaultdict, not a hand-rolled counting loop.

Approach hint

Read line by line. Parse with a regex or split(). Use collections.defaultdict(Counter) keyed by status code. After processing, extract top 100 from each status code's Counter with most_common(100).
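A sketch of the counting structure; the three-field lines here are a stand-in for the real combined log format, which you would parse with a regex:

```python
from collections import Counter, defaultdict

lines = [
    "GET /home 200",
    "GET /home 200",
    "GET /missing 404",
    "GET /about 200",
]

counts = defaultdict(Counter)  # status code -> Counter of URLs
for line in lines:
    method, url, status = line.split()
    counts[status][url] += 1

top_200 = counts["200"].most_common(100)
```

For the 2GB case you would iterate over the open file object instead of a list, which streams one line at a time.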

Q8

Implement a simple rate limiter class that allows N requests per second using a sliding window.

Hard · Stripe, Twilio, Cloudflare

What the interviewer is really testing

Time-based data structures and the deque module. A fixed window is easy; a sliding window requires tracking individual request timestamps.

Approach hint

Use collections.deque to store timestamps. On each request, remove timestamps older than 1 second from the left. If len(deque) < N, allow the request and append the current timestamp. Otherwise, deny.
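A sketch with an injectable clock so the sliding window is easy to exercise deterministically (the `now` parameter is an assumption added for testability):

```python
import collections
import time

class SlidingWindowRateLimiter:
    """Allow at most n requests per one-second sliding window."""

    def __init__(self, n):
        self.n = n
        self.timestamps = collections.deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the 1-second window.
        while self.timestamps and now - self.timestamps[0] >= 1.0:
            self.timestamps.popleft()
        if len(self.timestamps) < self.n:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowRateLimiter(n=2)
decisions = [limiter.allow(now=t) for t in (0.0, 0.1, 0.2, 1.1)]
```

The third request is denied because two timestamps already sit inside the window; by t=1.1 both have aged out and the request is allowed again.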

Q9

Write a function that merges two sorted iterators into a single sorted iterator without materializing either one into a list.

Medium · Google, Meta, Apple

What the interviewer is really testing

Iterator protocol and lazy evaluation. The constraint against materializing is the key part. The interviewer wants a generator function.

Approach hint

Use heapq.merge, or implement manually: peek at the head of each iterator, yield the smaller one, advance that iterator. Handle exhaustion of one iterator by yielding the rest of the other.
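A manual generator version under the hint's constraints (heapq.merge is the one-line alternative); the sentinel avoids treating None values in the data as exhaustion:

```python
_DONE = object()  # sentinel marking an exhausted iterator

def merge_sorted(it_a, it_b):
    """Lazily merge two sorted iterators without materializing either."""
    it_a, it_b = iter(it_a), iter(it_b)
    a, b = next(it_a, _DONE), next(it_b, _DONE)
    while a is not _DONE and b is not _DONE:
        if a <= b:
            yield a
            a = next(it_a, _DONE)
        else:
            yield b
            b = next(it_b, _DONE)
    # One side is exhausted; drain the other, still lazily.
    while a is not _DONE:
        yield a
        a = next(it_a, _DONE)
    while b is not _DONE:
        yield b
        b = next(it_b, _DONE)

merged = list(merge_sorted(iter([1, 4, 6]), iter([2, 3, 7])))
```

Nothing is buffered beyond the two head elements, so this works on unbounded streams.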

Q10

Given an event stream as a list of (timestamp, event_type, user_id) tuples, compute the conversion rate from 'view' to 'purchase' within a 24-hour attribution window.

Hard · Meta, Pinterest, Snapchat

What the interviewer is really testing

Time-window joins in Python. This is the Python equivalent of an interval join. The interviewer wants to see efficient lookup (sorted + bisect or a dict of sorted timestamps) rather than O(n^2) nested loops.

Approach hint

Group events by user. For each user, sort views and purchases by timestamp. For each view, binary search purchases within 24 hours. Count attributed conversions. Conversion rate = attributed views / total views.
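A sketch of the grouped binary-search approach; the attribution rule assumed here is "first purchase at or after the view, within 24 hours":

```python
import bisect
from collections import defaultdict

DAY = 24 * 3600  # attribution window in seconds

def conversion_rate(events):
    """events: (timestamp, event_type, user_id) tuples."""
    views, purchases = defaultdict(list), defaultdict(list)
    for ts, kind, user in events:
        (views if kind == "view" else purchases)[user].append(ts)
    total = attributed = 0
    for user, view_ts in views.items():
        purchase_ts = sorted(purchases.get(user, []))
        for v in sorted(view_ts):
            total += 1
            # Binary search for the first purchase at or after this view.
            i = bisect.bisect_left(purchase_ts, v)
            if i < len(purchase_ts) and purchase_ts[i] - v <= DAY:
                attributed += 1
    return attributed / total if total else 0.0

events = [
    (0, "view", 1), (3600, "purchase", 1),      # converts within the window
    (0, "view", 2), (3 * DAY, "purchase", 2),   # purchase outside the window
    (0, "view", 3),                             # never purchases
]
rate = conversion_rate(events)
```

One of three views converts, so the rate is 1/3; sorting plus bisect keeps the per-user cost at O(v log p) instead of the O(n^2) nested loop.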

18% of all DE interview questions

Data Modeling Mock Interview Questions

8 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.

Q1

Design a data model for a food delivery platform (restaurants, menus, orders, deliveries, ratings).

Medium · DoorDash, Uber Eats, Grubhub

What the interviewer is really testing

Grain selection and SCD handling. The tricky part is modeling menu item prices that change over time: do you snapshot the price at order time, or reference a slowly changing dimension?

Approach hint

Fact table: order_items at one row per item per order. Store the price at time of order in the fact table (denormalize). Dimensions: dim_restaurant, dim_menu_item, dim_customer, dim_driver, dim_date. Discuss why the price snapshot belongs in the fact table.

Q2

Model a subscription SaaS product with free trials, upgrades, downgrades, and cancellations.

Hard · Salesforce, HubSpot, Databricks

What the interviewer is really testing

SCD Type 2 for subscription state transitions. The interviewer wants to see how you track the full lifecycle of a subscription: trial start, conversion, plan changes, churn.

Approach hint

SCD Type 2 on dim_subscription with effective_date, end_date, is_current. Fact table: subscription_events (event_type, old_plan, new_plan, mrr_change). This lets analysts compute MRR, churn rate, and upgrade paths.

Q3

Design a star schema for a social media platform's engagement analytics.

Medium · Meta, TikTok, LinkedIn

What the interviewer is really testing

Whether you can identify the right grain when there are multiple event types (likes, comments, shares, views). One fact table or multiple?

Approach hint

One fact table: engagement_events with event_type as a degenerate dimension. Grain: one row per engagement event. Dimensions: dim_user (who engaged), dim_content (what was engaged with), dim_content_creator, dim_date. Discuss the tradeoff of one wide fact table vs separate fact tables per event type.

Q4

Model a data pipeline lineage system that tracks which tables feed into which downstream tables.

Hard · Airbnb, Netflix, Spotify

What the interviewer is really testing

Graph modeling in a relational database. Lineage is a DAG, and the interviewer wants to see if you can model nodes and edges cleanly.

Approach hint

Tables: datasets (id, name, schema, owner) and lineage_edges (source_dataset_id, target_dataset_id, pipeline_id, last_run_at). Use a recursive CTE to traverse upstream or downstream. Discuss how to handle column-level lineage (column-to-column) vs table-level.
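The downstream traversal over that edge table can be sketched with a recursive CTE, runnable here in SQLite (the toy graph 1 -> 2 -> {3, 4} is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lineage_edges (source_dataset_id INT, target_dataset_id INT);
INSERT INTO lineage_edges VALUES (1, 2), (2, 3), (2, 4);
""")
query = """
WITH RECURSIVE downstream AS (
    SELECT target_dataset_id AS dataset_id
    FROM lineage_edges
    WHERE source_dataset_id = 1          -- base case: direct children
    UNION                                 -- UNION (not ALL) de-duplicates,
    SELECT e.target_dataset_id            -- which also guards against cycles
    FROM lineage_edges e
    JOIN downstream d ON e.source_dataset_id = d.dataset_id
)
SELECT dataset_id FROM downstream ORDER BY dataset_id;
"""
rows = [r[0] for r in conn.execute(query)]
```

Swapping source and target in the two SELECTs gives the upstream traversal.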

Q5

You have a dimension table with 500 million rows. How do you handle slowly changing dimensions at this scale?

Hard · Amazon, Walmart, Target

What the interviewer is really testing

Performance awareness in data modeling. SCD Type 2 on 500M rows requires specific strategies: partitioning, incremental processing, hash-based change detection.

Approach hint

Hash all non-key columns into a checksum. Compare incoming records by hash, not by individual columns. Process only changed records. Partition the dimension by a date range or business key range. Mention that a full table scan comparison on 500M rows is not viable in production.

Q6

Design an analytics schema for a two-sided marketplace (buyers and sellers) with transactions, disputes, and payouts.

Medium · eBay, Etsy, Mercari

What the interviewer is really testing

Role-playing dimensions. Both buyers and sellers are people, but they play different roles. The interviewer wants to see how you model this cleanly.

Approach hint

One dim_user table, referenced twice from the fact table: fact_transactions(buyer_id, seller_id, ...). Or create role-specific views. Discuss the tradeoff: a single dimension with role-playing aliases vs duplicated dimensions.

Q7

When would you choose a One Big Table (OBT) over a star schema?

Easy · dbt Labs, Snowflake, BigQuery teams

What the interviewer is really testing

Pragmatism vs theory. The interviewer wants to see that you know when denormalization wins: when the dimension cardinality is low, joins are expensive, and the consumer is a BI tool that performs better on flat tables.

Approach hint

OBT works well for small to medium datasets with low-cardinality dimensions, where the primary consumer is a BI tool. Star schema is better when you have high-cardinality dimensions, need to update dimension attributes without rewriting the fact table, or serve multiple use cases from the same model.

Q8

Model a healthcare system with patients, providers, encounters, diagnoses, and prescriptions.

Hard · Epic, Cerner, Optum

What the interviewer is really testing

Many-to-many relationships and bridge tables. A single encounter can have multiple diagnoses, and a single diagnosis can span multiple encounters. The interviewer checks if you know to use a bridge table.

Approach hint

Fact: encounters. Bridge table: encounter_diagnoses (encounter_id, diagnosis_id, is_primary). Dimensions: dim_patient, dim_provider, dim_diagnosis (ICD-10 codes), dim_date. Discuss HIPAA considerations for PII in dimension tables.

3% of all DE interview questions (senior roles)

Pipeline Architecture Mock Interview Questions

8 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.

Q1

Design a real-time fraud detection pipeline that processes credit card transactions with sub-second latency.

Hard · Stripe, Square, PayPal

What the interviewer is really testing

Stream processing architecture and latency requirements. The interviewer wants to hear Kafka plus Flink (or Kafka Streams), not a batch solution with 5-minute intervals.

Approach hint

Kafka ingestion, Flink for real-time feature computation (rolling transaction count, velocity checks), ML model scoring in the stream, output to a fast lookup store (Redis) and alert service. Discuss exactly-once semantics and what happens when the model service is down.

Q2

You need to migrate a 20TB data warehouse from on-prem Oracle to Snowflake with zero downtime. Design the migration plan.

Hard · Any enterprise company

What the interviewer is really testing

Migration planning and dual-write patterns. The interviewer wants a phased approach, not a big-bang cutover.

Approach hint

Phase 1: historical backfill using bulk export/import. Phase 2: CDC (Change Data Capture) to stream ongoing changes to both Oracle and Snowflake. Phase 3: validation (row counts, checksums, query result comparison). Phase 4: cutover with rollback plan. Discuss how to handle in-flight transactions during cutover.

Q3

Design a pipeline that ingests data from 50 different SaaS APIs with different rate limits, schemas, and authentication methods.

Medium · Fivetran, Airbyte, Stitch

What the interviewer is really testing

Abstraction and configuration-driven design. The interviewer wants to see a connector framework, not 50 separate scripts.

Approach hint

Abstract connector interface with methods: authenticate(), fetch_schema(), extract_batch(), get_state(). Configuration per source: API endpoint, auth type, rate limit, schema mapping. Orchestrate with Airflow, one DAG per source. State management for incremental extraction. Centralized error handling and monitoring.

Q4

Your daily pipeline takes 6 hours to run but needs to finish in 2 hours to meet the SLA. How do you optimize it?

Medium · Amazon, Google, Meta

What the interviewer is really testing

Systematic performance debugging. The interviewer does not want you to guess. They want a methodology: profile, identify bottlenecks, fix the biggest one first.

Approach hint

Profile each task's duration. Identify the top 3 bottlenecks. Common fixes: parallelize independent tasks, optimize the slowest SQL queries (missing indexes, bad join order, unnecessary full scans), increase compute resources for the bottleneck step, switch from full refresh to incremental loads. Quantify expected improvement for each fix.

Q5

Design a data quality monitoring system that catches issues before bad data reaches production dashboards.

Medium · Netflix, Airbnb, Uber

What the interviewer is really testing

Proactive data quality thinking. The interviewer wants to see specific checks (not just 'we test the data'), alerting logic, and what happens when a check fails.

Approach hint

Checks: row count within expected range, NULL rate per column below threshold, referential integrity (no orphan foreign keys), freshness (data arrived within SLA), distribution drift (sudden spikes or drops). Run checks after each pipeline stage. On failure: block downstream processing, alert on-call, log to a data quality dashboard.

Q6

Compare Airflow, Dagster, and Prefect for orchestrating a data platform with 200 DAGs.

Medium · Any mid-to-large data team

What the interviewer is really testing

Whether you have real opinions based on experience, not just what you read in a blog post. The interviewer wants tradeoffs, not a feature checklist.

Approach hint

Airflow: mature ecosystem, huge community, but DAG serialization and scheduler performance degrade at scale. Dagster: asset-centric model is great for data-aware orchestration, but smaller community. Prefect: Pythonic API, good for dynamic workflows, but less battle-tested at 200+ DAG scale. Mention what you have actually used and what problems you hit.

Q7

Design a pipeline that processes 1 billion events per day from IoT sensors, with late-arriving data up to 72 hours after the event.

Hard · Tesla, Nest, Siemens

What the interviewer is really testing

Late data handling and partition management. The 72-hour late-arrival window means you cannot close partitions after one day. The interviewer wants to see watermarking or partition reprocessing strategies.

Approach hint

Partition by event_time, not ingestion_time. Keep partitions open for reprocessing for 72 hours. Use watermarks in the streaming layer. For the batch layer, reprocess the last 3 days on every run (idempotent overwrites). Discuss the storage cost of reprocessing vs the correctness benefit.

Q8

How would you implement a feature store that serves both batch training and real-time inference?

Hard · Uber, Lyft, DoorDash

What the interviewer is really testing

Dual-serving architecture. The interviewer wants to see an offline store (data lake/warehouse for batch) and an online store (Redis/DynamoDB for real-time) with a shared feature registry.

Approach hint

Batch pipeline computes features and writes to both the offline store (Parquet/Delta) and the online store (Redis). Feature registry tracks feature definitions, versions, and SLAs. Point-in-time correctness for training: join features as of the label timestamp, not the current value. Mention training-serving skew as the biggest risk.

Appears in senior and Spark-specific roles

Spark Mock Interview Questions

10 questions. Sorted by difficulty within each domain. Every question links to a solvable problem on DataDriven with full AI grading.

Q1

You have a Spark job that runs out of memory on the executor nodes. Walk me through your debugging process.

Medium · Databricks, Netflix, Uber

What the interviewer is really testing

Systematic Spark debugging, not guesswork. The interviewer wants to see you check the Spark UI, identify the failing stage, and reason about data skew vs insufficient memory.

Approach hint

Check the Spark UI for the failing stage. Look at task durations: if most tasks finish in 10 seconds but a few take 30 minutes, you have skew. If all tasks are slow, you need more memory or fewer partitions. Common fixes: salting skewed keys, increasing spark.sql.shuffle.partitions, using broadcast joins for small tables.

Q2

Explain the difference between repartition() and coalesce() and when you would use each.

Easy · Any Spark interview

What the interviewer is really testing

Shuffle awareness. coalesce() avoids a full shuffle by combining partitions on existing executors. repartition() triggers a full shuffle. The interviewer wants to know if you understand the performance implications.

Approach hint

Use coalesce(N) when reducing partitions (e.g., before writing to fewer output files). Use repartition(N) when increasing partitions or when you need even distribution (e.g., before a join on a skewed key). Note that coalesce() cannot increase the partition count: asking for more partitions than currently exist has no effect.

Q3

Write a PySpark job that performs a slowly changing dimension Type 2 merge on a Delta Lake table.

Hard · Databricks, Amazon, Microsoft

What the interviewer is really testing

Delta Lake MERGE syntax and SCD2 logic. The interviewer wants to see you handle inserts (new records), updates (changed records with history preservation), and unchanged records.

Approach hint

Use Delta Lake's MERGE INTO with a WHEN MATCHED AND changed THEN UPDATE (set end_date, is_current = false), plus a WHEN NOT MATCHED THEN INSERT. After the merge, insert new current rows for the changed records. Mention that this is a two-step operation: expire old rows, then insert new current rows.

Q4

A broadcast join is failing because the broadcast table is 12GB. How do you handle this?

Medium · Netflix, Uber, Lyft

What the interviewer is really testing

Join optimization strategies beyond broadcast. The default broadcast threshold is 10MB. The interviewer wants to see bucket joins, pre-filtering the large table, or a different approach entirely.

Approach hint

Options: (1) filter the broadcast table to reduce its size below threshold, (2) use bucket joins if both tables are bucketed on the join key, (3) increase spark.sql.autoBroadcastJoinThreshold (risky with 12GB), (4) pre-aggregate the smaller table. Discuss memory implications of each option.

Q5

Design a PySpark pipeline that processes semi-structured JSON with deeply nested arrays and outputs a flattened, analytics-ready table.

Medium · Airbnb, Stripe, Snowflake

What the interviewer is really testing

Schema inference and explode/posexplode. The interviewer wants to see you handle arrays within arrays, NULL fields, and schema evolution across files.

Approach hint

Use spark.read.json with schema inference. Use explode() to flatten top-level arrays, then select nested fields with dot notation. For deeply nested arrays, chain multiple explode calls. Discuss schema evolution: use mergeSchema option and handle missing fields with coalesce() or when().

Q6

Explain how Spark handles data skew in a shuffle join and what you would do about it.

Hard · Databricks, Google, Meta

What the interviewer is really testing

Deep understanding of Spark internals. Data skew means one partition has disproportionately more data, causing one task to run for hours while others finish in seconds.

Approach hint

Approach 1: Salt the skewed key (append a random number 0 to N, replicate the other table N times, join, then remove the salt). Approach 2: Adaptive Query Execution (AQE) in Spark 3.x automatically handles skew. Approach 3: Isolate the skewed key, broadcast-join it separately, union with the normal join result.

Q7

Write a Spark Structured Streaming job that reads from Kafka, applies a tumbling window aggregation, and writes to Delta Lake with exactly-once semantics.

Hard · Confluent, Databricks, Uber

What the interviewer is really testing

End-to-end streaming architecture. The interviewer wants to see watermarking, output mode selection, checkpoint configuration, and trigger interval choices.

Approach hint

readStream from Kafka, withWatermark on event_time, groupBy(window(event_time, '5 minutes')), aggregate. writeStream with outputMode('append'), not complete: each window is emitted once the watermark passes its end, and events arriving later than the watermark are dropped. Set checkpointLocation for exactly-once. Discuss trigger options: processingTime vs availableNow.

Q8

How would you unit test a PySpark transformation function?

Easy · Any Spark team with mature practices

What the interviewer is really testing

Testing discipline. Many Spark developers don't test their transformations. The interviewer wants to see you create a SparkSession in the test, build small input DataFrames, run the function, and assert on the output.

Approach hint

Use pytest with a session-scoped SparkSession fixture. Create input DataFrames with spark.createDataFrame(). Call your transformation function. Assert on output.collect() results. Test edge cases: empty DataFrames, NULLs, single-row inputs. Mention chispa or spark-testing-base libraries for DataFrame comparison.

Q9

Your Spark job writes 10,000 small files to S3. What is the problem, and how do you fix it?

Medium · Amazon, Databricks, Netflix

What the interviewer is really testing

The small file problem. Many small files hurt read performance (too many S3 LIST calls, too many tasks to open files). The interviewer wants practical solutions.

Approach hint

Coalesce before writing to reduce output partitions. Use repartition(N) where N targets files of 128MB to 256MB each. For Delta Lake, run OPTIMIZE to compact small files after writing. Discuss why this matters: downstream jobs reading 10,000 files spend more time on I/O overhead than actual computation.

Q10

Explain the Catalyst optimizer and how it affects the execution of a Spark SQL query.

Medium · Databricks, Google, Microsoft

What the interviewer is really testing

Whether you understand what happens between writing a query and running it. The interviewer wants to hear about logical plan, physical plan, predicate pushdown, and column pruning.

Approach hint

Catalyst stages: parsing (SQL to unresolved logical plan), analysis (resolve names), optimization (predicate pushdown, column pruning, constant folding, join reordering), physical planning (choose join strategy, scan method). Use explain(true) to see all stages. Mention that Tungsten handles the code generation for the final execution.

Why These Mock Interview Questions Work

Most question lists on the internet fall into two traps. Either they list vague questions ("Tell me about a time you worked with data") with no technical depth, or they list hyper-specific questions ("Write a query to find the Nth highest salary") that nobody actually asks in 2026. The questions on this page avoid both traps.

They reflect real interview patterns. We analyzed 1,042 verified data engineering interview rounds across 275 companies to identify the question patterns that appear most frequently. Deduplication with ROW_NUMBER, sessionization with LAG, schema design for a two-sided marketplace: these patterns recur across companies because they test the skills that matter in production.

They include the hidden evaluation criteria. When an interviewer asks you to "deduplicate a table," they're not just testing whether you know ROW_NUMBER. They're testing whether you ask about the composite key, whether you handle ties, whether you mention that this is an idempotent pipeline pattern. The "what the interviewer is really testing" field gives you this meta-knowledge.
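As a sketch of the full dedup pattern (the table and column names are hypothetical), here is ROW_NUMBER over a composite key with an explicit tie-breaker, again runnable against Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id INT, event_ts TEXT,
                             payload TEXT, ingested_at TEXT);
    INSERT INTO raw_events VALUES
        (1, '2024-01-01', 'old',  '2024-01-01T00:00'),
        (1, '2024-01-01', 'new',  '2024-01-02T00:00'),
        (2, '2024-01-01', 'only', '2024-01-01T00:00');
""")
deduped = conn.execute("""
    SELECT user_id, event_ts, payload
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, event_ts  -- composite key: confirm it first
                   ORDER BY ingested_at DESC       -- tie-breaker: latest ingest wins
               ) AS rn
        FROM raw_events
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(deduped)  # [(1, '2024-01-01', 'new'), (2, '2024-01-01', 'only')]
```

Rerunning this query over the same input always yields the same output, which is exactly the idempotency property the interviewer is listening for.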

They cover all 5 domains. Most question lists focus on SQL only. Data engineering interviews in 2026 test five distinct domains: SQL (41%), Python (35%), Data Modeling (18%), Pipeline Architecture (3%), and Spark (for senior and specialized roles). Skipping any domain leaves a gap that interviewers will find.

They link to executable practice. Reading questions and hints is step one. Solving them in a real coding environment with AI feedback is step two. On DataDriven, every question runs against a real database (for SQL), real Python (for Python), and real PySpark (for Spark). Your code actually runs, and the AI grader checks correctness, flags edge cases you missed, and gives line-by-line feedback on your code.

Mock Interview Question Distribution by Domain

Your practice time should mirror the frequency of each domain in real interviews. Here is the breakdown from our analysis of 1,042 verified rounds, along with the recommended number of practice questions for an 8-week prep plan.

SQL
41%

Recommended: 40 to 50 questions

Python
35%

Recommended: 20 to 30 questions

Data Modeling
18%

Recommended: 10 to 15 exercises

Pipeline Architecture
3%

Recommended: 8 to 10 designs

Spark
Senior roles

Recommended: 10 to 15 questions

From Question Practice to Full Mock Interviews

Practicing individual questions builds skills. Practicing full mock interviews builds stamina, time management, and the ability to perform under pressure. Here is how to transition from question practice to mock interview practice.

Weeks 1 to 4: question practice. Solve individual questions by domain. Focus on understanding the patterns, not memorizing solutions. When you can solve Medium SQL questions in under 15 minutes and Medium Python questions in under 20 minutes, you are ready for the next phase.

Weeks 5 to 6: timed rounds. Simulate single interview rounds. Set a 45-minute timer. Pick 2 to 3 questions from a single domain and solve them sequentially, explaining your thought process out loud. Grade yourself on correctness, speed, and communication quality.

Weeks 7 to 8: full mock interviews. Run complete mock interview loops. One SQL round (45 min), one Python round (45 min), one system design round (45 min), and one behavioral round (30 min). Take 15-minute breaks between rounds, just like a real onsite. DataDriven's mock interview simulator automates this: it selects questions matching your target company and level, times each round, and gives AI feedback at the end.

Mock Interview Questions FAQ

How many mock interview questions should I practice before a real interview?
For data engineering roles, aim for 80 to 100 questions total: 40 SQL, 20 Python, 10 data modeling exercises, 8 pipeline architecture designs, and 10 Spark problems (if the role requires Spark). Quality matters more than quantity. Solve each question fully: write the code, run it, check edge cases, and explain your approach out loud. Skimming 200 questions without going deep on any of them won't help.
What difficulty level should I focus on for mock interview questions?
Start with Easy questions to build speed and confidence (days 1 to 5). Move to Medium questions for the bulk of your practice (days 6 to 25). Tackle Hard questions only after you can solve Medium problems consistently within the time limit. Most real interview questions are Medium difficulty. Hard questions appear in senior and staff-level interviews.
Do mock interview questions match what companies actually ask?
The best mock questions are based on verified interview rounds, not invented by content writers. DataDriven's question bank is built from 1,042 verified data engineering interview rounds across 275 companies. Each question includes which companies have asked similar questions, what the interviewer is really testing, and approach hints from engineers who passed those interviews.
Should I practice mock interview questions with a timer?
Yes, always. SQL and Python rounds are typically 30 to 45 minutes. Data modeling rounds are 45 to 60 minutes. If you practice without a timer, you'll develop a false sense of readiness. Start without a timer for the first few days to build understanding, then switch to timed practice. On DataDriven, the mock interview simulator includes a countdown timer that matches real interview conditions.
What is the difference between mock interview questions and practice problems?
Practice problems teach you individual skills (window functions, joins, recursion). Mock interview questions simulate the full interview experience: ambiguous problem statements, time pressure, and the expectation that you explain your reasoning while coding. DataDriven offers both: 1,000+ practice problems for skill building and a mock interview simulator that combines questions with timed rounds and AI feedback.

50 Questions Here. 1,000+ on DataDriven.

Every question includes a full solution, real code execution, and AI grading with line-by-line feedback. Practice the way the interview works.