Data Engineering Interview Prep

Data Engineer Interview Questions

Across the 1,042 verified DE interview rounds we analyzed, SQL showed up in 41% of the questions, Python in 35%, data modeling in 18%, and system design in 3%. Senior (L5) candidates faced the bulk of them: 61% of all rounds. The questions on this page follow that distribution, so your practice time tracks the real weight of what interviewers actually ask.

Run every SQL and Python example against a live Postgres or Docker sandbox on DataDriven before the real thing.

1,042

Verified DE questions

41%

SQL share of corpus

61%

L5 senior rounds

275

Companies represented

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

What Data Engineering Interviews Actually Test

Our corpus spans 275 companies across 1,042 rounds. Phone screens account for 33% of rounds, technical screens 21%, onsite SQL 12%. Here's the domain frequency breakdown from that dataset.

SQL
95% of interview loops
7 questions below
Python
65% of interview loops
5 questions below
Data Modeling
40% of interview loops
5 questions below
Pipeline Architecture
30% of interview loops
5 questions below
95% of interview loops

SQL Interview Questions

SQL appears in nearly every data engineering interview loop. Phone screens, take-home assignments, and onsite rounds all include SQL. The format is consistent: you get a schema, a business question, and 15 to 30 minutes to write a query. Interviewers evaluate correctness, edge case handling, readability, and sometimes performance. GROUP BY is the most frequently tested keyword, followed by JOINs and window functions.

Q1

Write a query to deduplicate a table, keeping only the most recent record per user.

How to approach this

Use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) in a CTE, then filter WHERE rn = 1. Mention why ROW_NUMBER is better than DISTINCT or GROUP BY for this problem: you control which duplicate to keep, and you can access all columns without re-joining. Bring up the fact that this is a daily pipeline operation, not just an interview trick.
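A minimal sketch of the pattern, run against an in-memory SQLite database via Python's sqlite3 (table and column names are illustrative; SQLite 3.25+ is assumed for window function support):

```python
import sqlite3

# In-memory SQLite database with an illustrative events table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, payload TEXT, updated_at TEXT);
INSERT INTO events VALUES
  (1, 'old',  '2024-01-01'),
  (1, 'new',  '2024-02-01'),
  (2, 'only', '2024-01-15');
""")

# Keep only the most recent row per user via ROW_NUMBER in a CTE.
rows = conn.execute("""
WITH ranked AS (
  SELECT user_id, payload, updated_at,
         ROW_NUMBER() OVER (PARTITION BY user_id
                            ORDER BY updated_at DESC) AS rn
  FROM events
)
SELECT user_id, payload FROM ranked WHERE rn = 1 ORDER BY user_id;
""").fetchall()

print(rows)  # [(1, 'new'), (2, 'only')]
```

Note that all columns of the surviving row come back without a re-join, which is exactly the advantage over DISTINCT or GROUP BY.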

Q2

Calculate month-over-month revenue growth as a percentage.

How to approach this

Aggregate to monthly grain with DATE_TRUNC and SUM, then use LAG to get previous month revenue in a CTE. Compute (current - previous) / NULLIF(previous, 0) * 100. Explain why NULLIF prevents division by zero for the first month. Interviewers will ask about the first row where LAG returns NULL.
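A runnable sketch of the same logic in SQLite via sqlite3 (SQLite has no DATE_TRUNC, so strftime stands in; table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('2024-01-05', 100), ('2024-01-20', 100),
  ('2024-02-10', 300);
""")

# Monthly grain, then LAG for the previous month; NULLIF guards divide-by-zero.
rows = conn.execute("""
WITH monthly AS (
  SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS revenue
  FROM orders GROUP BY 1
)
SELECT month,
       ROUND((revenue - LAG(revenue) OVER (ORDER BY month))
             * 100.0 / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 1)
         AS mom_pct
FROM monthly ORDER BY month;
""").fetchall()

print(rows)  # [('2024-01', None), ('2024-02', 50.0)]
```

The first month comes back NULL because LAG has no previous row, which is the edge case interviewers probe.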

Q3

Find users who were active for at least 3 consecutive days.

How to approach this

This is a gap-and-island problem. Assign ROW_NUMBER() within each user ordered by date, then subtract row_number from the date to create a grouping key. Consecutive dates produce the same grouping key. GROUP BY user and grouping key, then HAVING COUNT(*) >= 3. Walk the interviewer through why the subtraction trick works with a small example.
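A small sketch of the subtraction trick in SQLite via sqlite3 (julianday converts the date to a number so row_number can be subtracted; names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (user_id INTEGER, active_date TEXT);
INSERT INTO activity VALUES
  (1, '2024-01-01'), (1, '2024-01-02'), (1, '2024-01-03'),
  (2, '2024-01-01'), (2, '2024-01-03');
""")

# date - row_number is constant within a run of consecutive dates,
# so it works as a grouping key for each "island" of activity.
rows = conn.execute("""
WITH numbered AS (
  SELECT user_id,
         julianday(active_date)
           - ROW_NUMBER() OVER (PARTITION BY user_id
                                ORDER BY active_date) AS grp
  FROM activity
)
SELECT user_id FROM numbered
GROUP BY user_id, grp
HAVING COUNT(*) >= 3;
""").fetchall()

print(rows)  # [(1,)] -- user 2 has a gap on Jan 2, so no 3-day streak
```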

Q4

Write a self-join to find pairs of employees who share the same manager.

How to approach this

FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.manager_id AND e1.id < e2.id. The inequality condition (< rather than !=) prevents duplicate pairs (Alice-Bob and Bob-Alice). Mention that with != you still get each pair twice, and with no inequality at all you also get self-pairs.
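The self-join, runnable against SQLite via sqlite3 (schema and names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Alice', 10), (2, 'Bob', 10), (3, 'Carol', 20);
""")

# e1.id < e2.id emits each pair exactly once and excludes self-pairs.
rows = conn.execute("""
SELECT e1.name, e2.name
FROM employees e1
JOIN employees e2
  ON e1.manager_id = e2.manager_id
 AND e1.id < e2.id;
""").fetchall()

print(rows)  # [('Alice', 'Bob')] -- Carol has no peer under manager 20
```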

Q5

Use a recursive CTE to find all reports under a given manager, including indirect reports.

How to approach this

Base case: SELECT employees WHERE manager_id = target. Recursive case: JOIN employees to the CTE on manager_id. Include a depth counter. Mention the importance of a termination condition to prevent infinite loops in circular org data (add WHERE depth < 20 or track visited IDs).
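A compact sketch of the recursive CTE with a depth guard, run in SQLite via sqlite3 (org data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'CEO', NULL), (2, 'VP', 1), (3, 'IC', 2), (4, 'Other', 1);
""")

# All direct and indirect reports of employee 1, with a depth cap.
rows = conn.execute("""
WITH RECURSIVE reports AS (
  SELECT id, name, 1 AS depth FROM employees WHERE manager_id = 1
  UNION ALL
  SELECT e.id, e.name, r.depth + 1
  FROM employees e
  JOIN reports r ON e.manager_id = r.id
  WHERE r.depth < 20          -- termination guard for circular org data
)
SELECT name, depth FROM reports ORDER BY depth, name;
""").fetchall()

print(rows)  # [('Other', 1), ('VP', 1), ('IC', 2)]
```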

Q6

Calculate a 7-day rolling average of daily revenue, excluding partial windows.

How to approach this

Use AVG with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. Wrap in a CTE and add ROW_NUMBER to filter out the first 6 rows (partial windows). Explain ROWS vs RANGE: RANGE groups ties, ROWS counts physical rows. Most candidates forget about partial windows; bringing it up proactively impresses interviewers.
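A runnable sketch in SQLite via sqlite3. Here a COUNT over the same frame filters partial windows, an equivalent alternative to the ROW_NUMBER filter described above (data and names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (d TEXT, revenue REAL)")
# 8 days of revenue: 10, 20, ..., 80
conn.executemany("INSERT INTO daily VALUES (?, ?)",
                 [(f"2024-01-0{i}", i * 10.0) for i in range(1, 9)])

# 7-day frame; rows where the frame holds fewer than 7 rows are dropped.
rows = conn.execute("""
WITH windowed AS (
  SELECT d,
         AVG(revenue) OVER (ORDER BY d
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7,
         COUNT(*) OVER (ORDER BY d
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS n
  FROM daily
)
SELECT d, avg7 FROM windowed WHERE n = 7;
""").fetchall()

print(rows)  # [('2024-01-07', 40.0), ('2024-01-08', 50.0)]
```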

Q7

Find the top 3 products by revenue in each category, handling ties.

How to approach this

DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC). Filter WHERE rank <= 3. Explain why DENSE_RANK, not RANK or ROW_NUMBER: DENSE_RANK keeps all tied products and does not skip ranks, so you get a true top-3 even with ties. RANK would skip numbers, ROW_NUMBER would arbitrarily exclude tied products.
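A sketch run in SQLite via sqlite3 that makes the tie behavior visible (data is illustrative; note the tie at rank 2 means four products survive a "top 3"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (category TEXT, name TEXT, revenue REAL);
INSERT INTO products VALUES
  ('toys', 'a', 100), ('toys', 'b', 90), ('toys', 'c', 90),
  ('toys', 'd', 80),  ('toys', 'e', 70);
""")

# DENSE_RANK keeps ties and does not skip rank values.
rows = conn.execute("""
WITH ranked AS (
  SELECT category, name, revenue,
         DENSE_RANK() OVER (PARTITION BY category
                            ORDER BY revenue DESC) AS rnk
  FROM products
)
SELECT name, rnk FROM ranked WHERE rnk <= 3 ORDER BY rnk, name;
""").fetchall()

print(rows)  # [('a', 1), ('b', 2), ('c', 2), ('d', 3)]
```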

65% of interview loops

Python Interview Questions

Python appears in about two-thirds of DE interview loops, usually as a coding round or a take-home assignment. The questions focus on data manipulation, not algorithms. You will parse files, transform nested structures, implement pipeline logic, and work with pandas DataFrames. Interviewers care about clean code, edge case handling, and whether you can explain your approach while writing it.

Q1

Parse a nested JSON file and flatten it into a tabular structure.

How to approach this

Use recursion to walk the JSON tree, building column names by concatenating keys with a separator (e.g., 'address.city'). Handle arrays by either exploding them into rows or serializing them as strings. Show that you think about edge cases: empty arrays, null values, inconsistent nesting depths. Mention that in production you would use a library like pandas json_normalize, but writing the recursive version shows you understand the underlying logic.
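A minimal recursive flattener, assuming dot-separated column names and list serialization (the record shape is illustrative):

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dotted column names.
    Lists are serialized as strings (one of the two options discussed)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))     # recurse into nesting
        elif isinstance(value, list):
            flat[name] = str(value)               # or explode into rows
        else:
            flat[name] = value                    # scalars, including None
    return flat

record = {"id": 1,
          "address": {"city": "Austin", "geo": {"lat": 30.3}},
          "tags": ["a", "b"]}
print(flatten(record))
# {'id': 1, 'address.city': 'Austin', 'address.geo.lat': 30.3, 'tags': "['a', 'b']"}
```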

Q2

Implement a function that deduplicates records by a composite key, keeping the most recent.

How to approach this

Sort by the composite key plus timestamp descending, then iterate and keep the first occurrence of each key. Or use a dictionary keyed on the composite key, updating only when the timestamp is newer. Discuss time complexity: sorting is O(n log n), the dictionary approach is O(n). Mention that this is the Python equivalent of the SQL ROW_NUMBER deduplication pattern.
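A sketch of the O(n) dictionary approach (field names are illustrative):

```python
def dedupe(records, key_fields, ts_field="updated_at"):
    """Keep the most recent record per composite key. One pass, O(n):
    the dict holds at most one record per key."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in latest or rec[ts_field] > latest[key][ts_field]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"user": 1, "device": "ios", "updated_at": "2024-01-01", "v": "old"},
    {"user": 1, "device": "ios", "updated_at": "2024-02-01", "v": "new"},
    {"user": 1, "device": "web", "updated_at": "2024-01-15", "v": "only"},
]
out = dedupe(rows, ["user", "device"])
print(sorted(r["v"] for r in out))  # ['new', 'only']
```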

Q3

Write a generator that reads a large CSV file in chunks and yields transformed rows.

How to approach this

Use a generator function with yield to process rows lazily without loading the entire file into memory. Open the file, read a chunk of N lines, transform each row, and yield it. Explain why generators matter for data pipelines: a 50GB file does not fit in memory, but you can process it row by row or chunk by chunk. Mention pandas read_csv with chunksize as the production equivalent.
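A minimal generator sketch; io.StringIO stands in for a real file handle, and the `amount` cast is an illustrative transform:

```python
import csv
import io

def transformed_rows(fileobj):
    """Lazily yield transformed CSV rows; nothing beyond the current
    row is held in memory, so file size does not matter."""
    reader = csv.DictReader(fileobj)
    for row in reader:
        row["amount"] = float(row["amount"])  # example transform: type cast
        yield row

# io.StringIO stands in for open('big.csv'); the code path is identical.
data = io.StringIO("id,amount\n1,10.5\n2,20.0\n")
total = sum(row["amount"] for row in transformed_rows(data))
print(total)  # 30.5
```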

Q4

Given a list of events with timestamps, sessionize them using a 30-minute inactivity gap.

How to approach this

Sort events by user and timestamp. Iterate through events, comparing each timestamp to the previous one for the same user. If the gap exceeds 30 minutes, increment the session counter. Assign the current session ID to the event. This maps directly to a real pipeline use case: web analytics sessionization. Walk through a concrete example with 5 to 6 events.
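A compact sessionizer sketch, assuming events as dicts with `user` and `ts` keys (names are illustrative):

```python
from datetime import datetime, timedelta

def sessionize(events, gap_minutes=30):
    """Assign per-user session IDs based on an inactivity gap."""
    events = sorted(events, key=lambda e: (e["user"], e["ts"]))
    gap = timedelta(minutes=gap_minutes)
    last_ts, session, out = {}, {}, []
    for e in events:
        u = e["user"]
        # New session if this is the user's first event or the gap is exceeded.
        if u not in last_ts or e["ts"] - last_ts[u] > gap:
            session[u] = session.get(u, 0) + 1
        last_ts[u] = e["ts"]
        out.append({**e, "session_id": session[u]})
    return out

t = datetime(2024, 1, 1, 12, 0)
events = [
    {"user": 1, "ts": t},
    {"user": 1, "ts": t + timedelta(minutes=10)},  # same session
    {"user": 1, "ts": t + timedelta(minutes=50)},  # 40-min gap: new session
]
print([e["session_id"] for e in sessionize(events)])  # [1, 1, 2]
```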

Q5

Implement retry logic with exponential backoff for an API call.

How to approach this

Write a decorator or wrapper function that catches exceptions, waits 2^attempt seconds (with jitter), and retries up to a max number of attempts. Jitter prevents thundering herd when multiple workers retry simultaneously. Mention that production systems use libraries like tenacity, but the interviewer wants to see you implement the core logic from scratch.
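A from-scratch decorator sketch; the `base` knob and jitter range are illustrative choices, and catching bare Exception is deliberate for brevity (production code would catch specific transient errors):

```python
import random
import time

def retry(max_attempts=5, base=1.0):
    """Decorator: retry on exception with exponential backoff plus jitter."""
    def wrap(fn):
        def inner(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    # base * 2^attempt seconds, plus jitter to avoid
                    # thundering-herd retries from parallel workers
                    time.sleep(base * 2 ** attempt + random.uniform(0, base))
        return inner
    return wrap

calls = {"n": 0}

@retry(max_attempts=4, base=0.01)  # tiny base so the demo runs fast
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = flaky()
print(result, calls["n"])  # ok 3
```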

40% of interview loops

Data Modeling Interview Questions

Data modeling rounds test your ability to design schemas that serve business requirements. You will be asked to model an e-commerce platform, a social network, a ride-sharing service, or similar domain. Interviewers evaluate whether you choose the right modeling technique (star schema, snowflake, data vault, OBT), handle slowly changing dimensions, and think about query patterns that downstream analysts will run.

Q1

Design a star schema for an e-commerce analytics warehouse.

How to approach this

Identify the grain of the fact table first: one row per order line item. Fact table: order_id, product_id, customer_id, store_id, date_id, quantity, unit_price, discount, total_amount. Dimension tables: dim_product (name, category, brand), dim_customer (name, segment, region), dim_date (date, month, quarter, year, is_weekend), dim_store (name, city, state). Explain why star schemas outperform normalized schemas for analytical queries: fewer joins, simpler SQL, and columnar storage compression benefits.
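The DDL above can be sketched directly; this runs the schema through SQLite via sqlite3 as a syntax check (types are simplified to SQLite's, so treat the column types as illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fact table at the grain of one row per order line item.
CREATE TABLE fact_order_line (
  order_id INTEGER, product_id INTEGER, customer_id INTEGER,
  store_id INTEGER, date_id INTEGER,
  quantity INTEGER, unit_price REAL, discount REAL, total_amount REAL
);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY,
                           name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY,
                           name TEXT, segment TEXT, region TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, date TEXT,
                           month INTEGER, quarter INTEGER, year INTEGER,
                           is_weekend INTEGER);
CREATE TABLE dim_store    (store_id INTEGER PRIMARY KEY,
                           name TEXT, city TEXT, state TEXT);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```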

Q2

How would you handle a slowly changing dimension for customer addresses?

How to approach this

Describe SCD Types 1, 2, and 3. Type 1 overwrites the old value (simple, loses history). Type 2 adds a new row with effective_date, end_date, and is_current flag (preserves full history, most common in practice). Type 3 adds columns for previous values (limited history, rarely used). Recommend Type 2 for the customer address use case because analysts need to know which address was active when an order was placed. Walk through the pipeline logic: compare incoming records to current dimension, insert new rows for changes, update end_date on old rows.

Q3

When would you use a data vault model instead of a star schema?

How to approach this

Data vault separates structural data (hubs), relationships (links), and descriptive data (satellites). Use it when sources change frequently, you need full auditability, or multiple teams load data independently. The tradeoff: data vault is harder to query directly (more joins), so you typically build star schema business vaults on top. Interviewers want to see that you know both approaches and can articulate when each is appropriate.

Q4

Model a ride-sharing platform with riders, drivers, trips, and surge pricing.

How to approach this

Fact table: trips at grain of one row per trip (trip_id, rider_id, driver_id, pickup_location_id, dropoff_location_id, start_time, end_time, distance_miles, base_fare, surge_multiplier, total_fare, tip). Dimensions: dim_rider, dim_driver (with rating history as SCD Type 2), dim_location (lat, long, zone, city), dim_date. Discuss why surge_multiplier belongs in the fact table (it varies per trip) while driver rating belongs in the dimension (it changes slowly). Address the location dimension challenge: do you model pickup and dropoff as two foreign keys to the same dimension, or as separate dimensions?

Q5

What is the medallion architecture and when would you use it?

How to approach this

Bronze (raw ingestion, append-only), Silver (cleaned, deduplicated, typed), Gold (aggregated, business-ready). Used in lakehouse architectures (Databricks, Delta Lake). Each layer has clear ownership and quality contracts. Bronze is cheap and fast to load. Silver handles deduplication, schema enforcement, and NULL handling. Gold serves specific business domains. Mention that this is not a replacement for star schemas; Gold tables often are star schemas.

30% of interview loops

Pipeline Architecture Interview Questions

Pipeline architecture questions appear in system design rounds, usually for mid-to-senior positions. You are given a scenario (build a pipeline that ingests 10M events per day, transforms them, and loads them into a warehouse under a 15-minute SLA) and asked to design the end-to-end system. Interviewers evaluate your understanding of ingestion patterns, transformation strategies, orchestration, failure handling, and monitoring.

Q1

Design a pipeline that ingests clickstream data from a web application into a data warehouse.

How to approach this

Start with the source: a web app emitting events to a message queue (Kafka) for durability and decoupling. A stream processor (Flink or Spark Streaming) consumes events, applies basic validation (schema conformity, timestamp parsing), and writes to a staging area in the warehouse or a data lake. A batch job runs every 15 minutes to transform staged data into analytics-ready tables. Address exactly-once semantics: use idempotent writes with a dedup key (event_id + timestamp). Discuss monitoring: track event lag, throughput, and schema drift.

Q2

How would you make a pipeline idempotent?

How to approach this

Idempotency means running the pipeline twice with the same input produces the same result. Three strategies: (1) DELETE-then-INSERT for the partition being processed, (2) MERGE/UPSERT using a natural key, (3) write to a temp table then atomically swap. Explain why idempotency matters: retries after failures, backfills, and re-processing should not create duplicates or corrupt data. Give a concrete example: a daily pipeline that processes yesterday's data should be safe to re-run without doubling the row count.
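Strategy (1) can be sketched in a few lines against SQLite via sqlite3; the table, partition column, and loader function are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, amount REAL)")

def load_partition(conn, ds, amounts):
    """Idempotent load: delete the target partition, then insert it,
    inside one transaction so a failed run leaves no partial state."""
    with conn:  # sqlite3 context manager commits or rolls back atomically
        conn.execute("DELETE FROM sales WHERE ds = ?", (ds,))
        conn.executemany("INSERT INTO sales VALUES (?, ?)",
                         [(ds, amt) for amt in amounts])

load_partition(conn, "2024-01-01", [10.0, 20.0])
load_partition(conn, "2024-01-01", [10.0, 20.0])  # re-run: safe, no doubling

totals = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(totals)  # (2, 30.0)
```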

Q3

You notice a pipeline SLA breach at 3 AM. Walk me through your debugging process.

How to approach this

Check the orchestrator (Airflow, Dagster) for which task failed and the error message. If a task timed out, check the data volume for that partition (was there a data spike?). If it errored, check the logs for the specific exception. Common causes: source schema changed (new column, changed type), data volume spike, infrastructure issue (out of memory, network timeout), upstream dependency delivered late. After fixing, validate the output data, re-run the failed task, and add monitoring for the root cause to prevent recurrence.

Q4

Compare batch vs streaming for a real-time analytics dashboard.

How to approach this

Batch: simpler, cheaper, easier to debug, but latency is bounded by the batch interval (minimum 1 to 5 minutes). Streaming: sub-second latency, but more complex to build, test, and operate. Exactly-once semantics are harder in streaming. For a real-time dashboard, ask what 'real-time' means: if 5-minute latency is acceptable, micro-batch (Spark Structured Streaming with 1-minute triggers) is simpler than full streaming. If sub-second latency is required, use Kafka + Flink with event-time processing and watermarks.

Q5

How do you handle schema evolution in a data pipeline?

How to approach this

Three levels: (1) source schema changes (new columns, renamed fields), (2) pipeline transformation logic, (3) warehouse schema. Use a schema registry (Confluent Schema Registry for Avro/Protobuf) to detect breaking changes before they hit the pipeline. For additive changes (new nullable column), handle automatically by adding the column downstream. For breaking changes (removed column, type change), fail loudly and alert the on-call engineer. Store raw data in a bronze layer so you can always reprocess after fixing the schema mapping.
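The additive-vs-breaking decision can be sketched as a small classifier; the schema representation as a column-to-type dict and the policy labels are illustrative assumptions, not any registry's actual API:

```python
def classify_schema_change(old, new):
    """Classify a source schema change. Schemas are {column: type} dicts."""
    removed = set(old) - set(new)
    retyped = {c for c in set(old) & set(new) if old[c] != new[c]}
    if removed or retyped:
        return "breaking"   # fail loudly and alert the on-call engineer
    if set(new) - set(old):
        return "additive"   # safe: add the nullable column downstream
    return "unchanged"

old = {"id": "INT", "email": "TEXT"}
print(classify_schema_change(old, {"id": "INT", "email": "TEXT",
                                   "phone": "TEXT"}))     # additive
print(classify_schema_change(old, {"id": "TEXT", "email": "TEXT"}))  # breaking
```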

10-Week Data Engineering Interview Study Plan

This plan assumes you have basic programming experience and some SQL familiarity. Adjust the timeline based on your starting point. The key principle: spend time proportional to interview frequency. SQL gets the most weeks because it appears in the most rounds.

Weeks 1 to 2

SQL Foundations

Spend the first two weeks exclusively on SQL. Cover GROUP BY, JOINs, subqueries, and window functions. Solve 40 to 50 practice problems. By the end of week 2, you should solve medium-difficulty SQL problems in under 15 minutes. SQL appears in 95% of loops, so this is your highest-impact investment.

Weeks 3 to 4

Python Data Manipulation

Shift to Python. Practice parsing JSON, working with dictionaries and lists, writing generators, and using pandas for data transformation. Solve 20 to 30 problems. Focus on clean code and edge case handling. If your target company does not include a Python round (some do not), reduce this to one week and add the extra week to modeling.

Weeks 5 to 6

Data Modeling

Study star schemas, snowflake schemas, data vault, and the medallion architecture. Practice designing schemas for common domains: e-commerce, social networks, ride-sharing, SaaS products. For each design, write the DDL and explain your tradeoffs out loud. Modeling questions reward clear communication as much as technical correctness.

Weeks 7 to 8

Pipeline Architecture and System Design

Study ingestion patterns (batch vs streaming), orchestration (Airflow, Dagster), idempotency, failure handling, and monitoring. Practice drawing architecture diagrams on a whiteboard or digital canvas. For each design, articulate the tradeoffs: why Kafka instead of a simple API? Why Spark instead of pandas? System design rounds reward structured thinking and clear tradeoff analysis.

Weeks 9 to 10

Mock Interviews and Weak Spots

Run full mock interviews: one SQL round (45 minutes), one Python round (45 minutes), one system design round (45 minutes). Identify weak areas and revisit them. Switch to timed drills for SQL and Python. Practice explaining your thought process out loud while writing code. Many strong engineers fail interviews not because they cannot solve the problem, but because they cannot communicate their approach clearly.

How to Succeed in Data Engineering Interviews

Think out loud. Interviewers cannot evaluate your reasoning if you sit silently for 5 minutes and then produce an answer. Narrate your thought process: state the problem in your own words, identify the key tables and relationships, outline your approach, then write the code. If you get stuck, saying what you are stuck on is better than staring at the screen.

Ask clarifying questions. Good interviewers expect it. Bad candidates assume. When you get a schema design question, ask about the expected query patterns. When you get a SQL problem, ask about edge cases: are there NULLs? Can a user have zero orders? Is the data deduplicated? These questions show production thinking.

Prioritize correctness over cleverness. A simple, correct solution that uses a CTE and two JOINs beats a clever one-liner that the interviewer cannot follow. Readability matters in production code and it matters in interviews. Use meaningful aliases, break complex logic into steps, and format your code neatly.

Know your tradeoffs. Every design decision has a tradeoff. Batch vs streaming. Star schema vs data vault. Idempotent upserts vs append-only tables. Interviewers do not want a single right answer; they want to see that you understand the tradeoffs and can justify your choice for the given requirements.

Practice under realistic conditions. Solving problems in your IDE with autocomplete and documentation is not the same as solving them in a shared editor during a Zoom call. Practice in a plain text editor with a timer. Practice explaining your approach to someone else (or to yourself out loud). The interview is a performance, and performances improve with rehearsal.

Data Engineering Interview FAQ

What is the most important topic for data engineering interviews?
SQL, by a wide margin. It appears in 95% of interview loops, often in multiple rounds. If you only have time to prepare one topic, make it SQL. Specifically: window functions, CTEs, JOINs, and aggregation. These four sub-topics cover the vast majority of SQL interview questions.
How many data engineering interview questions should I practice?
Quality over quantity. Aim for 80 to 100 total across all domains: 50 SQL, 20 Python, 10 modeling exercises, and 5 system design walkthroughs. Solve each problem completely (write the code, run it, check edge cases) rather than skimming 200 questions without going deep on any of them.
Do I need to know Spark or Airflow for data engineering interviews?
It depends on the role. Most interviews test fundamentals (SQL, Python, modeling) rather than specific tools. But if the job description mentions Spark, Airflow, Kafka, or dbt, expect at least one question about that tool. Knowing the concepts behind orchestration and distributed processing is more valuable than memorizing API syntax.
What is the typical structure of a data engineering interview loop?
Most loops include 4 to 6 rounds: a recruiter screen, a technical phone screen (usually SQL), 2 to 3 onsite rounds (SQL, Python or coding, system design or data modeling), and a behavioral round. Some companies add a take-home assignment before the onsite. The exact structure varies, but SQL and at least one design-oriented round are almost universal.
How long should I study before applying to data engineering roles?
If you already have programming experience and some SQL knowledge, 8 to 10 weeks of focused preparation is enough for most candidates. If you are starting from zero, plan for 4 to 6 months. The study plan on this page covers 10 weeks and assumes you already know basic programming. Adjust based on your starting point.

1,042 Questions. Four Domains. One Plan.

Distribute your prep time the same way interviewers distribute their questions: 41/35/18/3.