Of the 1,042 verified DE interview rounds we analyzed, SQL showed up in 41% of questions, Python in 35%, data modeling in 18%, and system design in 3%. L5 senior candidates faced the bulk of these rounds: 61% of the total. The questions on this page mirror that distribution exactly, so your practice time tracks the real weight of what interviewers actually ask.
Run every SQL and Python example against a live Postgres or Docker sandbox on DataDriven before the real thing.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Our corpus spans 275 companies across 1,042 rounds. Phone screens account for 33% of rounds, technical screens 21%, onsite SQL 12%. Here's the domain frequency breakdown from that dataset.
SQL appears in nearly every data engineering interview loop. Phone screens, take-home assignments, and onsite rounds all include SQL. The format is consistent: you get a schema, a business question, and 15 to 30 minutes to write a query. Interviewers evaluate correctness, edge case handling, readability, and sometimes performance. GROUP BY is the most frequently tested keyword, followed by JOINs and window functions.
How to approach this
Use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) in a CTE, then filter WHERE rn = 1. Mention why ROW_NUMBER is better than DISTINCT or GROUP BY for this problem: you control which duplicate to keep, and you can access all columns without re-joining. Bring up the fact that this is a daily pipeline operation, not just an interview trick.
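A runnable sketch of this pattern, using Python's sqlite3 driver (SQLite 3.25+ supports window functions); the users table and its columns are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, email TEXT, updated_at TEXT);
INSERT INTO users VALUES
  (1, 'a@old.com', '2024-01-01'),
  (1, 'a@new.com', '2024-06-01'),  -- newer duplicate: this one should survive
  (2, 'b@x.com',   '2024-03-01');
""")
rows = conn.execute("""
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM users
)
SELECT user_id, email FROM ranked WHERE rn = 1
ORDER BY user_id
""").fetchall()
# rows -> [(1, 'a@new.com'), (2, 'b@x.com')]
```

Because all columns survive into the CTE, you can select anything from the winning row without re-joining.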
How to approach this
Aggregate to monthly grain with DATE_TRUNC and SUM, then use LAG to get previous month revenue in a CTE. Compute (current - previous) / NULLIF(previous, 0) * 100. Explain why NULLIF prevents division by zero for the first month. Interviewers will ask about the first row where LAG returns NULL.
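A minimal sketch of the same logic, run through sqlite3 (SQLite has no DATE_TRUNC, so strftime('%Y-%m', ...) stands in for it; the orders table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('2024-01-05', 100), ('2024-01-20', 100),  -- Jan revenue: 200
  ('2024-02-10', 300);                       -- Feb revenue: 300
""")
rows = conn.execute("""
WITH monthly AS (
  SELECT strftime('%Y-%m', order_date) AS month,
         SUM(amount) AS revenue
  FROM orders GROUP BY 1
),
with_prev AS (
  SELECT month, revenue,
         LAG(revenue) OVER (ORDER BY month) AS prev_revenue
  FROM monthly
)
SELECT month,
       ROUND((revenue - prev_revenue) * 100.0
             / NULLIF(prev_revenue, 0), 1) AS growth_pct
FROM with_prev
ORDER BY month
""").fetchall()
# rows -> [('2024-01', None), ('2024-02', 50.0)]
```

The first month's growth is NULL (Python None) because LAG has nothing to look back at, which is exactly the follow-up interviewers probe.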
How to approach this
This is a gap-and-island problem. Assign ROW_NUMBER() within each user ordered by date, then subtract row_number from the date to create a grouping key. Consecutive dates produce the same grouping key. GROUP BY user and grouping key, then HAVING COUNT(*) >= 3. Walk the interviewer through why the subtraction trick works with a small example.
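A compact sketch of the subtraction trick via sqlite3 (julianday turns the date into a number so the arithmetic works; the logins table is hypothetical, and assumes one row per user per day):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INT, login_date TEXT);
INSERT INTO logins VALUES
  (1, '2024-01-01'), (1, '2024-01-02'), (1, '2024-01-03'),  -- 3-day streak
  (1, '2024-01-10'),                                        -- isolated day
  (2, '2024-01-01'), (2, '2024-01-03');                     -- gap, no streak
""")
rows = conn.execute("""
WITH numbered AS (
  SELECT user_id,
         -- consecutive dates minus an increasing row number = constant key
         julianday(login_date) - ROW_NUMBER() OVER (
             PARTITION BY user_id ORDER BY login_date) AS grp
  FROM logins
)
SELECT user_id FROM numbered
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
""").fetchall()
# rows -> [(1,)]
```

Walking through user 1: dates d, d+1, d+2 minus row numbers 1, 2, 3 all yield d-1, so the streak collapses into one group.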
How to approach this
Self-join the table to itself: FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.manager_id AND e1.id < e2.id. The strict inequality (< rather than !=) prevents duplicate pairs (Alice-Bob and Bob-Alice). Mention that with no id condition at all you get each pair twice plus self-pairs, and that != removes the self-pairs but still returns each pair twice.
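A runnable sketch through sqlite3, with a made-up employees table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
  (1, 'Alice', 10), (2, 'Bob', 10),  -- same manager: one pair
  (3, 'Carol', 20);                  -- different manager: no pair
""")
rows = conn.execute("""
SELECT e1.name, e2.name
FROM employees e1
JOIN employees e2
  ON e1.manager_id = e2.manager_id
 AND e1.id < e2.id   -- strict inequality: one row per unordered pair
""").fetchall()
# rows -> [('Alice', 'Bob')]
```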
How to approach this
Base case: SELECT employees WHERE manager_id = target. Recursive case: JOIN employees to the CTE on manager_id. Include a depth counter. Mention the importance of a termination condition to prevent infinite loops in circular org data (add WHERE depth < 20 or track visited IDs).
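A minimal sketch of the recursive CTE via sqlite3 (employee ids and the depth cap of 20 are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
  (1, 'CEO', NULL),
  (2, 'Ann', 1), (3, 'Bob', 1),  -- direct reports of employee 1
  (4, 'Cy',  2);                 -- reports to Ann, depth 2
""")
rows = conn.execute("""
WITH RECURSIVE reports AS (
  -- base case: direct reports of the target manager
  SELECT id, name, 1 AS depth FROM employees WHERE manager_id = 1
  UNION ALL
  -- recursive case: reports of people already found
  SELECT e.id, e.name, r.depth + 1
  FROM employees e JOIN reports r ON e.manager_id = r.id
  WHERE r.depth < 20   -- guards against cycles in dirty org data
)
SELECT name, depth FROM reports ORDER BY depth, id
""").fetchall()
# rows -> [('Ann', 1), ('Bob', 1), ('Cy', 2)]
```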
How to approach this
Use AVG with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. Wrap in a CTE and add ROW_NUMBER to filter out the first 6 rows (partial windows). Explain ROWS vs RANGE: RANGE groups ties, ROWS counts physical rows. Most candidates forget about partial windows; bringing it up proactively impresses interviewers.
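A runnable sketch via sqlite3, with an invented daily_sales table. Day 8 doubles one value so the two complete windows differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales (day TEXT, sales INT);
INSERT INTO daily_sales VALUES
  ('2024-01-01', 7), ('2024-01-02', 7), ('2024-01-03', 7),
  ('2024-01-04', 7), ('2024-01-05', 7), ('2024-01-06', 7),
  ('2024-01-07', 7), ('2024-01-08', 14);
""")
rows = conn.execute("""
WITH windowed AS (
  SELECT day,
         AVG(sales) OVER (ORDER BY day
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7,
         ROW_NUMBER() OVER (ORDER BY day) AS rn
  FROM daily_sales
)
SELECT day, ma7 FROM windowed
WHERE rn >= 7   -- drop partial windows (fewer than 7 rows)
ORDER BY day
""").fetchall()
# rows -> [('2024-01-07', 7.0), ('2024-01-08', 8.0)]
```

A COUNT(*) over the same frame compared to 7 works equally well as the partial-window filter.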
How to approach this
DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC). Filter WHERE rank <= 3. Explain why DENSE_RANK, not RANK or ROW_NUMBER: DENSE_RANK keeps all tied products and does not skip ranks, so you get a true top-3 even with ties. RANK would skip numbers, ROW_NUMBER would arbitrarily exclude tied products.
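A sketch via sqlite3 with a hypothetical sales table; the tie at rank 2 shows why DENSE_RANK can return more than three rows, all legitimately in the top 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (category TEXT, product TEXT, revenue INT);
INSERT INTO sales VALUES
  ('a', 'p1', 100),
  ('a', 'p2', 90), ('a', 'p3', 90),  -- tied at dense rank 2
  ('a', 'p4', 80),                   -- dense rank 3, kept
  ('a', 'p5', 70);                   -- dense rank 4, excluded
""")
rows = conn.execute("""
WITH ranked AS (
  SELECT category, product,
         DENSE_RANK() OVER (
           PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM sales
)
SELECT product FROM ranked WHERE rnk <= 3
""").fetchall()
# four products qualify: p1, p2, p3, p4
```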
Python appears in about two-thirds of DE interview loops, usually as a coding round or a take-home assignment. The questions focus on data manipulation, not algorithms. You will parse files, transform nested structures, implement pipeline logic, and work with pandas DataFrames. Interviewers care about clean code, edge case handling, and whether you can explain your approach while writing it.
How to approach this
Use recursion to walk the JSON tree, building column names by concatenating keys with a separator (e.g., 'address.city'). Handle arrays by either exploding them into rows or serializing them as strings. Show that you think about edge cases: empty arrays, null values, inconsistent nesting depths. Mention that in production you would use a library like pandas json_normalize, but writing the recursive version shows you understand the underlying logic.
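A minimal recursive flattener along these lines (the record, separator choice, and list-as-string policy are all illustrative):

```python
def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts into dot-separated column names."""
    out = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            out.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            out[full_key] = str(value)  # or explode into one row per element
        else:
            out[full_key] = value       # scalars (incl. None) pass through
    return out

record = {"id": 1,
          "address": {"city": "Oslo", "geo": {"lat": 59.9}},
          "tags": ["a", "b"]}
flat = flatten(record)
# {'id': 1, 'address.city': 'Oslo', 'address.geo.lat': 59.9,
#  'tags': "['a', 'b']"}
```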
How to approach this
Sort by the composite key plus timestamp descending, then iterate and keep the first occurrence of each key. Or use a dictionary keyed on the composite key, updating only when the timestamp is newer. Discuss time complexity: sorting is O(n log n), the dictionary approach is O(n). Mention that this is the Python equivalent of the SQL ROW_NUMBER deduplication pattern.
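The O(n) dictionary variant might look like this (field names are made up; ISO timestamps compare correctly as strings):

```python
def dedupe(rows, key_fields, ts_field="updated_at"):
    """Keep the newest row per composite key -- single O(n) pass."""
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in latest or row[ts_field] > latest[key][ts_field]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"user_id": 1, "email": "a@old.com", "updated_at": "2024-01-01"},
    {"user_id": 1, "email": "a@new.com", "updated_at": "2024-06-01"},
    {"user_id": 2, "email": "b@x.com",  "updated_at": "2024-03-01"},
]
deduped = dedupe(rows, ["user_id"])
# keeps a@new.com for user 1 and b@x.com for user 2
```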
How to approach this
Use a generator function with yield to process rows lazily without loading the entire file into memory. Open the file, read a chunk of N lines, transform each row, and yield it. Explain why generators matter for data pipelines: a 50GB file does not fit in memory, but you can process it row by row or chunk by chunk. Mention pandas read_csv with chunksize as the production equivalent.
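A sketch of the generator approach, demonstrated on a small temp file (the amount column and the float cast are a stand-in for whatever transform the question asks for):

```python
import csv
import os
import tempfile

def transform_rows(path):
    """Lazily yield transformed rows -- memory use is constant in file size."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            row["amount"] = float(row["amount"])  # hypothetical transform
            yield row

# demo: write a tiny CSV, stream it back through the generator
with tempfile.NamedTemporaryFile("w", suffix=".csv",
                                 delete=False, newline="") as f:
    f.write("id,amount\n1,10.5\n2,20.0\n")
    path = f.name
total = sum(r["amount"] for r in transform_rows(path))  # 30.5
os.unlink(path)
```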
How to approach this
Sort events by user and timestamp. Iterate through events, comparing each timestamp to the previous one for the same user. If the gap exceeds 30 minutes, increment the session counter. Assign the current session ID to the event. This maps directly to a real pipeline use case: web analytics sessionization. Walk through a concrete example with 5 to 6 events.
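A small sketch of that loop (event shape and session-id scheme are illustrative; session ids here restart per user):

```python
from datetime import datetime, timedelta

def sessionize(events, gap_minutes=30):
    """Assign a per-user session id to (user, iso_timestamp) events."""
    events = sorted(events, key=lambda e: (e[0], e[1]))
    result, last_seen, session_ids = [], {}, {}
    for user, ts in events:
        t = datetime.fromisoformat(ts)
        # first event for this user, or gap exceeded -> new session
        if user not in last_seen or \
           t - last_seen[user] > timedelta(minutes=gap_minutes):
            session_ids[user] = session_ids.get(user, 0) + 1
        last_seen[user] = t
        result.append((user, ts, session_ids[user]))
    return result

events = [
    ("u1", "2024-01-01T10:00:00"),
    ("u1", "2024-01-01T10:20:00"),  # 20 min gap -> same session
    ("u1", "2024-01-01T11:00:00"),  # 40 min gap -> new session
]
sessions = sessionize(events)
# session ids -> [1, 1, 2]
```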
How to approach this
Write a decorator or wrapper function that catches exceptions, waits 2^attempt seconds (with jitter), and retries up to a max number of attempts. Jitter prevents thundering herd when multiple workers retry simultaneously. Mention that production systems use libraries like tenacity, but the interviewer wants to see you implement the core logic from scratch.
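A from-scratch sketch of such a decorator (delay constants and the demo function are arbitrary; catching bare Exception is for brevity, and a real version would target transient errors only):

```python
import functools
import random
import time

def retry(max_attempts=5, base_delay=1.0):
    """Retry a function with exponential backoff plus jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    # 2^attempt backoff; jitter avoids thundering herd
                    delay = base_delay * 2 ** attempt \
                            + random.uniform(0, base_delay)
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(max_attempts=4, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = flaky()  # succeeds on the third attempt
```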
Data modeling rounds test your ability to design schemas that serve business requirements. You will be asked to model an e-commerce platform, a social network, a ride-sharing service, or similar domain. Interviewers evaluate whether you choose the right modeling technique (star schema, snowflake, data vault, OBT), handle slowly changing dimensions, and think about query patterns that downstream analysts will run.
How to approach this
Identify the grain of the fact table first: one row per order line item. Fact table: order_id, product_id, customer_id, store_id, date_id, quantity, unit_price, discount, total_amount. Dimension tables: dim_product (name, category, brand), dim_customer (name, segment, region), dim_date (date, month, quarter, year, is_weekend), dim_store (name, city, state). Explain why star schemas outperform normalized schemas for analytical queries: fewer joins, simpler SQL, and columnar storage compression benefits.
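The design above can be sketched as DDL; this version runs against SQLite for convenience, and the column lists are illustrative rather than exhaustive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- grain: one row per order line item
CREATE TABLE fact_order_items (
  order_id INT, product_id INT, customer_id INT, store_id INT, date_id INT,
  quantity INT, unit_price REAL, discount REAL, total_amount REAL
);
CREATE TABLE dim_product  (product_id INT PRIMARY KEY,
                           name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_customer (customer_id INT PRIMARY KEY,
                           name TEXT, segment TEXT, region TEXT);
CREATE TABLE dim_date     (date_id INT PRIMARY KEY, date TEXT,
                           month INT, quarter INT, year INT, is_weekend INT);
CREATE TABLE dim_store    (store_id INT PRIMARY KEY,
                           name TEXT, city TEXT, state TEXT);
""")
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```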
How to approach this
Describe SCD Types 1, 2, and 3. Type 1 overwrites the old value (simple, loses history). Type 2 adds a new row with effective_date, end_date, and is_current flag (preserves full history, most common in practice). Type 3 adds columns for previous values (limited history, rarely used). Recommend Type 2 for the customer address use case because analysts need to know which address was active when an order was placed. Walk through the pipeline logic: compare incoming records to current dimension, insert new rows for changes, update end_date on old rows.
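A toy sketch of that Type 2 pipeline step via sqlite3 (table shape, the '9999-12-31' open-ended sentinel, and the helper function are all assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
  customer_id INT, address TEXT,
  effective_date TEXT, end_date TEXT, is_current INT
);
INSERT INTO dim_customer VALUES (1, 'Old St', '2023-01-01', '9999-12-31', 1);
""")

def apply_scd2(conn, customer_id, new_address, load_date):
    """SCD Type 2: close the current row, insert the new version."""
    cur = conn.execute(
        "SELECT address FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)).fetchone()
    if cur and cur[0] == new_address:
        return  # unchanged: nothing to do
    conn.execute(
        "UPDATE dim_customer SET end_date = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1", (load_date, customer_id))
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_address, load_date))

apply_scd2(conn, 1, 'New Ave', '2024-06-01')
history = conn.execute(
    "SELECT address, is_current FROM dim_customer "
    "ORDER BY effective_date").fetchall()
# history -> [('Old St', 0), ('New Ave', 1)]
```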
How to approach this
Data vault separates structural data (hubs), relationships (links), and descriptive data (satellites). Use it when sources change frequently, you need full auditability, or multiple teams load data independently. The tradeoff: data vault is harder to query directly (more joins), so you typically build star schema business vaults on top. Interviewers want to see that you know both approaches and can articulate when each is appropriate.
How to approach this
Fact table: trips at grain of one row per trip (trip_id, rider_id, driver_id, pickup_location_id, dropoff_location_id, start_time, end_time, distance_miles, base_fare, surge_multiplier, total_fare, tip). Dimensions: dim_rider, dim_driver (with rating history as SCD Type 2), dim_location (lat, long, zone, city), dim_date. Discuss why surge_multiplier belongs in the fact table (it varies per trip) while driver rating belongs in the dimension (it changes slowly). Address the location dimension challenge: do you model pickup and dropoff as two foreign keys to the same dimension, or as separate dimensions?
How to approach this
Bronze (raw ingestion, append-only), Silver (cleaned, deduplicated, typed), Gold (aggregated, business-ready). Used in lakehouse architectures (Databricks, Delta Lake). Each layer has clear ownership and quality contracts. Bronze is cheap and fast to load. Silver handles deduplication, schema enforcement, and NULL handling. Gold serves specific business domains. Mention that this is not a replacement for star schemas; Gold tables often are star schemas.
Pipeline architecture questions appear in system design rounds, usually for mid-to-senior level positions. You are given a scenario (build a pipeline that ingests 10M events per day, transforms them, and loads them into a warehouse under a 15-minute SLA) and asked to design the end-to-end system. Interviewers evaluate your understanding of ingestion patterns, transformation strategies, orchestration, failure handling, and monitoring.
How to approach this
Start with the source: a web app emitting events to a message queue (Kafka) for durability and decoupling. A stream processor (Flink or Spark Streaming) consumes events, applies basic validation (schema conformity, timestamp parsing), and writes to a staging area in the warehouse or a data lake. A batch job runs every 15 minutes to transform staged data into analytics-ready tables. Address exactly-once semantics: use idempotent writes with a dedup key (event_id + timestamp). Discuss monitoring: track event lag, throughput, and schema drift.
How to approach this
Idempotency means running the pipeline twice with the same input produces the same result. Three strategies: (1) DELETE-then-INSERT for the partition being processed, (2) MERGE/UPSERT using a natural key, (3) write to a temp table then atomically swap. Explain why idempotency matters: retries after failures, backfills, and re-processing should not create duplicates or corrupt data. Give a concrete example: a daily pipeline that processes yesterday's data should be safe to re-run without doubling the row count.
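Strategy (1) can be sketched in a few lines against SQLite (table and partition column are hypothetical); the point is that re-running the load leaves the row count unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, amount REAL)")

def load_partition(conn, ds, amounts):
    """Idempotent load: wipe the partition, then insert, in one transaction."""
    with conn:  # commit on success, rollback on error
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                         [(ds, amt) for amt in amounts])

load_partition(conn, "2024-06-01", [10.0, 20.0])
load_partition(conn, "2024-06-01", [10.0, 20.0])  # re-run: no duplicates
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales").fetchone()
# count -> 2, total -> 30.0 despite two runs
```

Wrapping delete and insert in one transaction matters: a crash between the two steps must not leave the partition half-loaded.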
How to approach this
Check the orchestrator (Airflow, Dagster) for which task failed and the error message. If a task timed out, check the data volume for that partition (was there a data spike?). If it errored, check the logs for the specific exception. Common causes: source schema changed (new column, changed type), data volume spike, infrastructure issue (out of memory, network timeout), upstream dependency delivered late. After fixing, validate the output data, re-run the failed task, and add monitoring for the root cause to prevent recurrence.
How to approach this
Batch: simpler, cheaper, easier to debug, but latency is bounded by the batch interval (minimum 1 to 5 minutes). Streaming: sub-second latency, but more complex to build, test, and operate. Exactly-once semantics are harder in streaming. For a real-time dashboard, ask what 'real-time' means: if 5-minute latency is acceptable, micro-batch (Spark Structured Streaming with 1-minute triggers) is simpler than full streaming. If sub-second latency is required, use Kafka + Flink with event-time processing and watermarks.
How to approach this
Three levels: (1) source schema changes (new columns, renamed fields), (2) pipeline transformation logic, (3) warehouse schema. Use a schema registry (Confluent Schema Registry for Avro/Protobuf) to detect breaking changes before they hit the pipeline. For additive changes (new nullable column), handle automatically by adding the column downstream. For breaking changes (removed column, type change), fail loudly and alert the on-call engineer. Store raw data in a bronze layer so you can always reprocess after fixing the schema mapping.
This plan assumes you have basic programming experience and some SQL familiarity. Adjust the timeline based on your starting point. The key principle: spend time proportional to interview frequency. SQL gets the most weeks because it appears in the most rounds.
Spend the first two weeks exclusively on SQL. Cover GROUP BY, JOINs, subqueries, and window functions. Solve 40 to 50 practice problems. By the end of week 2, you should solve medium-difficulty SQL problems in under 15 minutes. SQL appears in 95% of loops, so this is your highest-impact investment.
Shift to Python. Practice parsing JSON, working with dictionaries and lists, writing generators, and using pandas for data transformation. Solve 20 to 30 problems. Focus on clean code and edge case handling. If your target company does not include a Python round (some do not), reduce this to one week and add the extra week to modeling.
Study star schemas, snowflake schemas, data vault, and the medallion architecture. Practice designing schemas for common domains: e-commerce, social networks, ride-sharing, SaaS products. For each design, write the DDL and explain your tradeoffs out loud. Modeling questions reward clear communication as much as technical correctness.
Study ingestion patterns (batch vs streaming), orchestration (Airflow, Dagster), idempotency, failure handling, and monitoring. Practice drawing architecture diagrams on a whiteboard or digital canvas. For each design, articulate the tradeoffs: why Kafka instead of a simple API? Why Spark instead of pandas? System design rounds reward structured thinking and clear tradeoff analysis.
Run full mock interviews: one SQL round (45 minutes), one Python round (45 minutes), one system design round (45 minutes). Identify weak areas and revisit them. Switch to timed drills for SQL and Python. Practice explaining your thought process out loud while writing code. Many strong engineers fail interviews not because they cannot solve the problem, but because they cannot communicate their approach clearly.
Think out loud. Interviewers cannot evaluate your reasoning if you sit silently for 5 minutes and then produce an answer. Narrate your thought process: state the problem in your own words, identify the key tables and relationships, outline your approach, then write the code. If you get stuck, saying what you are stuck on is better than staring at the screen.
Ask clarifying questions. Good interviewers expect it. Bad candidates assume. When you get a schema design question, ask about the expected query patterns. When you get a SQL problem, ask about edge cases: are there NULLs? Can a user have zero orders? Is the data deduplicated? These questions show production thinking.
Prioritize correctness over cleverness. A simple, correct solution that uses a CTE and two JOINs beats a clever one-liner that the interviewer cannot follow. Readability matters in production code and it matters in interviews. Use meaningful aliases, break complex logic into steps, and format your code neatly.
Know your tradeoffs. Every design decision has a tradeoff. Batch vs streaming. Star schema vs data vault. Idempotent upserts vs append-only tables. Interviewers do not want a single right answer; they want to see that you understand the tradeoffs and can justify your choice for the given requirements.
Practice under realistic conditions. Solving problems in your IDE with autocomplete and documentation is not the same as solving them in a shared editor during a Zoom call. Practice in a plain text editor with a timer. Practice explaining your approach to someone else (or to yourself out loud). The interview is a performance, and performances improve with rehearsal.
Distribute your prep time the same way interviewers distribute their questions: 41/35/18/3.