Data Engineer Interview Questions
Twenty-five worked questions across the four technical rounds. The version that earns the senior signal and the version that loses it. Pulled from verified interview reports and weighted by what actually shows up.
The data engineering loop is four technical rounds. SQL on almost every loop. Python on roughly two-thirds. Modeling on the ones at senior level. Pipeline architecture on the design round. Below: five to seven worked answers per round, with the version that earns the senior signal and the version that loses it.
SQL interview questions
The seven patterns that come up most: window functions, CTEs, gaps-and-islands, and the edge cases interviewers deliberately leave ambiguous.
Write a query to deduplicate a table, keeping only the most recent record per user.
The cleanest answer is a CTE with ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC), then SELECT WHERE rn = 1. The reason this beats DISTINCT or a GROUP BY with MAX(updated_at) is that you keep the other columns from the chosen row without a self-join back to the table. If the interviewer asks why you didn't use QUALIFY, the answer is dialect: Snowflake and BigQuery have it, Postgres doesn't.
Calculate month-over-month revenue growth as a percentage.
Bucket to month with DATE_TRUNC('month', order_date) and SUM the revenue. In a second CTE, LAG that monthly figure by one row ordered by month. The growth column is (curr - prev) / NULLIF(prev, 0) * 100. The NULLIF is the part interviewers want you to volunteer, because the first month will have a NULL previous and naive division will throw or silently produce inf.
Find users who were active for at least 3 consecutive days.
Classic gaps-and-islands. The trick: for each user, take the activity date and subtract ROW_NUMBER() ordered by that date. Runs of consecutive dates produce a constant difference, so you can GROUP BY (user_id, date - rn) and HAVING COUNT(*) >= 3. If you have never seen the pattern, the best move is to talk through what makes the run detectable rather than guess, because this is one of about five tricks interviewers treat as standard.
Self-join to find pairs of employees who share the same manager.
SELECT e1.name, e2.name FROM employees e1 JOIN employees e2 ON e1.manager_id = e2.manager_id AND e1.id < e2.id. The strict inequality is the whole question. Using != gives you both (A, B) and (B, A) plus self-pairs, which is the failure mode interviewers fish for.
Recursive CTE: find every report under a given manager, direct and indirect.
Anchor: WHERE manager_id = :target_manager. Recursive part: JOIN employees e ON e.manager_id = cte.employee_id. Carry a depth column so you can both bound the recursion (most engines cap around 100 by default) and answer the natural follow-up about org depth. Real HR data sometimes has cycles from bad data entry, which is why the depth guard matters in production, not just for the interview.
7-day rolling average of daily revenue, excluding partial windows.
AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). The catch is the first six days, which compute averages over fewer than seven rows. Wrap in a CTE, add ROW_NUMBER ordered by day, filter rn >= 7. If pushed, explain the ROWS vs RANGE distinction. RANGE collapses ties, ROWS counts physical preceding rows. For daily revenue where each day is one row, they behave the same. On intraday data they don't.
Top 3 products by revenue in each category, handling ties.
DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) and filter <= 3. The point of choosing DENSE_RANK over RANK or ROW_NUMBER is what happens when two products tie at 3rd. RANK skips the next number so you lose 4th. ROW_NUMBER picks one arbitrarily so you might drop a real top-3 finisher. DENSE_RANK keeps the ties and continues counting, which is almost always what the business actually wants.
The four rounds, in plain terms
SQL. (Roughly 95% of loops.) Of all the question types in our corpus, SQL is the one that almost every candidate sees. It shows up on the phone screen, often again on the onsite, and sometimes inside a take-home. The format is usually a small schema and a business question with 15 to 30 minutes to write a working query. What gets you graded down is rarely syntax, it's a missed NULL case, a wrong grain, or a window definition that ties when the interviewer wanted a strict order.
Python. (About two-thirds of loops.) Python shows up as a 45-minute coding round or, increasingly often, a four-hour take-home with sample data. The bar is lower than a software engineering coding interview, but the surface area is different. You should be fluent in dict and list comprehensions, json, csv, itertools, and at least one of pandas or polars. Algorithmic puzzles like dynamic programming almost never come up. Messy real-world parsing always does.
Data Modeling. (About 4 in 10 loops, more for senior.) A modeling round is usually a 45-minute whiteboard exercise where you're given a product (a marketplace, a rideshare app, a payments ledger) and a couple of analytical questions to support. The interviewer is checking whether you can pick the right grain, decide which attributes change over time, and resist the urge to over-normalize. Senior candidates get pushed harder on tradeoffs. Junior candidates mostly need to demonstrate they know the difference between a fact and a dimension.
Pipeline Architecture. (Around 30% of loops, most at senior.) The system design round on a DE loop usually gives you a concrete scenario: ten million events a day, a fifteen-minute freshness SLA, a downstream BI tool that can't deal with table swaps. You're expected to whiteboard the end-to-end system. The grading rubric isn't getting the 'right' architecture, it's whether you reason out loud about durability, replay, idempotency, and what happens when your upstream dies on a holiday.
Know the patterns before the interviewer asks them.
What Everyone Is Watching
Someone is watching. Capture everything.
Pulled from debriefs where system design separated levels.
Python interview questions
The parsing and transformation patterns that dominate DE Python rounds. Standard library first; pandas only when the prompt is genuinely tabular.
Parse a nested JSON file and flatten it into a tabular structure.
Walk the tree recursively, joining keys with a dot to form column names ('user.address.zip'). The judgement call is how to handle arrays. If the array represents multiple records, explode into separate rows. If it's a fixed-size attribute like a list of three coordinates, you might serialize or split into positional columns. The interviewer is watching for whether you ask. In production you'd reach for pandas.json_normalize or pl.from_records, but the recursive version shows you understand what the library is doing.
Deduplicate records by a composite key, keeping the most recent.
Two options worth comparing. A dict keyed by the composite key, updating on a newer timestamp, runs in O(n) with O(k) memory. A sort by (key, timestamp DESC) followed by a pass keeping the first occurrence of each key runs in O(n log n) but is friendlier when the dataset doesn't fit in memory and you can stream-sort externally. The dict approach is the SQL ROW_NUMBER pattern in Python.
Generator that reads a large CSV in chunks and yields transformed rows.
Open the file with csv.DictReader, iterate, yield transformed rows one at a time, and let the caller decide what to do with them. The point of generators here is memory: a 50 GB file won't fit anywhere useful, but the line-at-a-time iterator never holds more than one row at a time. The pandas equivalent is read_csv(chunksize=10_000), which gives back a generator of DataFrames if you prefer to vectorize within each chunk.
Sessionize an event stream with a 30-minute inactivity gap.
Sort by (user_id, event_ts). For each user, walk the events and start a new session whenever the gap from the prior event exceeds 30 minutes. Assign a per-user incrementing session counter. The version with itertools.groupby is concise. The version with a plain for-loop and a previous_ts variable is what most candidates write and is fine. The mistake to avoid: comparing against the first event of the session instead of the immediately previous event, which is a different and rarer definition of session.
Retry logic with exponential backoff for an API call.
A decorator that catches a narrow set of exceptions (requests.RequestException, not bare Exception), sleeps 2**attempt seconds plus a random jitter, and gives up after N tries. Jitter matters because without it, every worker that hit the API at the same time retries at exactly the same later time, which is how you turn one outage into a self-inflicted DDoS. Name-drop tenacity as the library you'd actually use in production.
What separates a pass from a no-hire
Narrate while you type. The interviewer is filling out a rubric with categories like 'communication' and 'problem decomposition' in real time. If you go quiet for three minutes and then drop a correct one-liner, those rows on the rubric stay empty. Talk while you think, even if what you're saying is just 'okay, the grain is one row per shipment, so the dedup key should include shipment_id and event_ts.'
Ask the boring clarifying questions first. Can a user have zero orders. Are NULLs possible in the join key. Is the timestamp UTC or local. Are duplicates expected upstream. These are the four questions that catch out 80% of candidates on a SQL round, because most prompts deliberately leave one ambiguous to see if you notice.
Correct beats clever. A query with three CTEs that an interviewer can read in 30 seconds is a better answer than a single nested subquery that takes them five minutes to parse. The interviewer is also the person who will hypothetically code-review you for the next two years. They're imagining that experience as you type.
Always articulate the tradeoff you didn't pick. When you choose Type 2 over Type 1 for a dimension, name what you give up (extra rows, harder current-state queries). When you pick batch over streaming, name what you give up (sub-second latency). Stating the rejected option is what separates an L4 answer from an L5 answer.
Practice in the real environment. CoderPad and HackerRank-style shared editors have no autocomplete, no schema browser, and a font you don't like. If your only practice has been in your own IDE, the first ten minutes of the real interview will feel hostile in a way that costs you points. Run a few problems in a plain text editor with a timer before the loop.
Data modeling interview questions
Whiteboard schema-design questions. Grain, SCD types, and the architecture tradeoffs that separate senior from mid.
Design a star schema for an e-commerce analytics warehouse.
Pin the grain first and say it out loud: one row per order line item. Fact columns are the numeric measures plus the foreign keys: order_id, product_id, customer_id, store_id, date_id, quantity, unit_price, discount, total_amount. Dimensions cover product (category, brand), customer (segment, region), store (city, state), and a date dimension. The why for star over snowflake: analytical queries on columnar storage win from fewer joins and from dimension tables small enough that the engine broadcasts them. The why for star over OBT: shared dimensions stay consistent across multiple fact tables.
How would you handle a slowly changing dimension for customer addresses?
Walk through Type 1 (overwrite, lose history), Type 2 (new row with effective_from, effective_to, is_current), and Type 3 (current and previous columns side by side). For customer address, recommend Type 2, because the question downstream is 'what address did we ship to on the day of this order' and Type 1 makes that unanswerable after the move. The merge logic is the part interviewers want you to describe: compare incoming to is_current row, close out the old row by setting effective_to and is_current = false, insert a new row.
When would you use a data vault instead of a star schema?
Data vault splits the model into hubs (business keys), links (relationships), and satellites (descriptive attributes versioned over time). The case for it is mostly governance: multiple ingestion teams loading independently, full audit lineage, sources that change schema frequently. The tradeoff is that nobody wants to query it directly because every business question becomes a six-table join, so teams build a star-shaped business vault on top. In an interview, the right answer is usually 'star, unless the company is large enough to have a real EDW org and a compliance reason to track every change.'
Model a ride-sharing platform with riders, drivers, trips, and surge pricing.
Grain is one row per trip. Fact columns: trip_id, rider_id, driver_id, pickup_location_id, dropoff_location_id, start_ts, end_ts, distance_miles, base_fare, surge_multiplier, total_fare, tip. Dimensions: dim_rider, dim_driver as SCD2 to keep rating history at the time of the trip, dim_location (probably an H3 cell or zone, not raw lat/long), dim_date and dim_time_of_day. The interesting design call is location: model pickup_location_id and dropoff_location_id as two FKs to the same dim_location, not two separate dims.
What is the medallion architecture and when would you use it?
Three layers. Bronze is raw, append-only, schema-on-read, ideally still in its source format. Silver is cleaned, typed, deduplicated, with bad rows quarantined. Gold is business-ready and modeled, usually as star schemas. Each layer has its own ownership and quality contract, which is the whole point: a bug in silver doesn't require re-ingesting from the source. Medallion is the default on lakehouses like Databricks and Delta. It's not an alternative to star schemas, it's where star schemas live.
Pipeline architecture interview questions
System design scenarios for mid-level and above. Durability, idempotency, and the failure modes interviewers want you to name before being asked.
Design a pipeline that ingests clickstream data into a warehouse.
Start at the producer: the web app SDK shouldn't block on a network call, so events go to a local buffer that ships to Kafka. Kafka gives you durability, replay, and decoupling, which matters because anything else you design downstream can crash without losing data. A stream consumer (Flink, Spark Structured Streaming, or just a Kafka Connect sink for the unambitious version) validates schema, attaches an event_id and ingest_ts, and writes Parquet to S3 partitioned by date and hour. Every 15 minutes, a dbt run promotes new partitions into modeled tables. Dedup on (event_id, source) in the silver layer so producer retries don't double-count.
How would you make a pipeline idempotent?
Three workable patterns. Delete-and-insert the target partition before writing it (cheap for daily fact tables, terrible if the partition is large and the writer crashes mid-flight). MERGE on a stable natural key, which Snowflake and BigQuery both support natively. Write to a side table and atomically swap with a metadata-only operation (CTAS into a new table, then ALTER TABLE RENAME, or in dbt land, the insert_overwrite or table materialization). The principle to articulate: idempotency means you can re-run yesterday's job today and get the same answer, which is what makes backfills and on-call retries safe.
You get paged at 3am for an SLA breach. Walk me through the debug.
First stop is the orchestrator UI, find the failed task and read the exception. If it's a timeout, look at row counts for that partition versus yesterday's. If it's a NOT NULL constraint or a type cast error, the source changed shape, which is the most common cause in 2026. If it's an out-of-memory on the executor, somebody's join exploded. Cluster of slow tasks at once usually means the warehouse is throttled. The order to describe: triage, fix, validate, then write the post-incident. Don't restart blindly, don't mark resolved until row counts match a known good snapshot, and add the monitor you wish you'd had.
Batch vs streaming for a real-time analytics dashboard.
Push back on 'real-time' first. If the dashboard refreshes every five minutes, the right answer is micro-batch: Spark Structured Streaming with a one-minute trigger, or dbt running every five. You get most of the freshness, none of the watermark headaches, and the on-call story stays sane. If the requirement is sub-second (think fraud, ops dashboards, ad serving), you're in real streaming territory: Kafka, Flink, event-time processing, watermarks, and a serving layer that can handle millisecond reads (Druid, Pinot, or ClickHouse). The interviewer is checking whether you ask, not whether you pick.
How do you handle schema evolution in a data pipeline?
Two truths to lead with. First, additive changes (a new nullable column) should propagate automatically. Second, breaking changes (a renamed column, a type narrowing) should refuse to deploy. A schema registry like Confluent's for Avro or Protobuf enforces both at the producer side. Downstream, your bronze layer should keep the raw payload as a JSON blob so you can replay after writing a new mapping. The thing junior candidates miss: schema evolution is also a social problem, and a contract test in the producer's CI is worth more than any amount of pipeline-side validation.
A ten-week study plan
For someone who can already write a JOIN and a for-loop, targeting an onsite in roughly two and a half months. Adjust the timeline if you're starting from somewhere else.
- 01
SQL until it's automatic
Window functions, CTEs, the gaps-and-islands trick, MERGE/UPSERT syntax for at least one warehouse, and the difference between RANK / DENSE_RANK / ROW_NUMBER. Forty to fifty problems is a fine target if you actually solve them rather than skim. The benchmark to hit: a medium-difficulty problem from cold open to passing tests in under fifteen minutes, without using AI assist.
- 02
Python for messy data
Less algorithmic LeetCode, more parsing real things. JSON with optional nested fields. Logs that occasionally have a malformed line. A CSV where the column order changed in June. Twenty to thirty problems. If the job description doesn't mention Python (rare in 2026, but it happens at warehouse-only teams), trim this to one week and roll the saved time into modeling.
- 03
Data modeling on a whiteboard
Star, snowflake, data vault, OBT, medallion. For each, pick a domain (e-commerce, marketplace, payments, content platform) and model it out loud, defending your grain choices and SCD types as you go. Don't practice in your head. Practice with a real whiteboard or a Miro file, because the muscle memory of drawing a schema while explaining is a separate skill from knowing the answer.
- 04
Pipeline architecture
Batch versus streaming, exactly-once, idempotency, orchestration on Airflow or Dagster, failure modes you've actually seen. The interviewer wants the why behind each box you draw. Why Kafka here and not S3 with a notification. Why Spark and not a single-node pandas job. Why a daily dbt run and not a real-time materialized view. If you can articulate two reasons against your own pick, you'll grade well.
- 05
Mocks, weaknesses, talking out loud
Three timed mocks: one SQL, one Python, one design. Find someone to run them with, even another candidate, because the part most people fail isn't the technical answer, it's narrating it without going silent for two minutes. Whatever you tripped on in the mock is your next two days of work.
Common questions
What is the most important topic for data engineering interviews?+
How many data engineering interview questions should I practice?+
Do I need to know Spark or Airflow for data engineering interviews?+
What is the typical structure of a data engineering interview loop?+
How long should I study before applying to data engineering roles?+
Solve the questions, don't just read them
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition
Related Guides
The fifty most common SQL prompts with worked answers and the failure modes interviewers fish for.
The parsing and dedup patterns that come up in DE Python rounds. Not LeetCode.
Walks the entire loop end to end: recruiter screen through onsite, what each round actually grades.