Data Engineer Interview Questions and Answers

Q: Why is the URL 'pdf' if there isn't one?

Because that's how people search for it. Naming the page after the search query is honest; offering a stale PDF would be the dishonest version. The repo on GitHub is the closest analogue if you want something to print or read offline.

Q: Is everything free?

Yes. The questions, the worked answers, the sandbox, the GitHub repo. There's a sign-in to track progress across devices but no paywall.

Q: How often is the question set updated?

Quarterly. Each update folds in new question patterns, changes to company loop structures, and missing topics flagged by reader feedback.

Q: Are these from real interviews?

Yes, paraphrased and de-identified to protect the candidates who shared them. Direct quotes from copyrighted question sets are not included.

Q: How accurate are the answers?

Where multiple correct approaches exist, the worked answer names the tradeoffs rather than picking one. Where the answer depends on company context (dialect, scale, latency SLA), the worked answer says so explicitly.

Q: Can I share the GitHub repo?

Yes, with attribution. The repo is licensed Creative Commons Attribution. Bootcamps, study groups, and individual candidates can share it freely as long as the source link is preserved.

Q: Does this cover analytics engineering interviews?

Partially. About sixty percent overlaps. For analytics-engineer-specific prep, see the analytics engineer guide for dbt, semantic layers, and BI workflow questions.

Q: What level should I be?

L3 and L4 candidates get the most out of the SQL and Python sections. L5 candidates should weight modeling and design. L6 candidates can use the set as a checklist to confirm coverage; at L6 the bar is more about depth than breadth.

One hundred data engineering interview questions with worked answers. Your search asked for a PDF; the honest answer is that something live and runnable is better than a stale download. Browse the questions here, run them in the sandbox, or read the open-source repo on GitHub.

On the PDF

You searched for a PDF. There isn't one, and the honest reason is that a static PDF of interview questions ages badly. The questions interviewers ask shift, the dialect specifics shift, the right answer to "should I learn dbt first or Airflow" shifts. A page that's updated when those things change is more useful than a snapshot you downloaded in March.

The same content lives in three places that are better than a PDF would be:

This site, where you can read the questions and run the SQL and Python with live scoring in the same tab. See the top 100 page.
An open-source GitHub repo if you want to read offline or print your own version: datadriven-io/data-engineering-interview-questions. The repo is updated quarterly. Clone it, print it, and you have your PDF.
A longer study guide with a weekly schedule, self-assessment rubric, and journal template: data-engineering-interview-handbook.

What's in the question set

One hundred technical questions across four domains, plus five behavioral prompts, weighted by how often each one shows up in real loops. Each has a worked answer and the typical interviewer follow-up.

Section	Count	Topics
SQL	40	Joins, GROUP BY, window functions, CTEs, gap-and-island, recursive queries
Python	25	Wrangling, JSON parsing, dedup, sessionization, pandas, generators
Data Modeling	20	Star schema, SCD Type 1/2/3, conformed dimensions, medallion architecture
System Design	15	Streaming, batch ETL, CDC, exactly-once, schema evolution, backfills
Behavioral	5	STAR answers for impact, conflict, ambiguity, failure, leadership

Ten sample questions

Ten of the hundred, with the answer tight enough to fit in the few sentences you'd actually use in the interview.

SQL · L4

Find users active for three or more consecutive days

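Gap-and-island. Deduplicate to one row per user per day first, because repeated dates break the arithmetic. ROW_NUMBER per user ordered by date, subtracted from the date itself, gives a per-streak grouping key: consecutive days share the same key. GROUP BY user and grouping key, HAVING COUNT >= 3. The first time you see the trick it looks like magic; after that it's reflex.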
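A minimal sketch of the gap-and-island pattern, run through Python's built-in sqlite3 module so it executes anywhere. The table name activity, its columns, and the helper users_with_streak are assumptions made for illustration, and julianday is SQLite's way of turning a date into a number; a warehouse dialect would use its own date arithmetic. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

def users_with_streak(rows, min_days=3):
    """rows: iterable of (user_id, 'YYYY-MM-DD'). Returns sorted user_ids
    that were active on at least min_days consecutive days."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE activity (user_id INTEGER, activity_date TEXT)")
    conn.executemany("INSERT INTO activity VALUES (?, ?)", rows)
    query = """
        WITH days AS (
            -- one row per user per day, so ROW_NUMBER advances once per date
            SELECT DISTINCT user_id, activity_date FROM activity
        ),
        keyed AS (
            SELECT user_id,
                   CAST(julianday(activity_date) AS INTEGER)
                     - ROW_NUMBER() OVER (
                         PARTITION BY user_id ORDER BY activity_date
                       ) AS streak_key  -- constant within a run of consecutive days
            FROM days
        )
        SELECT DISTINCT user_id
        FROM keyed
        GROUP BY user_id, streak_key
        HAVING COUNT(*) >= ?
        ORDER BY user_id
    """
    result = [row[0] for row in conn.execute(query, (min_days,))]
    conn.close()
    return result
```

The DISTINCT in the first CTE is the step that is easy to forget: without it, two events on the same day get different row numbers and split one streak into two.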

SQL · L4

Month-over-month revenue growth percentage

DATE_TRUNC to month grain, SUM, LAG for the previous month. (current - previous) / NULLIF(previous, 0) * 100. NULLIF prevents division by zero when the previous month's revenue is 0; the first month comes out NULL on its own because LAG has no prior row. Volunteering both cases is the senior signal.
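The calculation can be sketched with sqlite3 as follows. SQLite has no DATE_TRUNC, so strftime('%Y-%m', ...) stands in for it here; the orders table, its columns, and the mom_growth helper are illustrative assumptions. Note that LAG compares against the previous row, so a month with no orders at all would need a calendar spine, which this sketch leaves out.

```python
import sqlite3

def mom_growth(orders):
    """orders: iterable of ('YYYY-MM-DD', amount).
    Returns [(month, revenue, growth_pct_or_None), ...] ordered by month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    query = """
        WITH monthly AS (
            SELECT strftime('%Y-%m', order_date) AS month,
                   SUM(amount) AS revenue
            FROM orders
            GROUP BY strftime('%Y-%m', order_date)
        )
        SELECT month,
               revenue,
               (revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
                 / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) AS growth_pct
        FROM monthly
        ORDER BY month
    """
    result = list(conn.execute(query))
    conn.close()
    return result
```

Growth is NULL (None in Python) in two distinct situations: the first month, where LAG returns NULL, and any month that follows a zero-revenue month, where NULLIF turns the divisor into NULL.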

SQL · L5

Top three products by revenue per category, handling ties

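DENSE_RANK, not RANK or ROW_NUMBER. ROW_NUMBER breaks ties arbitrarily, RANK keeps ties but skips the ranks after them, and DENSE_RANK keeps tied products without skipping ranks. Compute the rank in a CTE or subquery and filter rk <= 3 in the outer query (or with QUALIFY where the dialect supports it), because a window function can't appear in WHERE. Name the difference between the three rank functions without being asked.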
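A short sketch of the ranking query, again through sqlite3. The product_revenue table, its columns, and the top_products helper are hypothetical and assume revenue has already been aggregated per product.

```python
import sqlite3

def top_products(rows, n=3):
    """rows: iterable of (category, product, revenue).
    Returns (category, product, revenue, rk) for every product whose
    dense rank within its category is at most n."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product_revenue (category TEXT, product TEXT, revenue INTEGER)"
    )
    conn.executemany("INSERT INTO product_revenue VALUES (?, ?, ?)", rows)
    query = """
        WITH ranked AS (
            SELECT category, product, revenue,
                   DENSE_RANK() OVER (
                       PARTITION BY category ORDER BY revenue DESC
                   ) AS rk
            FROM product_revenue
        )
        -- the rank filter lives in the outer query, not next to the window
        SELECT category, product, revenue, rk
        FROM ranked
        WHERE rk <= ?
        ORDER BY category, rk, product
    """
    result = list(conn.execute(query, (n,)))
    conn.close()
    return result
```

With a tie for first place, a category can return more than three rows. That is the intended behavior of "top three, handling ties" and worth saying before the interviewer asks.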

Python · L4

Flatten a nested JSON into one level

Recurse over keys, concatenate parent keys with a separator. Decide upfront whether lists explode into rows or get serialized, and say so out loud.
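One possible shape for the recursion, as a sketch. This version takes the "serialize" branch of the list decision by storing lists as JSON strings; exploding lists into rows would be an equally valid choice. The function name flatten and the default "." separator are illustrative.

```python
import json

def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with joined keys.
    Lists are serialized to JSON strings rather than exploded into rows."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict) and value:
            # recurse into non-empty dicts, carrying the joined prefix down
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = json.dumps(value)
        else:
            # scalars, None, and empty dicts are kept as-is
            flat[full_key] = value
    return flat
```

For example, flatten({"user": {"id": 7, "geo": {"city": "Oslo"}}}) yields the keys user.id and user.geo.city. A key that already contains the separator would collide with a nested path, which is a reasonable follow-up to raise yourself.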

Python · L5

Sessionize events with a 30-minute inactivity gap

Sort by user_id and timestamp, walk the list. Increment session_id when the gap exceeds the threshold or when the user changes. Edge case worth volunteering: events sharing a timestamp.
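The walk described above can be sketched in vanilla Python. The (user_id, timestamp) tuple layout and the choice of globally increasing session ids are assumptions for illustration; a gap of exactly 30 minutes stays in the same session because the rule is "exceeds".

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """events: iterable of (user_id, datetime).
    Returns [(user_id, ts, session_id), ...] sorted by user and time."""
    out = []
    session_id = 0
    prev_user = prev_ts = None
    for user_id, ts in sorted(events):
        # a new session starts on the first event, on a user change,
        # or when the inactivity gap is strictly greater than the threshold
        if not out or user_id != prev_user or ts - prev_ts > gap:
            session_id += 1
        out.append((user_id, ts, session_id))
        prev_user, prev_ts = user_id, ts
    return out
```

Events that share a timestamp have a gap of zero, so they land in the same session; if their relative order matters downstream, a tiebreaker column such as an event id would have to join the sort key.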

Modeling · L4

Star schema for an e-commerce platform

Grain first: one row per order line item. Fact holds order_item_id, order_id, product_id, customer_id, date_id, quantity, unit_price, total. Dimensions for customer (Type 1 plus Type 2 split), product, date. State the grain before you draw anything.

Modeling · L5

SCD Type 2 for a customer dimension

Surrogate key, natural key, valid_from, valid_to, is_current. On update, expire the current row and insert a new row with a new surrogate. Facts join on the surrogate for point-in-time correctness.
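The expire-and-insert step can be illustrated with an in-memory dimension. Everything named here is an assumption made for the sketch: the dict layout, the helper names, the 9999-12-31 open-ended sentinel, and the half-open interval convention (valid_from inclusive, valid_to exclusive). A warehouse would do the same thing with a MERGE or an UPDATE plus INSERT.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel for "still current"

def apply_scd2(dim_rows, customer_id, attrs, effective):
    """Apply one change to the dimension in place; return the current surrogate key."""
    current = next(
        (r for r in dim_rows if r["customer_id"] == customer_id and r["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == attrs:
            return current["customer_sk"]  # no change, no new version
        # expire the current version at the moment the new one starts
        current["valid_to"] = effective
        current["is_current"] = False
    new_sk = max((r["customer_sk"] for r in dim_rows), default=0) + 1
    dim_rows.append({
        "customer_sk": new_sk, "customer_id": customer_id, "attrs": dict(attrs),
        "valid_from": effective, "valid_to": OPEN_END, "is_current": True,
    })
    return new_sk

def surrogate_as_of(dim_rows, customer_id, as_of):
    """Point-in-time lookup: the surrogate key a fact dated as_of should carry."""
    for r in dim_rows:
        if r["customer_id"] == customer_id and r["valid_from"] <= as_of < r["valid_to"]:
            return r["customer_sk"]
    return None
```

A fact loaded for a given date stores the result of surrogate_as_of, which is what makes later joins return the customer attributes as they were on that date rather than as they are now.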

Design · L5

Clickstream pipeline at 200K events per second

Kafka (100 partitions, key=user_id) → Flink (stateful, exactly-once, sessionize) → S3 plus Materialize. Hourly Spark batch to Snowflake. Be ready to name three failure modes: TaskManager crash, duplicate events, late data beyond the watermark.

Design · L5

Daily reconciliation pipeline for a payments company

Postgres → Debezium → Kafka → S3 raw → idempotent Spark with run_id → Snowflake MERGE on (txn_id, run_id). Reruns produce identical output. Audit by source_event_id. The boring problem nobody practices.
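The idempotency claim at the end of that chain can be demonstrated on a small scale. This sketch uses SQLite's INSERT ... ON CONFLICT upsert (SQLite 3.24 or newer) as a stand-in for a Snowflake MERGE; the reconciled table, its columns, and the helper names are hypothetical, and it assumes run_id is deterministic for a given logical day so that a rerun reuses the same key.

```python
import sqlite3

def make_target():
    """Create an in-memory target table keyed on (txn_id, run_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE reconciled (
               txn_id TEXT, run_id TEXT, amount REAL,
               PRIMARY KEY (txn_id, run_id))"""
    )
    return conn

def load_run(conn, run_id, txns):
    """Upsert one run's transactions; loading the same run twice is a no-op."""
    conn.executemany(
        """INSERT INTO reconciled (txn_id, run_id, amount) VALUES (?, ?, ?)
           ON CONFLICT (txn_id, run_id) DO UPDATE SET amount = excluded.amount""",
        [(txn_id, run_id, amount) for txn_id, amount in txns],
    )
    conn.commit()

def snapshot(conn):
    """Return the full table contents in a stable order for comparison."""
    return list(conn.execute("SELECT * FROM reconciled ORDER BY txn_id, run_id"))
```

Calling load_run twice with the same run_id and the same input leaves snapshot unchanged, which is the property an interviewer is probing when they ask what happens on a rerun.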

Behavioral · L5

A project with measurable impact

STAR plus a postmortem. Numbers required. Close with one thing you'd do differently. The postmortem is the senior signal; without it you sound mid-level.

How to use the set

01
Start with your weakest domain
Scan the headers, pick the domain you're least confident in, drill that one first while your energy is high. The domain you avoid is the one that takes you down.
02
Practice the answers out loud
Silent reading produces a sense of familiarity that doesn't carry over to verbal explanation under pressure. Say each answer to a wall or a phone recorder. The gaps that come out as hesitation in live rounds show up here first.
03
Pair questions with the round guide
SQL questions go with the SQL interview round walkthrough, design questions with the data pipeline system design interview prep, and modeling questions with the schema design interview walkthrough. The round guides explain what the interviewer is actually scoring; the questions are the practice reps.
04
Drill the company variants if you have a target
After the generic set, open the relevant company guide: Stripe Data Engineer interview process and questions, Airbnb Data Engineer interview process and questions, Netflix Data Engineer interview process and questions, Databricks Data Engineer interview process and questions. Company guides cover the questions that show up in that specific loop and where the round structure deviates.
05
Run the patterns, don't just read them
Re-implementing each answer from scratch in the in-browser sandbox builds the typing fluency a live round demands. Reading is recognition; writing is recall, and only recall transfers.

Run the questions, don't just print them

01
Active recall beats re-reading
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
03
System design is graded on the calls you defend out loud
Ingestion, batch vs streaming, the bronze/silver/gold layers, idempotency, backfill and replay. Sketching the pipeline and naming the failure modes is the signal, not the boxes.

Where to read next

More data engineer interview prep guides

Top 50 data engineer interview questions

The 50 most frequently asked data engineer interview questions, with worked answers.

Full top 100 Data Engineer interview questions list

100 of the most asked data engineer interview questions across all four domains.

Meta, Amazon, Apple, Netflix, Google Data Engineer questions

Real questions from Meta, Amazon, Apple, Netflix, and Google Data Engineer loops, with answers.

Data Engineer take-home walkthroughs

Real take-home prompts from Stripe, Airbnb, Databricks, with annotated example solutions.

SQL interview round walkthrough

Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.

Python data manipulation interview prep

JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.