Data Engineer Interview Questions and Answers
One hundred data engineering interview questions with worked answers. Your search asked for a PDF, but something live and runnable is more useful than a stale download. Browse them here, run them in the sandbox, or read the open-source repo on GitHub.
On the PDF
You searched for a PDF. There isn't one, and the honest reason is that a static PDF of interview questions ages badly. The questions interviewers ask shift, the dialect specifics shift, the right answer to "should I learn dbt first or Airflow" shifts. A page that's updated when those things change is more useful than a snapshot you downloaded in March.
The same content lives in three places that are better than a PDF would be:
- This site, where you can read the questions and run the SQL and Python with live scoring in the same tab. See the top 100 page.
- An open-source GitHub repo if you want to read offline or print your own version: datadriven-io/data-engineering-interview-questions. The repo is updated quarterly. Clone it, send it to your printer, and you have your PDF.
- A longer study guide with a weekly schedule, self-assessment rubric, and journal template: data-engineering-interview-handbook.
What's in the question set
One hundred questions, weighted by how often each one shows up in real loops. Each has a worked answer and the typical interviewer follow-up.
| Section | Count | Topics |
|---|---|---|
| SQL | 40 | Joins, GROUP BY, window functions, CTEs, gap-and-island, recursive queries |
| Python | 25 | Wrangling, JSON parsing, dedup, sessionization, pandas, generators |
| Data Modeling | 20 | Star schema, SCD Type 1/2/3, conformed dimensions, medallion architecture |
| System Design | 15 | Streaming, batch ETL, CDC, exactly-once, schema evolution, backfills |
| Behavioral | 5 | STAR answers for impact, conflict, ambiguity, failure, leadership |
Ten sample questions
Ten of the hundred, with the answer tight enough to fit in the few sentences you'd actually use in the interview.
Find users active for three or more consecutive days
Month-over-month revenue growth percentage
Top three products by revenue per category, handling ties
Flatten a nested JSON into one level
Sessionize events with a 30-minute inactivity gap
Star schema for an e-commerce platform
SCD Type 2 for a customer dimension
Clickstream pipeline at 200K events per second
Daily reconciliation pipeline for a payments company
A project with measurable impact
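As an illustration of the sessionization question above, here is a minimal sketch in plain Python. It is not the site's worked answer. The `sessionize` function, the `(user_id, timestamp)` event shape, and the choice to start a new session only when the gap is strictly greater than 30 minutes are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a per-user session number to (user_id, timestamp) events.

    Assumption: a new session starts when the time since the user's
    previous event is strictly greater than `gap`.
    """
    last_seen = {}   # user_id -> timestamp of that user's previous event
    session_no = {}  # user_id -> current session number for that user
    out = []
    # Sort by user, then time, so each user's events are visited in order.
    for user_id, ts in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev > gap:
            session_no[user_id] = session_no.get(user_id, 0) + 1
        last_seen[user_id] = ts
        out.append((user_id, ts, session_no[user_id]))
    return out

# Hypothetical sample data for illustration only.
events = [
    ("u1", datetime(2024, 1, 1, 9, 0)),
    ("u1", datetime(2024, 1, 1, 9, 20)),
    ("u1", datetime(2024, 1, 1, 10, 5)),
    ("u2", datetime(2024, 1, 1, 9, 10)),
]
sessions = sessionize(events)
for row in sessions:
    print(row)
```

The same shape in SQL is usually a `LAG` over the user's events followed by a running sum of "new session" flags. The loop above does both steps in one pass.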
How to use the set
- 01
Start with your weakest domain
Scan the headers, pick the domain you're least confident in, drill that one first while your energy is high. The domain you avoid is the one that takes you down.
- 02
Practice the answers out loud
Silent reading produces a sense of familiarity that doesn't carry over to verbal explanation under pressure. Say each answer to a wall or a phone recorder. The gaps that come out as hesitation in live rounds show up here first.
- 03
Pair questions with the round guide
SQL questions go with the SQL interview round walkthrough, design questions with the data pipeline system design interview prep guide, and modeling questions with the schema design interview walkthrough. The round guides explain what the interviewer is actually scoring; the questions are the practice reps.
- 04
Drill the company variants if you have a target
After the generic set, open the relevant company guide: Stripe Data Engineer interview process and questions, Airbnb Data Engineer interview process and questions, Netflix Data Engineer interview process and questions, Databricks Data Engineer interview process and questions. Company guides cover the questions that show up in that specific loop and where the round structure deviates.
- 05
Run the patterns, don't just read them
Re-implementing each answer from scratch in the in-browser sandbox builds the typing fluency a live round demands. Reading is recognition; writing is recall, and only recall transfers.
Data engineer interview prep FAQ
Why is the URL 'pdf' if there isn't one?
Is everything free?
How often is the question set updated?
Are these from real interviews?
How accurate are the answers?
Can I share the GitHub repo?
Does this cover analytics engineering interviews?
What level should I be?
Run the questions, don't just print them
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
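Two of the shapes named above, dedup and top-N-per-group, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the site's reference answer. The function names, the dict-per-row layout, and the column names (`id`, `category`, `revenue`, `updated`) are hypothetical, and ties are kept by dense rank, which is one reasonable reading of "handling ties".

```python
from collections import defaultdict

def dedup_latest(rows, key, order_by):
    """Keep one row per `key`: the row with the greatest `order_by` value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

def top_n_per_group(rows, group, value, n):
    """Return rows whose dense rank by `value` (descending) within
    `group` is at most n. Rows tied on `value` are all kept."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group]].append(row)
    out = []
    for members in by_group.values():
        # The n largest distinct values define the dense-rank cutoff.
        keep = set(sorted({r[value] for r in members}, reverse=True)[:n])
        ranked = sorted(members, key=lambda r: r[value], reverse=True)
        out.extend(r for r in ranked if r[value] in keep)
    return out

# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "category": "a", "revenue": 10, "updated": 1},
    {"id": 1, "category": "a", "revenue": 12, "updated": 2},
    {"id": 2, "category": "a", "revenue": 12, "updated": 1},
    {"id": 3, "category": "a", "revenue": 5, "updated": 1},
    {"id": 4, "category": "b", "revenue": 7, "updated": 1},
]
deduped = dedup_latest(rows, key="id", order_by="updated")
top = top_n_per_group(deduped, group="category", value="revenue", n=1)
print([r["id"] for r in top])
```

In SQL the same two shapes are typically written with `ROW_NUMBER()` for dedup and `DENSE_RANK()` for top-N with ties, each partitioned by the grouping column.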
Where to read next
More data engineer interview prep guides
The 50 most frequently asked data engineer interview questions, with worked answers.
100 of the most asked data engineer interview questions across all four domains.
Real questions from Meta, Amazon, Apple, Netflix, and Google Data Engineer loops, with answers.
Real take-home prompts from Stripe, Airbnb, Databricks, with annotated example solutions.
Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.