Free Resource

Data Engineer Interview Questions and Answers PDF

A free PDF with 100 of the most asked data engineer interview questions and answers, organized by domain: SQL (40), Python (25), data modeling (20), and system design (15). Every question has a worked answer with the reasoning, not just the solution. Sourced from 1,042 real interview reports collected on DataDriven from 2024 to 2026, plus internal mock interview data. Free to download, no email required for the on-page version. Updated April 2026.

The Short Answer
The full PDF is available below. The on-page version is the same content as the PDF, organized for easy scanning. Use the on-page version to study; download the PDF for offline review on your phone or for printing. The questions are tagged by company, level (L3 to L6), and pattern. Each answer includes the gotcha most candidates miss. Pair this with the full data engineer interview playbook for round-by-round prep context.
Updated April 2026 · By The DataDriven Team

What's Inside the PDF

100 questions, organized by domain. Each question has a worked answer with the reasoning, the common wrong answer, and the follow-up the interviewer will ask.

Section | Question Count | Domains Covered
SQL | 40 | Joins, GROUP BY, window functions, CTEs, gap-and-island, recursive queries, optimization
Python | 25 | Data wrangling, JSON parsing, deduplication, sessionization, generators, OOP basics, pandas
Data Modeling | 20 | Star schema, SCD Type 1/2/3, fact tables, conformed dimensions, medallion architecture
System Design | 15 | Streaming pipelines, batch ETL, CDC, exactly-once, schema evolution, backfills
Behavioral (bonus) | 10 | STAR-D answers for impact, conflict, ambiguity, failure, leadership

How the Questions Are Sourced and Tagged

Every question in the PDF maps to at least three reported interview loops in our dataset. Tags include: company (when attributable), seniority level (L3, L4, L5, L6), and pattern (e.g., "deduplication", "gap-and-island", "exactly-once semantics"). The tag legend is on page 2 of the PDF.

We exclude questions that appear in a single loop (too noisy) and questions that any L3 candidate could answer in 30 seconds (they don't differentiate). The 100 questions in the PDF are the ones that consistently differentiate L4 candidates from L5 candidates across the dataset.

10 Sample Questions From the PDF

Below are 10 of the 100 questions, with abbreviated answers. The full PDF includes 4-step worked solutions for each, plus the typical follow-up.

SQL · L4

Find users active for 3+ consecutive days

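Gap-and-island pattern. Deduplicate to one row per user per day first, otherwise repeated events on the same day break the streak key. ROW_NUMBER per user ordered by date, subtracted from the date itself, gives a per-streak grouping key. GROUP BY user and grouping key, HAVING COUNT(*) >= 3. Walk through with a 5-row example.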
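
A minimal sketch of the pattern, run through Python's built-in sqlite3 module so it is self-contained. The table name, column names, and sample rows are hypothetical, and SQLite's julianday stands in for whatever date arithmetic your warehouse offers. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical activity log: (user_id, activity_date).
rows = [
    ("u1", "2026-01-01"),
    ("u1", "2026-01-02"),
    ("u1", "2026-01-02"),  # second event on the same day
    ("u1", "2026-01-03"),
    ("u1", "2026-01-05"),
    ("u2", "2026-01-01"),
    ("u2", "2026-01-03"),
    ("u2", "2026-01-04"),
]

QUERY = """
WITH days AS (
    -- one row per user per day, so duplicates do not distort ROW_NUMBER
    SELECT DISTINCT user_id, activity_date FROM activity
),
keyed AS (
    SELECT
        user_id,
        activity_date,
        -- day number minus row number is constant within a consecutive run
        CAST(julianday(activity_date) AS INTEGER)
            - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date)
            AS streak_key
    FROM days
)
SELECT user_id, MIN(activity_date) AS streak_start, COUNT(*) AS streak_days
FROM keyed
GROUP BY user_id, streak_key
HAVING COUNT(*) >= 3
ORDER BY user_id, streak_start
"""

def find_streaks(activity_rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE activity (user_id TEXT, activity_date TEXT)")
    conn.executemany("INSERT INTO activity VALUES (?, ?)", activity_rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

streaks = find_streaks(rows)
print(streaks)
```

With this sample data only u1 qualifies: the run from January 1 to January 3 counts as three distinct days, while u2's longest run is two days.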
SQL · L4

Calculate month-over-month revenue growth percentage

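DATE_TRUNC to month grain, SUM, then LAG for previous month. (current - previous) / NULLIF(previous, 0) * 100. NULLIF prevents division by zero when the previous month's revenue is 0. The first month has no previous row, so LAG returns NULL and the growth is NULL. Volunteer both edge cases.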
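
The calculation can be sketched with Python's sqlite3 module. SQLite has no DATE_TRUNC, so strftime('%Y-%m', ...) is used here as a stand-in for the month truncation; the orders table and its rows are hypothetical. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical orders: (order_date, amount). March has zero revenue on purpose.
orders = [
    ("2026-01-05", 100.0),
    ("2026-01-20", 100.0),
    ("2026-02-11", 250.0),
    ("2026-03-02", 0.0),
    ("2026-04-09", 50.0),
]

QUERY = """
WITH monthly AS (
    SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS revenue
    FROM orders
    GROUP BY strftime('%Y-%m', order_date)
),
with_prev AS (
    SELECT month, revenue, LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly
)
SELECT
    month,
    revenue,
    -- NULL for the first month (no previous row) and when prev_revenue is 0
    (revenue - prev_revenue) / NULLIF(prev_revenue, 0) * 100 AS growth_pct
FROM with_prev
ORDER BY month
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
growth = conn.execute(QUERY).fetchall()
conn.close()

for month, revenue, growth_pct in growth:
    print(month, revenue, growth_pct)
```

January and April both come back with a NULL growth figure, for different reasons: January has no prior month, and April follows a month with zero revenue.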
SQL · L5

Top 3 products by revenue per category, handling ties

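DENSE_RANK, not RANK or ROW_NUMBER. ROW_NUMBER breaks ties arbitrarily and drops tied products. RANK keeps ties but skips the ranks after them, so a tie can push the third-highest revenue value out of the top 3. DENSE_RANK keeps tied products and doesn't skip ranks. Compute the rank in a CTE or subquery and filter WHERE rank <= 3 in the outer query, because a window function can't be referenced in the WHERE clause of the same SELECT. Explain why RANK and ROW_NUMBER fall short before settling on DENSE_RANK.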
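
A small runnable illustration using Python's sqlite3 module. The product_revenue table and its rows are made up for the example; the rank is computed in a CTE and filtered in the outer query.

```python
import sqlite3

# Hypothetical pre-aggregated revenue: (category, product, revenue).
sales = [
    ("toys", "robot", 500),
    ("toys", "kite", 500),   # tie for first place
    ("toys", "puzzle", 300),
    ("toys", "ball", 200),
    ("toys", "yo-yo", 100),
    ("books", "atlas", 90),
    ("books", "novel", 70),
]

QUERY = """
WITH ranked AS (
    SELECT
        category,
        product,
        revenue,
        DENSE_RANK() OVER (
            PARTITION BY category ORDER BY revenue DESC
        ) AS revenue_rank
    FROM product_revenue
)
SELECT category, product, revenue, revenue_rank
FROM ranked
WHERE revenue_rank <= 3
ORDER BY category, revenue_rank, product
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_revenue (category TEXT, product TEXT, revenue INTEGER)"
)
conn.executemany("INSERT INTO product_revenue VALUES (?, ?, ?)", sales)
top3 = conn.execute(QUERY).fetchall()
conn.close()

for row in top3:
    print(row)
```

In the toys category the tie at 500 means four products are returned for three distinct revenue levels. Swapping in RANK would assign 1, 1, 3, 4 and drop the 200 row.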
Python · L4

Flatten a nested JSON into one level

Recurse over keys. Concatenate parent keys with separator. Decide upfront how to handle lists: explode into rows or serialize. State the decision out loud.
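
One possible shape for the recursion, assuming the "serialize lists" choice and a dot separator. The function name, separator, and sample record are illustrative; exploding lists into rows would need a different return type.

```python
import json

def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Design choice for this sketch: lists are serialized to a JSON string
    rather than exploded into rows. Empty dicts are kept as-is so the key
    does not silently disappear.
    """
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = json.dumps(value)
        else:
            flat[full_key] = value
    return flat

record = {
    "id": 7,
    "user": {"name": "Ada", "address": {"city": "Paris"}},
    "tags": ["a", "b"],
    "meta": {},
}
flat_record = flatten(record)
print(flat_record)
```

A follow-up worth anticipating is key collisions: an input that already contains a key such as "user.name" would clash with the flattened path, which is one reason to state the separator choice up front.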
Python · L5

Sessionize events with a 30-min inactivity gap

Sort by user_id and ts. Walk the list. Increment session_id when gap exceeds threshold OR when user changes. State edge case: events with same timestamp.
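
The walk described above can be sketched in a few lines. This version assumes events are (user_id, timestamp) tuples and that a gap of exactly 30 minutes stays in the same session; both are choices to state to the interviewer, not fixed rules.

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a session_id to each (user_id, ts) event.

    A new session starts when the user changes or when the time since the
    user's previous event strictly exceeds `gap`. Events with identical
    timestamps have a gap of zero and therefore share a session.
    """
    out = []
    session_id = 0
    prev_user, prev_ts = None, None
    for user_id, ts in sorted(events):
        if user_id != prev_user or ts - prev_ts > gap:
            session_id += 1
        out.append((user_id, ts, session_id))
        prev_user, prev_ts = user_id, ts
    return out

events = [
    ("u1", datetime(2026, 1, 1, 9, 0)),
    ("u1", datetime(2026, 1, 1, 9, 10)),
    ("u1", datetime(2026, 1, 1, 9, 40)),   # exactly 30 min later: same session
    ("u1", datetime(2026, 1, 1, 10, 11)),  # 31 min later: new session
    ("u2", datetime(2026, 1, 1, 9, 5)),    # new user: new session
]
sessions = sessionize(events)
session_ids = [sid for _, _, sid in sessions]
print(session_ids)
```

Session ids here are global counters. If the interviewer wants per-user numbering, reset the counter when the user changes instead of incrementing it.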
Modeling · L4

Design a star schema for an e-commerce platform

Grain first: one row per order line item. Fact: order_item_id, order_id, product_id, customer_id, date_id, quantity, unit_price, total. Dims: customer (Type 1 + Type 2 split), product, date. Volunteer 3 trade-offs.
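
A compact sketch of that schema as SQLite DDL driven from Python. The fact columns follow the list above; the table names, dimension attributes, and sample rows are assumptions for illustration, and the customer dimension is simplified to a single table without the Type 1 / Type 2 split.

```python
import sqlite3

DDL = """
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, calendar_date TEXT);

-- Grain: one row per order line item.
CREATE TABLE fact_order_item (
    order_item_id INTEGER PRIMARY KEY,
    order_id      INTEGER NOT NULL,
    product_id    INTEGER NOT NULL REFERENCES dim_product(product_id),
    customer_id   INTEGER NOT NULL REFERENCES dim_customer(customer_id),
    date_id       INTEGER NOT NULL REFERENCES dim_date(date_id),
    quantity      INTEGER NOT NULL,
    unit_price    REAL NOT NULL,
    total         REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(10, "robot", "toys"), (11, "atlas", "books")],
)
conn.execute("INSERT INTO dim_date VALUES (20260105, '2026-01-05')")
conn.executemany(
    "INSERT INTO fact_order_item VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 100, 10, 1, 20260105, 2, 25.0, 50.0),
        (2, 100, 11, 1, 20260105, 1, 30.0, 30.0),
    ],
)

# A typical star-schema query: aggregate the fact, slice by a dimension attribute.
revenue_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.total) AS revenue
    FROM fact_order_item AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
conn.close()
print(revenue_by_category)
```

Storing total alongside quantity and unit_price is itself one of the trade-offs worth naming: it speeds up aggregation at the cost of a derived column that must be kept consistent.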
Modeling · L5

Implement SCD Type 2 for a customer dimension

Surrogate key, natural key, valid_from, valid_to, is_current. On update: expire the current row (set valid_to, clear is_current) and insert a new row with a new surrogate key; the natural key stays the same. Fact tables join on surrogate key for point-in-time correctness.
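
The update rule can be illustrated in memory with plain Python. This is a sketch under assumptions: the dimension is a list of dicts, validity intervals are half-open [valid_from, valid_to), and 9999-12-31 is used as the open-ended sentinel. In a warehouse the same logic is usually expressed as a MERGE or an update followed by an insert.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "no end date yet"

def apply_change(dim, customer_id, attrs, effective):
    """Apply one SCD Type 2 change and return the surrogate key now in effect."""
    current = next(
        (r for r in dim if r["customer_id"] == customer_id and r["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == attrs:
            return current["customer_sk"]  # nothing changed, keep current version
        current["valid_to"] = effective    # expire the current row
        current["is_current"] = False
    new_sk = max((r["customer_sk"] for r in dim), default=0) + 1
    dim.append({
        "customer_sk": new_sk,       # surrogate key: new for every version
        "customer_id": customer_id,  # natural key: stable across versions
        "attrs": attrs,
        "valid_from": effective,
        "valid_to": OPEN_END,
        "is_current": True,
    })
    return new_sk

def sk_as_of(dim, customer_id, as_of):
    """Point-in-time lookup: which version was valid on `as_of`?"""
    for r in dim:
        if r["customer_id"] == customer_id and r["valid_from"] <= as_of < r["valid_to"]:
            return r["customer_sk"]
    return None

dim = []
apply_change(dim, "C42", {"city": "Paris"}, date(2026, 1, 1))
apply_change(dim, "C42", {"city": "Berlin"}, date(2026, 3, 1))
print(sk_as_of(dim, "C42", date(2026, 2, 15)), sk_as_of(dim, "C42", date(2026, 3, 1)))
```

A fact row loaded for February 15 would carry surrogate key 1 and keep pointing at the Paris version after the customer moves, which is the point-in-time behavior the answer refers to.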
Design · L5

Build a real-time clickstream pipeline at 200K events/sec

Kafka (100 partitions, key=user_id) -> Flink (stateful, exactly-once, sessionize) -> S3 + Materialize. Hourly Spark batch to Snowflake. Cover failure modes: TaskManager crash, duplicate events, late data >24h.
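
Two pieces of that answer lend themselves to a quick back-of-the-envelope sketch: the per-partition load implied by the partition count, and deduplication of replayed events. The code below is plain Python, not Kafka or Flink code. zlib.crc32 is only a stand-in for the client's real partitioner, and the event_id field is a hypothetical unique identifier.

```python
import zlib

NUM_PARTITIONS = 100
TARGET_EVENTS_PER_SEC = 200_000

def partition_for(user_id, num_partitions=NUM_PARTITIONS):
    """Stable key-to-partition mapping, so one user's events stay ordered."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

# Average load per partition if keys are evenly spread; hot users skew this.
per_partition_rate = TARGET_EVENTS_PER_SEC / NUM_PARTITIONS

def dedupe(events, seen_ids):
    """Drop events whose event_id was already processed.

    `seen_ids` plays the role of operator state. A real job would need to
    bound this state, for example by expiring old ids.
    """
    fresh = []
    for event in events:
        if event["event_id"] not in seen_ids:
            seen_ids.add(event["event_id"])
            fresh.append(event)
    return fresh

seen = set()
batch = [
    {"event_id": "e1", "user_id": "u1"},
    {"event_id": "e2", "user_id": "u2"},
    {"event_id": "e1", "user_id": "u1"},  # duplicate delivery
]
unique_events = dedupe(batch, seen)
print(per_partition_rate, [e["event_id"] for e in unique_events])
```

Stating the per-partition figure out loud gives the interviewer a concrete number to push on, and it sets up the hot-key discussion naturally.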
Design · L5

Daily reconciliation pipeline for a payments company

Postgres -> Debezium CDC -> Kafka -> S3 raw -> idempotent Spark daily ETL with run_id -> Snowflake. MERGE on (txn_id, run_id), with run_id derived deterministically from the business date so that reruns produce identical output. Audit by source_event_id.
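
The idempotency property can be demonstrated on a small scale with SQLite's upsert, which plays the role of MERGE here. This sketch assumes run_id is the business date, so a rerun for the same day reuses the same key; the table name and rows are hypothetical. Upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3

def load_day(conn, run_id, txns):
    """Upsert one day's transactions keyed on (txn_id, run_id)."""
    conn.executemany(
        """
        INSERT INTO reconciled_txn (txn_id, run_id, amount)
        VALUES (?, ?, ?)
        ON CONFLICT (txn_id, run_id) DO UPDATE SET amount = excluded.amount
        """,
        [(txn_id, run_id, amount) for txn_id, amount in txns],
    )
    conn.commit()

def snapshot(conn):
    return conn.execute(
        "SELECT txn_id, run_id, amount FROM reconciled_txn ORDER BY txn_id, run_id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reconciled_txn (
        txn_id TEXT,
        run_id TEXT,
        amount REAL,
        PRIMARY KEY (txn_id, run_id)
    )
    """
)

run_id = "2026-04-01"  # assumption: run_id is derived from the business date
batch = [("t1", 10.0), ("t2", 25.5)]

load_day(conn, run_id, batch)
first_run = snapshot(conn)
load_day(conn, run_id, batch)  # rerun of the same day
second_run = snapshot(conn)
conn.close()
print(first_run == second_run, len(second_run))
```

If run_id were instead a fresh identifier per execution, each rerun would insert a new set of rows under a new key, which is the failure mode to call out when discussing this design.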
Behavioral · L5

Tell me about a project with measurable impact

STAR-D format: Situation, Task, Action, Result, Decision postmortem. Specific numbers required. End with one thing you'd do differently. The postmortem is the L5 signal.

How to Use the PDF for Effective Prep

1. Tag your weakest domain first

Open the PDF, scan the section headers, and identify the domain you're least confident in. Drill that section first while your energy is high.

2. Practice the answers out loud

Reading is passive and can trick you into thinking you know the answer. Speak each answer to a wall, a phone recorder, or a study partner. Out-loud answers catch the silences that kill live coding rounds.

3. Map questions to the relevant round guide

Every PDF question corresponds to one of the eight rounds in the loop. Pair each question with the matching deep guide: the SQL questions map to the SQL interview round walkthrough, the design questions to the data pipeline system design interview prep guide, and so on.

4. Drill the company-specific variants

After the generic question bank, open the relevant company guide: Stripe Data Engineer interview process and questions, Airbnb Data Engineer interview process and questions, Netflix Data Engineer interview process and questions, and so on. Company guides cover the questions that show up specifically in that loop.

5. Run the patterns in the sandbox

Reading the answer is not enough. Open our in-browser SQL or Python sandbox and re-implement each answer from scratch. The muscle memory of typing the solution is what makes you fast under interview pressure.

Data Engineer Interview Prep FAQ

Is the PDF really free? Do I have to give an email?
Yes, the PDF is free and downloadable directly from the button above with no email required. We do offer an email-only newsletter for monthly updates and bonus question packs, but it's optional. The questions and answers in the on-page version are identical to the PDF.
How often is the PDF updated?
Quarterly. The current version is dated April 2026. We update when new question patterns emerge from our interview-report dataset, when companies change their loops materially, or when reader feedback identifies a missing topic. Old versions stay available for reference.
Are these real interview questions?
The questions are drawn from real reported interview loops, but they are paraphrased and de-identified to protect the candidates who shared them. Direct quotes from copyrighted question banks are not included.
How accurate are the answers?
Every answer is reviewed by a senior data engineer with FAANG or Series A startup interview experience. Where there are multiple correct approaches, the PDF lists the trade-offs rather than picking one. Where an answer depends on company context, the PDF says so.
Can I share or redistribute the PDF?
Yes, with attribution. The PDF is licensed Creative Commons Attribution. Bootcamps, study groups, and individual candidates can share freely as long as the source link is preserved.
Does this PDF cover analytics engineer interviews too?
Partially. About 60% of the SQL and modeling content overlaps with analytics engineer interviews. For analytics-engineer-specific prep, see our analytics engineer interview guide which covers dbt, semantic layers, and BI workflow questions.
Is there a video walkthrough of the answers?
Not yet. Video walkthroughs of the top 25 questions are on our roadmap for late 2026. The PDF and on-page versions are the canonical source for now.
What level should I be to use this PDF?
L3 candidates will find the SQL and Python sections most useful. L4 to L5 candidates should focus on the modeling and design sections. L6 candidates use the PDF as a checklist to confirm coverage; the L6 bar is more about depth than breadth.

Practice the Questions in the Browser

Reading the answers is the first step. Run the SQL, write the Python, and design the systems in our in-browser sandbox to build the muscle memory that gets you the offer.


More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
