Pipeline Architecture Mock Interview for Data Engineers
This mock covers the system design round of a data engineering loop. You get a vague scenario, build the pipeline on an interactive canvas, and then defend your tool picks for the rest of the session against an AI interviewer whose follow-ups are based on what you actually drew.
The Four Phases of the Mock
The session runs about forty-five minutes, split across the same four phases a real onsite uses. Follow-up questions in the discussion phase are generated from your canvas state, so two candidates with different designs will get different conversations.
Think
The opening prompt is intentionally under-specified, similar to what a hiring manager would say in a real round. Use this phase to ask clarifying questions about volume, latency, source systems, downstream consumers, and budget.
Design
Place ingestion, processing, storage, and serving components on the canvas and wire them together with the appropriate tool at each layer. The canvas state is captured continuously so the discussion phase can reference specific choices.
Discuss
Roughly fifteen minutes of follow-up questions generated from your design. Topics typically include why one tool was chosen over an alternative, what happens during partial failures, how the design handles unexpected schema changes upstream, and how it scales beyond the stated requirements.
Verdict
A summary judgment with a list of the specific design decisions that influenced it, followed by suggested topics to study based on the gaps that came up during the conversation.
ETL versus ELT
Almost every design round touches this, sometimes explicitly and sometimes through tool-choice questions. The expected answer for new pipelines is ELT, because warehouse compute became cheap and elastic and dbt made in-warehouse transformations reproducible. Designs that still benefit from in-flight transformation tend to involve PII that cannot be persisted in raw form for compliance reasons. Some mock prompts include those constraints to surface the trade-off.
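The compliance exception can be sketched as a masking step that runs before anything is written to storage. This is a minimal illustration, not a compliance recipe: the field names in `PII_FIELDS`, the `mask_record` helper, and the choice of a keyed SHA-256 hash are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical field names; a real pipeline would take these from a data catalog.
PII_FIELDS = {"email", "phone"}


def mask_record(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by keyed hashes."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS and value is not None:
            # A keyed hash keeps the value joinable across records
            # without persisting the raw value.
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            masked[key] = digest.hexdigest()
        else:
            masked[key] = value
    return masked
```

Because the hash is deterministic for a given secret, the masked column still supports joins and distinct counts downstream, which is usually the follow-up question once in-flight masking is on the table.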
Related topics: ETL vs ELT deep dive (/etl-vs-elt), Pipeline architecture patterns (/pipeline/architecture), dbt interview questions (/tools/dbt-interview-questions), Snowflake interview questions (/tools/snowflake-interview-questions)
Batch versus Streaming
Streaming costs more than batch in operations, on-call burden, debugging, and infrastructure. Part of what this section of the interview tests is whether the candidate examines the latency requirement instead of accepting it at face value. A dashboard that refreshes every five minutes does not require streaming; a micro-batch trigger or a scheduled dbt run is sufficient. True streaming (Kafka with a stream processor like Flink, using event-time processing and watermarks) is justified when sub-second latency directly affects revenue or safety, such as fraud scoring or ad serving.
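For readers less familiar with the terms, event-time processing with a watermark can be sketched in plain Python. This is a conceptual model rather than Flink code: the `tumbling_counts` function, the 60-second window, and the 30-second lateness allowance are assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, window_s=60, allowed_lateness_s=30):
    """Count events per event-time window, dropping events behind the watermark.

    events: iterable of (event_time_s, key) in arrival order.
    Returns (counts, dropped), where counts maps (window_start, key) -> count.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, key in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # Assumed watermark policy: highest event time seen minus allowed lateness.
        watermark = max_event_time - allowed_lateness_s
        window_start = (event_time // window_s) * window_s
        window_end = window_start + window_s
        if window_end <= watermark:
            # The window is already considered closed, so the event is too late.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped
```

An event is assigned to a window by when it happened, not when it arrived, and the watermark decides how long a window waits for stragglers. That waiting period is the latency-versus-completeness trade-off a follow-up question is likely to probe.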
Spark and PySpark
How much Spark appears in any given interview is determined by the company. Lakehouse teams tend to spend a significant portion of the round on Spark internals, while Snowflake-with-dbt teams rarely touch it. When Spark questions appear, the focus is usually on the execution model: whether the candidate can explain what their code is actually doing at the stage and task level, not whether they can write a DataFrame chain. Databricks rounds layer Delta Lake, Unity Catalog, and the Photon engine on top of those fundamentals.
Kafka, Airflow, dbt
Most modern pipeline interviews touch all three. Kafka is justified when durable replay across multiple consumer groups is genuinely required; for occasional batched event delivery there are cheaper options. Airflow is justified when the number of interrelated jobs and the complexity of their dependencies have outgrown a cron file. dbt is the dominant choice for SQL-based transformation in the warehouse for the same reasons that ELT became dominant. The evaluation reflects the reasoning behind each tool choice rather than the choice itself.
Recurring Scenarios
A small set of scenarios recurs across companies. Real-time fraud detection focuses the discussion on latency budgets and model serving. Clickstream analytics raises sessionization and event retention. ML feature pipelines surface training-serving skew. CDC replication into a lakehouse leads to questions about handling upstream schema changes. The mock samples from these patterns and also generates novel prompts so candidates do not end up memorizing specific scenarios.
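Sessionization, the core of the clickstream scenario, can be sketched as a gap-based grouping. The `sessionize` function and the 30-minute inactivity threshold below are assumptions for illustration; the right gap depends on the product.

```python
def sessionize(events, gap_s=1800):
    """Assign a per-user session index using an inactivity gap.

    events: list of (user_id, timestamp_s) in any order.
    Returns a list of (user_id, timestamp_s, session_idx) sorted by user, then time.
    """
    out = []
    last_seen = {}
    session_idx = {}
    for user, ts in sorted(events):
        if user not in last_seen:
            session_idx[user] = 0
        elif ts - last_seen[user] > gap_s:
            # Inactivity longer than the gap starts a new session.
            session_idx[user] += 1
        last_seen[user] = ts
        out.append((user, ts, session_idx[user]))
    return out
```

The sketch sorts the full input first, which only works in batch. In a streaming design the same logic needs per-user state and a policy for late events, and that difference is a common follow-up.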
Topic Coverage
Which topics appear in a given session depends on the prompt and the seniority level selected. Fraud-detection prompts at senior levels surface streaming and Kafka questions; mid-level analytics prompts surface ELT and Airflow.
ETL versus ELT
Where the transformation work happens. In current pipelines that work happens inside the warehouse for almost every team, with the exception of pipelines that need to mask or filter data before it lands for compliance reasons. Most interviews ask both for the default and the exception.
Batch versus streaming
The first design decision. Streaming is meaningfully more expensive both in dollars and in operational overhead, so a substantial part of the question is whether the latency requirement actually justifies it. Many requirements stated as real-time turn out to mean refresh every few minutes.
Spark and PySpark
Less about API recall, more about whether you can describe how your job actually executes. Stage boundaries, shuffles, broadcast joins, the cost of a Python UDF. At more senior levels the discussion turns toward jobs that have failed in production and the steps taken to fix them.
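One way to explain why a broadcast join avoids a shuffle is to model it without Spark at all. The sketch below is plain Python, not the Spark API: `broadcast_hash_join` and the list-of-lists representation of partitions are assumptions made for the illustration.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a copy of the small side.

    large_partitions: list of partitions, each a list of dict rows.
    small_rows: list of dict rows small enough to hold in memory.
    """
    # Build the lookup once; in a real engine a copy is shipped to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        # Rows never leave their original partition, so no shuffle is needed.
        joined_partitions.append(joined)
    return joined_partitions
```

The large side stays where it is and only the small table moves. A shuffle join, by contrast, repartitions both sides by key, which is where the stage boundary and most of the cost come from.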
Kafka
Partitioning, consumer groups, offsets, the trade-offs between at-least-once and exactly-once delivery. The follow-up that often gets missed: when Kafka is the wrong tool, given that the operational overhead is significant for smaller workloads.
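The at-least-once trade-off can be sketched as a simulation: commit the offset only after the write succeeds, and make the write idempotent so that a replay is harmless. This does not use a Kafka client; the `consume` function, the list standing in for a partition log, and the `crash_after` fault-injection parameter are all hypothetical.

```python
def consume(log, committed_offset, sink, crash_after=None):
    """Process records from committed_offset onward, committing after each write.

    log: list of (key, value) records; the list index stands in for the offset.
    sink: dict keyed by record key, so replays overwrite instead of duplicating.
    crash_after: hypothetical fault injection; stop after writing this offset
        but before committing it.
    Returns the new committed offset.
    """
    for offset in range(committed_offset, len(log)):
        key, value = log[offset]
        sink[key] = value  # idempotent upsert
        if crash_after is not None and offset == crash_after:
            # Simulated crash: the write happened, the commit did not.
            return committed_offset
        committed_offset = offset + 1
    return committed_offset
```

After a crash the consumer resumes from the last committed offset and reprocesses one record. Because the sink is keyed, the duplicate delivery does not produce a duplicate row, which is often enough to avoid the extra machinery of exactly-once delivery.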
Airflow
DAG design, retries that are actually safe to retry, backfills that need to coordinate with an upstream that may also be catching up. Most of the evaluation reflects whether the candidate has run Airflow long enough to have opinions about it.
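A retry that is safe to retry usually means the task replaces its output for a given run date instead of appending to it. The sketch below models that with a dict standing in for a partitioned table. It does not use the Airflow API, and `load_partition` and `append_partition` are hypothetical helpers.

```python
def load_partition(table, run_date, rows):
    """Idempotent load: replace the partition for run_date."""
    table[run_date] = list(rows)
    return table


def append_partition(table, run_date, rows):
    """Non-idempotent load: every retry adds the same rows again."""
    table.setdefault(run_date, []).extend(rows)
    return table
```

Running `load_partition` twice for the same date leaves one copy of the data, while running `append_partition` twice leaves two. The same property makes backfills safe, since a backfill is the same task re-run over a range of dates.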
End-to-end pipeline design
A fuzzy scenario that requires decomposition into ingestion, processing, storage, and serving layers, with a defensible tool at each. The scenarios that recur include real-time fraud detection, clickstream analytics, ML feature stores, and CDC into a lakehouse.
Notes on Preparation
Reading about pipeline architecture provides a useful baseline but does not transfer directly to interview performance. The session tests the ability to defend a design twenty minutes into the conversation, when fatigue and prior commitments to specific tools start to constrain reasoning. That skill primarily develops through practice in a similar format.
The diagram is the easier part. Producing a reasonable initial design is within reach for most candidates with some experience. The discussion phase is where outcomes diverge, particularly on questions like sustained Kafka consumer lag or an unexpected upstream schema change. Those questions require reasoning during the session rather than recall.
Cost discussion separates levels. Junior responses tend to select tools without elaboration. Senior responses include explanations of why a less expensive alternative would have worked. Staff responses add an approximate cost comparison and an opinion on whether the operational overhead is justified. The follow-up depth in the mock matches the seniority level selected at the start.
Data Pipeline Architecture Interview Questions FAQ
Do interviewers expect ETL or ELT?
What kinds of PySpark questions should I expect?
What kinds of Kafka questions should I expect?
How would you define data pipeline architecture in a sentence?
How does the mock work in practice?
What about Airflow and dbt specifically?
Is it free?
- 01 Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02 76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03 Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
Related Pipeline and Tools Resources
Six pattern families with worked scenarios
Framework for the DE system design round
Where transformation happens and why it matters
The first fork in every pipeline design
DataFrames, UDFs, partitioning, performance
Execution model, shuffles, partitioning
Topics, partitions, consumer groups, replay
DAG design, scheduling, backfill, operators
Models, tests, materializations, incremental
Delta Lake, Unity Catalog, lakehouse
Operational depth: debugging, backfill, schema drift
SQL, Python, and data modeling problems