Pipeline Architecture Mock Interview for Data Engineers

This mock simulates the system design round of a data engineering loop. You get a vague scenario, build the pipeline on an interactive canvas, and then spend the rest of the session defending your tool choices to an AI interviewer whose follow-ups are based on what you actually drew.

The Four Phases of the Mock

The session runs about forty-five minutes, split across the same four phases a real onsite uses. Follow-up questions in the discussion phase are generated from your canvas state, so two candidates with different designs will get different conversations.

Think

The opening prompt is intentionally under-specified, similar to what a hiring manager would say in a real round. The phase consists of clarifying questions about volume, latency, source systems, downstream consumers, and budget.

Design

Place ingestion, processing, storage, and serving components on the canvas and wire them together with the appropriate tool at each layer. The canvas state is captured continuously so the discussion phase can reference specific choices.

Discuss

Roughly fifteen minutes of follow-up questions generated from your design. Topics typically include why one tool was chosen over an alternative, what happens during partial failures, how the design handles unexpected schema changes upstream, and how it scales beyond the stated requirements.

Verdict

A summary judgment with a list of the specific design decisions that influenced it, followed by suggested topics to study based on the gaps that came up during the conversation.

ETL versus ELT

Almost every design round touches this, sometimes explicitly and sometimes through tool-choice questions. The expected answer for new pipelines is ELT, because warehouse compute became cheap and elastic and dbt made in-warehouse transformations reproducible. Designs that still benefit from in-flight transformation tend to involve PII that cannot be persisted in raw form for compliance reasons. Some mock prompts include those constraints to surface the trade-off.
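
The compliance exception can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical set of PII field names and a keyed hash as the masking method; neither comes from a specific regulation or tool, and a real pipeline would likely drive both from a data catalog and a managed secret.

```python
import hashlib
import hmac

# Hypothetical field names, chosen only for this example.
PII_FIELDS = {"email", "phone"}


def mask_record(record: dict, secret: bytes) -> dict:
    """Return a copy of `record` with PII fields replaced by keyed hashes."""
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS and value is not None:
            # HMAC-SHA256 keeps the value joinable without storing it in raw form.
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            masked[key] = digest.hexdigest()
        else:
            masked[key] = value
    return masked


raw_event = {"user_id": 42, "email": "a@example.com", "amount": 19.99}
# The key below is a placeholder; it is not suitable for production use.
landed = mask_record(raw_event, secret=b"demo-key-not-for-production")
```

Only `landed` would be written to the warehouse. The masking step is the in-flight "T" that moves ahead of the "L"; every other transformation can still run in the warehouse afterwards.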

Related topics: ETL vs ELT deep dive (/etl-vs-elt), Pipeline architecture patterns (/pipeline/architecture), dbt interview questions (/tools/dbt-interview-questions), Snowflake interview questions (/tools/snowflake-interview-questions)

Batch versus Streaming

Streaming costs more than batch across operations, on-call burden, debugging, and infrastructure. Part of what this section of the interview tests is whether the candidate examines the latency requirement instead of accepting it at face value. A dashboard that refreshes every five minutes does not require streaming; a micro-batch trigger or a scheduled dbt run is sufficient. True streaming (Kafka with a stream processor like Flink, using event-time processing and watermarks) is justified when sub-second latency directly affects revenue or safety, such as fraud scoring or ad serving.
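
To make "event-time processing and watermarks" concrete, here is a small pure-Python sketch of a tumbling-window counter. It is not Flink's API; the class name, the 60-second window, and the 30-second allowed lateness are assumptions chosen for the example. The watermark trails the largest event time seen so far, and a window is emitted once the watermark passes its end.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling window size (assumed)
ALLOWED_LATENESS = 30    # how far the watermark trails the max event time (assumed)


class TumblingWindowCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> event count
        self.max_event_time = 0
        self.late_events = 0

    def process(self, event_time: int) -> list:
        """Ingest one event; return (window_start, count) for windows that closed."""
        window_start = event_time - event_time % WINDOW_SECONDS
        watermark_before = self.max_event_time - ALLOWED_LATENESS
        if window_start + WINDOW_SECONDS <= watermark_before:
            # The window was already emitted, so this event is too late to count.
            self.late_events += 1
            return []
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - ALLOWED_LATENESS
        closed = sorted(s for s in self.open_windows if s + WINDOW_SECONDS <= watermark)
        return [(s, self.open_windows.pop(s)) for s in closed]


counter = TumblingWindowCounter()
emitted = []
# Event times in seconds; 40 arrives out of order, 10 arrives after its window closed.
for t in [5, 50, 70, 40, 95, 10, 160]:
    emitted.extend(counter.process(t))
```

The out-of-order event at 40 is still counted because the watermark has not yet passed its window, while the event at 10 is dropped as late. That bookkeeping is part of the debugging and operational cost a scheduled batch job avoids.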

Spark and PySpark

How much Spark appears in any given interview is determined by the company. Lakehouse teams tend to spend a significant portion of the round on Spark internals, while Snowflake-with-dbt teams rarely touch it. When Spark questions appear, the focus is usually on the execution model: whether the candidate can explain what their code is actually doing at the stage and task level, not whether they can write a DataFrame chain. Databricks rounds layer Delta Lake, Unity Catalog, and the Photon engine on top of those fundamentals.
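
One piece of the execution model that tends to come up is lazy evaluation: transformations only describe work, and nothing runs until an action is called. The sketch below is a pure-Python analogy, not Spark code; `LazyDataset` and its methods are hypothetical names that loosely mirror the transformation/action split.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, in order

    def map(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._rows, self._plan + (("filter", fn),))

    def collect(self):
        """The action: execute every recorded step in order."""
        out = list(self._rows)
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(r) for r in out]
            else:
                out = [r for r in out if fn(r)]
        return out


calls = []


def double(x):
    calls.append(x)  # side effect that reveals when the function actually runs
    return x * 2


pipeline = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # still empty: only a plan exists so far
result = pipeline.collect()        # the work happens here
```

Being able to say where the plan ends and the execution begins is the kind of explanation the round is looking for, more than the chain of calls itself.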

Kafka, Airflow, dbt

Most modern pipeline interviews touch all three. Kafka is justified when durable replay across multiple consumer groups is genuinely required; for occasional batched event delivery there are cheaper options. Airflow is justified when the number of interrelated jobs and the complexity of their dependencies have outgrown a cron file. dbt is the dominant choice for SQL-based transformation in the warehouse for the same reasons that ELT became dominant. The evaluation reflects the reasoning behind each tool choice rather than the choice itself.
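
"Durable replay across multiple consumer groups" can be illustrated with a toy append-only log. This is a sketch of the idea only, not the Kafka client API; `EventLog`, `poll`, and `seek` are hypothetical names, and offsets are committed automatically on poll for brevity.

```python
class EventLog:
    """Append-only log; each consumer group tracks its own read position."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # group name -> next offset to read

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group: str, max_records: int = 10) -> list:
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)  # auto-commit, for brevity
        return batch

    def seek(self, group: str, offset: int) -> None:
        self.offsets[group] = offset


log = EventLog()
for name in ["view", "chat", "ad_impression"]:
    log.append(name)

dashboard_batch = log.poll("dashboard")
billing_batch = log.poll("billing")   # unaffected by what the dashboard group read
log.seek("billing", 0)                # rewind one group, e.g. after fixing a bug
billing_replay = log.poll("billing")
```

If a design has only one consumer and never needs to rewind, this property goes unused, which is the opening for the "cheaper options" argument.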

Live Viewers, Live Billing

> We run a live video platform where creators broadcast to thousands of viewers at once. The product team wants real-time viewer counts and chat activity for creators, and the ads team needs accurate impression data for billing. Design a data pipeline for our livestream events.

Recurring Scenarios

A small set of scenarios recurs across companies. Real-time fraud detection focuses the discussion on latency budgets and model serving. Clickstream analytics raises sessionization and event retention. ML feature pipelines surface training-serving skew. CDC replication into a lakehouse leads to questions about handling upstream schema changes. The mock samples from these patterns and also generates novel prompts so candidates do not end up memorizing specific scenarios.
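
Sessionization, raised by the clickstream scenario, is usually defined by an inactivity gap. The following is a minimal in-memory sketch, assuming a 30-minute timeout and events shaped as `(user_id, timestamp_seconds)` tuples; both are illustrative choices, and a warehouse implementation would typically express the same logic with window functions.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(events):
    """Group (user_id, timestamp_seconds) events into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, timestamps in by_user.items():
        timestamps.sort()
        user_sessions = [[timestamps[0]]]
        for ts in timestamps[1:]:
            if ts - user_sessions[-1][-1] > SESSION_GAP_SECONDS:
                user_sessions.append([ts])    # gap exceeded: start a new session
            else:
                user_sessions[-1].append(ts)  # still within the current session
        sessions[user_id] = user_sessions
    return sessions


clicks = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000), ("u1", 4100)]
result = sessionize(clicks)
```

Here `u1` ends up with two sessions because 3,400 seconds pass between the second and third click.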

Topic Coverage

Which topics appear in a given session depends on the prompt and the seniority level selected. Fraud-detection prompts at senior levels surface streaming and Kafka questions; mid-level analytics prompts surface ELT and Airflow.

Most rounds

ETL versus ELT

Where the transformation work happens. In current pipelines that work happens inside the warehouse for almost every team, with the exception of pipelines that need to mask or filter data before it lands for compliance reasons. Most interviews ask for both the default and the exception.

Most rounds

Batch versus streaming

The first design decision. Streaming is meaningfully more expensive both in dollars and in operational overhead, so a substantial part of the question is whether the latency requirement actually justifies it. Many requirements stated as real-time turn out to mean refresh every few minutes.

At companies running Spark

Spark and PySpark

Less about API recall, more about whether you can describe how your job actually executes. Stage boundaries, shuffles, broadcast joins, the cost of a Python UDF. At more senior levels the discussion turns toward jobs that have failed in production and the steps taken to fix them.
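
A broadcast join is easier to explain with a picture of what each task does. The sketch below simulates it in plain Python rather than Spark: the fact table is a list of partitions, and the small dimension table is copied in full to every partition so no rows need to move between them. The function name and sample data are assumptions made for the example.

```python
def broadcast_join(fact_partitions, dim_rows, key):
    """Inner-join each fact partition against a full copy of the small dimension."""
    lookup = {row[key]: row for row in dim_rows}  # the "broadcast" copy
    joined_partitions = []
    for partition in fact_partitions:  # each partition joins locally, no shuffle
        joined = [
            {**fact, **lookup[fact[key]]}
            for fact in partition
            if fact[key] in lookup
        ]
        joined_partitions.append(joined)
    return joined_partitions


facts = [
    [{"country": "US", "views": 10}, {"country": "DE", "views": 4}],
    [{"country": "US", "views": 7}, {"country": "XX", "views": 1}],
]
countries = [{"country": "US", "region": "NA"}, {"country": "DE", "region": "EU"}]
result = broadcast_join(facts, countries, key="country")
```

The partition layout of `facts` is unchanged after the join, which is the point: a shuffle join would first have to regroup both sides by `country`.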

When the design includes streaming

Kafka

Partitioning, consumer groups, offsets, the trade-offs between at-least-once and exactly-once delivery. The follow-up that often gets missed: when Kafka is the wrong tool, given that the operational overhead is significant for smaller workloads.
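
A common way to live with at-least-once delivery is to make the consumer idempotent. The sketch below assumes every event carries a unique `event_id` (a hypothetical field) and keeps the set of processed IDs in memory; a production version would need to store that set and the result atomically in durable storage.

```python
class IdempotentSink:
    """Applies each event at most once by remembering processed event IDs."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def apply(self, event: dict) -> bool:
        if event["event_id"] in self.seen_ids:
            return False  # duplicate from a redelivery; skip it
        self.seen_ids.add(event["event_id"])
        self.total += event["amount"]
        return True


# The second delivery of event "b" simulates a retry after a lost acknowledgement.
deliveries = [
    {"event_id": "a", "amount": 5},
    {"event_id": "b", "amount": 3},
    {"event_id": "b", "amount": 3},
    {"event_id": "c", "amount": 2},
]
sink = IdempotentSink()
applied = [sink.apply(e) for e in deliveries]
```

The total comes out the same as if each event had been delivered exactly once, without asking the broker for exactly-once guarantees.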

When orchestration is involved

Airflow

DAG design, tasks that are actually safe to retry, backfills that need to coordinate with an upstream that may also be catching up. Most of the evaluation reflects whether the candidate has run Airflow long enough to have opinions about it.
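
What makes a task safe to retry is usually idempotence: running it again for the same logical date leaves the target in the same state. The sketch below contrasts a partition overwrite with a plain append, using a dictionary as a stand-in for a warehouse table. It does not use Airflow itself, and the function names are hypothetical.

```python
def load_partition(warehouse: dict, run_date: str, rows: list) -> None:
    """Replace the whole partition for `run_date`, so reruns do not duplicate rows."""
    warehouse[run_date] = list(rows)


def append_rows(warehouse: dict, run_date: str, rows: list) -> None:
    """Non-idempotent counterpart: every retry appends the same rows again."""
    warehouse.setdefault(run_date, []).extend(rows)


rows = [{"order_id": 1}, {"order_id": 2}]

safe, unsafe = {}, {}
for _attempt in range(3):  # one run plus two retries of the same task
    load_partition(safe, "2024-01-01", rows)
    append_rows(unsafe, "2024-01-01", rows)
```

After three attempts the overwritten partition still holds two rows, while the appended one holds six. The same property is what lets a backfill rerun a date range without a cleanup step.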

The umbrella question

End-to-end pipeline design

A fuzzy scenario that requires decomposition into ingestion, processing, storage, and serving layers, with a defensible tool at each. The scenarios that recur include real-time fraud detection, clickstream analytics, ML feature stores, and CDC into a lakehouse.

Notes on Preparation

Reading about pipeline architecture provides a useful baseline but does not transfer directly to interview performance. The session tests the ability to defend a design twenty minutes into the conversation, when fatigue and prior commitments to specific tools start to constrain reasoning. That skill primarily develops through practice in a similar format.

The diagram is the easier part. Producing a reasonable initial design is within reach for most candidates with some experience. The discussion phase is where outcomes diverge, particularly on questions like sustained Kafka consumer lag or an unexpected upstream schema change. Those questions require reasoning during the session rather than recall.
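
For the schema-change question, it helps to have a concrete picture of what "detecting drift" means. The sketch below compares one incoming record against an expected contract and reports missing, added, and retyped columns. The contract and the record are hypothetical, and what to do with the report (fail the load, quarantine the record, evolve the table) is the part the interviewer will want reasoned out.

```python
# Hypothetical contract for an orders feed.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def diff_schema(record: dict, expected: dict = EXPECTED_SCHEMA) -> dict:
    """Compare one record against the expected column names and types."""
    missing = sorted(set(expected) - set(record))
    added = sorted(set(record) - set(expected))
    retyped = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "added": added, "retyped": retyped}


# Upstream dropped `currency`, added `coupon`, and now sends `order_id` as a string.
drifted = {"order_id": "1001", "amount": 12.5, "coupon": "SPRING"}
report = diff_schema(drifted)
```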

Cost discussion separates levels. Junior responses tend to select tools without elaboration. Senior responses include explanations of why a less expensive alternative would have worked. Staff responses add an approximate cost comparison and an opinion on whether the operational overhead is justified. The follow-up depth in the mock matches the seniority level selected at the start.

Data Pipeline Architecture Interview Questions FAQ

Do interviewers expect ETL or ELT?
Either is acceptable if you can defend it. In practice almost every new warehouse pipeline written in the last few years lands raw data first and runs the transformations as dbt models. Storage on Snowflake or BigQuery is cheap and the warehouse compute scales without much effort, so the old reason to filter before landing (saving warehouse load) has mostly evaporated. The case to do otherwise is usually regulatory: a column has PII you are not allowed to write to long-term storage, so it gets masked before it hits the warehouse.
What kinds of PySpark questions should I expect?
Usually two layers. The first is whether you understand that nothing actually runs until you call an action. From there interviewers move into specifics: why coalesce can quietly reduce parallelism, when broadcasting a small dimension avoids a shuffle, what makes a Python UDF expensive (the row-by-row trip across the JVM boundary is the answer). At more senior loops the conversation drifts toward the Spark UI: given a job that takes twice as long as it should, where would you look first.
What kinds of Kafka questions should I expect?
An interviewer who already accepts Kafka tends to start with the basics (topics, partitions, consumer groups, offsets) and then move to whatever your design implies. If you put Kafka in your diagram, you should be ready to justify it against SQS, Kinesis, or a simple batched S3 read, all of which are cheaper to operate. If you claimed exactly-once, expect questions about the broker-side cost. Anyone who has run Kafka in production has a war story about consumer group rebalancing, and that story tends to land well.
How would you define data pipeline architecture in a sentence?
The end-to-end design that moves data from where it is produced to where it is consumed, including the ingestion layer, the processing layer, the storage layer, and the serving layer, plus the orchestration that runs on top. Most architecture interview questions are some variation on choosing the tool at each layer and explaining why.
How does the mock work in practice?
After selecting Pipeline Architecture as the domain and setting a level and company tier, a design prompt appears. You ask the AI for clarifications, build the pipeline on the canvas, and then move into a discussion phase that lasts about fifteen minutes. The questions in that phase respond to what you drew rather than running from a script. At the end you get a verdict and a summary of which exchanges decided it.
What about Airflow and dbt specifically?
Airflow questions usually reward operational experience over feature recall: whether your tasks are safe to retry, what you do when a backfill collides with an upstream that is also catching up, how you handle a DAG that needs to be cleared without orphaning downstream state. dbt questions are tighter in scope: refs, incremental materializations, snapshot strategies, where in CI your tests actually run.
Is it free?
Yes. There is no subscription and there are no paid features.

Why practice

  1. Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
