Pipeline Architecture Interview Questions for Data Engineers

Pipeline architecture is the system design round for data engineers. You get a vague scenario (ten million events a day, fifteen-minute SLA), ask clarifying questions, sketch the end-to-end pipeline, and defend the tool picks while the interviewer keeps adding constraints. It's the round most senior loops are decided on.

How the Mock Simulation Runs

Four phases, in order, same as a real onsite. Think, design, discuss, verdict. The AI interviewer is opinionated and will push back, which is the whole point.

Think (5 min)

A deliberately vague prompt. Your job is to ask the questions a real engineer would ask before touching a whiteboard: volume, freshness SLA, available sources, and what the downstream consumers need. The AI interviewer answers in the same fuzzy way a hiring manager would.

Design (20 min)

Build the pipeline on the canvas. Ingestion, processing, storage, serving. Pick the tool for each box and wire the flows. Your design state is captured continuously, so the next phase has something concrete to push on.

Discuss (15 min)

The interviewer starts pushing on your tool picks one at a time. Why Kafka and not SQS. What breaks if a worker dies mid-batch. What you do when the source schema changes overnight. The exchange is iterative, not a single Q-and-A, the same way a real onsite goes.

Verdict (2 min)

Hire or no-hire, with the specific design choices that tipped it. The feedback is concrete (the missing dead-letter queue, the over-provisioned cluster, the unaccounted-for late-arriving data) so the next attempt has somewhere to start.

The Six Pattern Families That Show Up in Pipeline Architecture (PA) Rounds

Almost every PA prompt is a recombination of these six. Pipeline shape, storage, distributed compute, batch versus streaming, reliability, and incremental loading. The frequencies are from our corpus of real interview debriefs.

1. Pipeline Design Fundamentals (Medium / Around 7 in 10 PA rounds)
End-to-end design from source to serving. The check is whether you decompose a fuzzy requirement into the four layers (ingest, transform, store, serve) and pick a defensible tool at each. Almost every PA scenario is a variant of this shape.
Topics: source system integration patterns (CDC, API polling, event streams); ingestion layer design (push vs pull, full vs incremental); transformation strategy (ETL vs ELT, when each is appropriate); serving layer selection (warehouse, feature store, API cache); end-to-end latency estimation and SLA definition.

2. Storage Architecture (Medium-Hard / Roughly 6 in 10 PA rounds)
Where data lives decides what queries are cheap, what queries are slow, and how painful a schema change becomes. The honest answer to most prompts is 'lakehouse on Parquet or Iceberg,' but the interview is about why, not what.
Topics: data lake vs data warehouse vs lakehouse trade-offs; file format selection (Parquet, Avro, ORC, Delta, Iceberg); partitioning and clustering strategies for query performance; hot/warm/cold storage tiering and retention policies; schema-on-read vs schema-on-write and when each applies.

3. Spark Deep Dive (Hard / About 4 in 10 PA rounds)
If the company runs Spark in production (Databricks, Netflix, Uber, most petabyte-scale lakehouses), expect at least one Spark question that goes deeper than DataFrame syntax. Execution model, shuffles, partitioning, the skew question.
Topics: Spark execution model (driver, executors, stages, tasks); shuffle operations and why they are expensive; partitioning strategies and data skew mitigation; broadcast joins vs sort-merge joins (when to use each); memory management and spill-to-disk behavior.

4. Batch vs Streaming (Medium-Hard / About 6 to 7 of every 10 PA rounds)
The first fork. The candidates who pass push back on 'real-time' before they pick a tool, because most 'real-time' requirements turn out to be a five-minute refresh that a micro-batch will satisfy. The candidates who fail jump straight to Kafka.
Topics: when batch is the right choice (and why choosing it shows maturity); streaming cost model (3-10x batch for same data volume); Lambda vs Kappa architecture trade-offs; micro-batch as a middle ground (Spark Structured Streaming); exactly-once vs at-least-once delivery guarantees.

5. Reliability and Fault Tolerance (Hard / About half of PA rounds)
The questions that separate someone who has built pipelines from someone who has read about them. Idempotency, dead letter queues, replay, exactly-once semantics, what happens on the third retry. The wrong answer is anything that starts with 'we'd just rerun it.'
Topics: idempotent pipeline design (safe to re-run without side effects); exactly-once semantics in distributed systems; dead letter queues and poison message handling; backfill strategies for historical data reprocessing; circuit breaker patterns for upstream dependency failures.

6. Incremental Loading (Medium-Hard / About 4 to 5 of every 10 PA rounds)
Full reloads work until the dataset crosses some threshold and then they don't. The questions push on how you handle late-arriving rows, soft deletes you didn't know about, source schema drift, and the merge semantics that keep your output correct on the second run.
Topics: Change Data Capture (CDC) patterns and tools; watermark-based incremental processing; handling late-arriving and out-of-order data; merge (upsert) strategies for slowly changing sources; schema evolution and backward/forward compatibility.
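
Families 5 and 6 meet in one small pattern: a watermark decides which rows to read, and an upsert keeps the output correct when the same batch runs twice. The sketch below is one way to show that, using SQLite as a stand-in for the warehouse. The table names (target, watermark), the function name incremental_load, and the lateness parameter are all hypothetical, chosen only for illustration.

```python
import sqlite3


def incremental_load(conn, source_rows, lateness=0):
    """Upsert rows newer than the stored watermark. Safe to re-run.

    source_rows: iterable of dicts with id, value, updated_at (integer timestamp).
    lateness: how far behind the watermark to re-read, to catch late arrivals.
    Returns the number of rows picked up in this run.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS target "
        "(id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS watermark (k TEXT PRIMARY KEY, ts INTEGER)"
    )
    row = cur.execute("SELECT ts FROM watermark WHERE k = 'target'").fetchone()
    watermark = row[0] if row else 0

    # Read everything newer than (watermark - lateness).
    cutoff = watermark - lateness
    batch = [r for r in source_rows if r["updated_at"] > cutoff]

    for r in batch:
        # Upsert keyed on id: a re-run overwrites with the same values instead
        # of inserting duplicates, and an older row never clobbers a newer one.
        cur.execute(
            """
            INSERT INTO target (id, value, updated_at) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                value = excluded.value,
                updated_at = excluded.updated_at
            WHERE excluded.updated_at >= target.updated_at
            """,
            (r["id"], r["value"], r["updated_at"]),
        )

    if batch:
        new_watermark = max(watermark, max(r["updated_at"] for r in batch))
        cur.execute(
            "INSERT INTO watermark (k, ts) VALUES ('target', ?) "
            "ON CONFLICT(k) DO UPDATE SET ts = excluded.ts",
            (new_watermark,),
        )
    conn.commit()
    return len(batch)
```

With lateness=0, a row that shows up with a timestamp behind the watermark is silently skipped, which is exactly the late-arriving-data follow-up. Widening the window costs extra reads but stays correct because the write is an upsert.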

5 Pipeline Architecture Scenarios with Full Walkthroughs

Each scenario includes the interview prompt, key design decisions, trade-off analysis, and the follow-up questions the interviewer will ask.

Scenario 1 (Hard): Design a real-time fraud detection pipeline

Prompt: "A payments company processes 50,000 transactions per second. They need to flag potentially fraudulent transactions within 200ms. Design the end-to-end pipeline."

Key decisions: Streaming-first architecture (batch is disqualified by the 200ms SLA); Kafka for ingestion (high throughput, replay capability for model retraining); Flink or Spark Structured Streaming for feature computation; feature store (Redis or DynamoDB) for low-millisecond lookups; async enrichment pipeline for the model feedback loop.

Trade-offs: The push will be on false positives. Blocking a legitimate transaction loses revenue at the moment of sale. Not blocking real fraud destroys trust over months. A serious answer covers both the real-time scoring path and the human-in-the-loop review queue, and admits that the precision-recall trade-off is a business call, not an architecture one.

Follow-ups: What happens when Kafka consumer lag exceeds your SLA? How do you retrain the model without downtime? What if the feature store goes down?

Scenario 2 (Medium-Hard): Build a data warehouse for an e-commerce platform

Prompt: "An e-commerce company has 10M daily orders across 5 source systems (orders, inventory, customers, products, shipping). Build the warehouse architecture."

Key decisions: ELT pattern (land raw data first, transform in the warehouse); medallion architecture (bronze/silver/gold layers); star schema with conformed dimensions across business domains; dbt for transformation orchestration with data quality tests; Airflow for end-to-end pipeline scheduling with SLA monitoring.

Trade-offs: The push will be freshness against cost. Re-materializing every gold table every hour is expensive enough to notice on the bill. Daily is cheaper but analysts complain. The defensible answer is tiered: the three or four executive dashboards refresh hourly, everything else lands on the daily window. Be specific about the dollar gap between the two so the trade-off is real, not aesthetic.

Follow-ups: How do you handle late-arriving orders from the shipping system? What is your strategy for slowly changing product dimensions? How do you backfill 6 months of historical data without breaking production?

Scenario 3 (Hard): Design a clickstream analytics pipeline

Prompt: "A media company with 100M monthly active users needs to track every page view, click, and video play event for product analytics and personalization. Design the pipeline."

Key decisions: Event collection via CDN-edge SDK with client-side batching; Kafka with topic-per-event-type for flexible consumption; Spark Structured Streaming for sessionization and real-time aggregation; S3/GCS data lake with Iceberg for mutable analytics tables; BigQuery or Snowflake serving layer for analyst self-serve queries.

Trade-offs: Scale is the whole question. Treating all 100M monthly actives as daily actives (an upper bound), 20 events per user per day is 2 billion events per day, roughly 23K events per second sustained with 3 to 5x peak spikes during prime time. You size for peak, not average. The follow-up will be retention: storing every raw event forever blows the budget within a year. Set the policy explicitly (raw at 90 days, aggregates forever) and own it.

Follow-ups: How do you handle ad blockers that prevent event collection? What partitioning strategy gives analysts fast queries on this data? How do you deduplicate events from retry-prone mobile clients?

Scenario 4 (Medium): Migrate a legacy ETL pipeline to a modern stack

Prompt: "A financial services company has 200 stored procedures running nightly in SQL Server. They want to move to a cloud-native architecture. Plan the migration."

Key decisions: Lift-and-shift first, refactor second (reduce risk, build confidence); map stored procedures to dbt models (SQL-to-SQL translation); Airflow for orchestration (replacing SQL Agent jobs); Snowflake or BigQuery as the target warehouse; data quality framework to validate parity between old and new outputs.

Trade-offs: Migration strategy is the entire question. Big-bang cuts faster but breaks loudly when it breaks, and at a financial services company the breakage costs more than the savings. Parallel running keeps both stacks live and doubles your compute bill for two to four quarters, but lets you compare outputs before turning the old one off. Name both, recommend parallel, and say why.

Follow-ups: How do you validate that the new pipeline produces identical results? What do you do when a stored procedure has undocumented side effects? How do you handle the cutover for downstream consumers?

Scenario 5 (Hard): Design a feature store for ML model serving

Prompt: "An ML platform team serves 15 models in production. Feature computation is duplicated across teams. Design a centralized feature store."

Key decisions: Dual-compute architecture with batch features (Spark) and real-time features (Flink/streaming); online store (Redis/DynamoDB) for sub-10ms serving at prediction time; offline store (data lake/warehouse) for training dataset generation; feature registry with versioning, ownership, and lineage metadata; point-in-time-correct joins to prevent training-serving skew.

Trade-offs: Training-serving skew is where everything fails. If the feature value a model trained on doesn't match what it sees at request time, model quality degrades silently and nobody notices for weeks. The architecture has to guarantee parity, usually with the same code path computing features both offline (for training tables) and online (for serving). Anything that has two separate computation paths for the same feature is wrong on sight.

Follow-ups: How do you detect training-serving skew in production? What happens when a feature definition changes and 5 models depend on it? How do you handle features that require joins across multiple source tables at serving time?
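
The point-in-time-correct join from Scenario 5 is easier to defend once you've written it by hand. The sketch below assumes each entity's feature history is a list of (timestamp, value) pairs sorted by timestamp. For each label it picks the latest value observed at or before the label's timestamp, so no future data leaks into training. The function names and data shapes are hypothetical, not taken from any particular feature store.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value observed at or before as_of, else None.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    # bisect_right gives the count of entries with timestamp <= as_of.
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None


def build_training_rows(labels, history_by_entity):
    """Join each (entity_id, label_ts, label) to the feature value as of label_ts."""
    rows = []
    for entity_id, label_ts, label in labels:
        history = history_by_entity.get(entity_id, [])
        value = point_in_time_lookup(history, label_ts)
        rows.append((entity_id, label_ts, value, label))
    return rows
```

A naive join on entity_id alone would attach the newest feature value to every label, including labels that predate it. That is the leak this lookup avoids.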

How to Actually Prep for This Round

This is the round where senior loops actually get decided. The rejection rate is brutal because coding-round prep doesn't transfer. The skill being tested is reasoning under constraint, and it takes practice to build.

You can't cram this round. SQL syntax you can memorize. Pipeline trade-offs you have to talk through under pressure, with someone pushing back, until the loop becomes muscle. Two weeks of mock rounds beats two months of reading.

Read about it, then talk it out. Knowing that micro-batch beats true streaming when the SLA allows it isn't the same as defending that choice when the interviewer says 'but the head of growth wants real-time.' The transferable skill is the conversation, not the recall.

Speak the vocabulary fluently. Idempotency, exactly-once, backpressure, schema evolution, data skew, watermarks. If these words don't come out naturally, the interviewer's read is that you haven't shipped this work. The fix is using them in answers, out loud, until they're automatic.

Start with constraints, not tools. The most common failure mode at the senior level is jumping to Kafka before asking how many events per second the system actually needs. The right first move is always volume, freshness SLA, and budget. Those three numbers narrow the tool space dramatically before you draw a single box.

Carry rough numbers. Back-of-envelope math is the part interviewers score most kindly. A single Kafka broker handles on the order of 1M small messages per second. A modest Spark cluster chews through about 1 TB per hour. Streaming usually costs 3 to 10x batch for the same throughput. These are order-of-magnitude figures that shift with message size and hardware, but carrying them in your head is what distinguishes you from someone who's read about pipelines but never sized one.
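
The arithmetic is simple enough to rehearse in code. The sketch below reruns the clickstream numbers from Scenario 3 (100M users, 20 events per user per day, 3 to 5x peaks). The function names are hypothetical, and the 500-byte average event size in the usage note is an assumption for illustration, not a figure from the scenario.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sustained_events_per_second(users, events_per_user_per_day):
    """Average event rate if the daily volume were spread evenly over the day."""
    return users * events_per_user_per_day / SECONDS_PER_DAY


def peak_range(sustained_eps, low_multiplier=3, high_multiplier=5):
    """Peak rate range; you size for this, not for the average."""
    return sustained_eps * low_multiplier, sustained_eps * high_multiplier


def daily_volume_gb(events_per_day, avg_event_bytes):
    """Raw daily volume in decimal gigabytes, before compression."""
    return events_per_day * avg_event_bytes / 1e9


sustained = sustained_events_per_second(100_000_000, 20)
peak_low, peak_high = peak_range(sustained)
```

Here sustained comes out near 23K events per second and the peak range near 69K to 116K. At an assumed 500 bytes per event, daily_volume_gb(2_000_000_000, 500) gives about 1,000 GB of raw data per day, which is the number the retention follow-up turns on.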

Pipeline Architecture Interview FAQ

What is a pipeline architecture interview?

The DE equivalent of the software system design round. You get a vague scenario, ask clarifying questions, sketch an end-to-end pipeline, and defend the trade-offs as the interviewer pokes at it. There's no single right answer. The interviewer is evaluating reasoning under constraint, not whether you produced their favorite architecture.

How is it different from a software system design round?

Software system design optimizes for low latency and high availability under spiky request traffic: load balancers, caches, replicated databases. Pipeline architecture optimizes for throughput, data quality, freshness, and cost under continuous data flow: Kafka, Spark, Airflow, dbt, warehouses. The vocabulary and the failure modes are different. The skill of decomposing a fuzzy problem into bounded components is the same.

What topics get tested?

Six clusters: end-to-end pipeline shape, storage architecture and file formats, batch versus streaming, distributed processing (mostly Spark execution model), reliability and fault tolerance, and incremental loading with CDC. Most 45-minute rounds touch two or three of these, with one as the headline.

How should I structure my answer?

First five minutes: clarify volume, freshness SLA, sources, consumers. Next ten: sketch the four layers on a whiteboard. Next twenty: walk the data flow end to end, justifying each box. Last ten: failure modes, monitoring, cost. Most candidates spend too long drawing and not enough time on the back half, which is where the evaluation actually happens.

How is pipeline architecture different from ETL design?

ETL is a subset. ETL is 'how do I get data from A to B and transform it on the way.' Pipeline architecture is the whole platform: ingest, transform, store, serve, orchestrate, monitor, plus the cost model that decides which of those layers gets the budget. An ETL question is a 30-minute warm-up. A pipeline architecture question is the entire round.

How often does this round show up in DE loops?

In our corpus, roughly half of all DE loops include an explicit pipeline architecture round. At the senior level it's closer to 80%. Junior and mid loops lean on SQL and Python and might fold a lightweight design question into one of those rounds. By staff level, design is the entire onsite.

What tools should I be able to talk about?

Kafka for streaming ingestion, Spark for distributed batch and micro-batch, Airflow or Dagster for orchestration, dbt for SQL transformations, and at least one of Snowflake / BigQuery / Databricks for the warehouse. You don't need to be deep on all of them, but you do need to be able to justify picking one over another when the prompt asks.

What PySpark questions come up?

DataFrame versus RDD and why DataFrame won (Catalyst optimizer, columnar internals). Transformations versus actions and the lazy evaluation that split enables. Repartition versus coalesce. Broadcast joins for skew. UDF performance traps (Python serialization is the killer). At the senior level, AQE, dynamic partition pruning, and reading the Spark UI to find the slow stage.
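
The idea behind a broadcast join fits in a few lines of plain Python. This is an illustration of the mechanism, not Spark code: the lists of partitions stand in for executors, the function name is hypothetical, and the sketch assumes the small table has unique join keys.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a copied small table.

    large_partitions: list of partitions, each a list of dict rows.
    small_rows: list of dict rows, small enough to copy to every worker.
    """
    # Build the lookup once. In a cluster this map is what gets broadcast.
    lookup = {row[key]: row for row in small_rows}

    joined = []
    for partition in large_partitions:
        # Each partition joins locally. Rows of the large side never move,
        # so there is no shuffle and a hot key cannot overload one reducer.
        out = []
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                out.append({**match, **row})
        joined.append(out)
    return joined
```

In PySpark the corresponding hint is pyspark.sql.functions.broadcast applied to the small DataFrame inside a join. The thing to be able to explain is why it helps: only the small side is copied, so the skewed large side stays where it is.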

What Kafka questions come up?

Topics, partitions, consumer groups, offsets. Why partitioning gives you parallelism and per-partition ordering at the same time (ordering holds within a partition, not across the topic). At-least-once versus exactly-once. Why consumer group rebalance is the source of every weird latency spike. Log compaction versus time-based retention. Schema evolution with a registry. At the senior level, Kafka Connect for CDC, Kafka Streams versus Flink, and capacity planning.
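
The parallelism-plus-ordering answer comes down to hashing the message key to a partition. The sketch below models that in plain Python. It uses zlib.crc32 as a stand-in hash (the Java client's default partitioner hashes key bytes with murmur2), and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition. crc32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, payload) event to its partition, in arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions
```

Every event for one key lands in the same partition in arrival order, so a single consumer sees that key's history in sequence while other partitions are read in parallel. Nothing orders events across partitions, and changing the partition count remaps keys, both of which are common follow-ups.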

ETL versus ELT, in one paragraph?

ETL transforms in flight before landing. ELT lands raw and transforms in place using the warehouse's compute. ELT became the default once Snowflake and BigQuery made warehouse compute cheap and storage cheaper. The honest exception: if you have a regulatory reason to mask PII before it lands, that transform happens in flight and the rest happens in-warehouse. Most 2026 stacks are ELT with dbt as the in-warehouse transformation layer.
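
The PII exception is the part worth being able to sketch. Below, one transform happens in flight (hashing the email before it lands) and the rest happens in place with SQL. SQLite stands in for the warehouse, and the table names, function names, and hard-coded salt are hypothetical choices for illustration. A real deployment would manage the salt as a secret.

```python
import hashlib
import sqlite3


def mask_email(email, salt="demo-salt"):
    """In-flight transform: replace the email with a salted hash before landing."""
    return hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()


def land_raw(conn, orders):
    """Land orders as-is except for the masked email (the one in-flight step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id INTEGER, email_hash TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (o["order_id"], mask_email(o["email"]), o["amount_cents"], o["status"])
            for o in orders
        ],
    )


def transform_in_warehouse(conn):
    """ELT step: rebuild the modeled table from raw data using warehouse SQL."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT email_hash, SUM(amount_cents) AS revenue_cents
        FROM raw_orders
        WHERE status = 'completed'
        GROUP BY email_hash
        """
    )
```

The raw table never holds a plaintext email, while the business logic (which statuses count, how revenue aggregates) stays in SQL where it can be changed and rerun without re-ingesting. Note that land_raw appends, so it is not idempotent on its own.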

What Databricks questions come up?

Delta Lake for ACID on top of Parquet, Unity Catalog for governance, the medallion pattern for organizing the lakehouse. The deeper question is usually whether you understand why Delta solved the small-file problem and the transactional consistency problem that made vanilla data lakes painful. At senior, Photon, cluster autoscaling, and the trade-off between Databricks notebooks for dev and Workflows for production.

Why Practice

Active recall beats re-reading by 50%. Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

76% of hiring managers reject on the coding task, not the resume. From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

Five problem shapes cover 80% of data engineer loops. Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
