# Pipeline Architecture Practice Problems

> Architecture-decision practice problems for the data engineer system design round.

Canonical URL: <https://datadriven.io/pipeline-architecture-practice-problems>

## Summary

Pipeline architecture practice problems for data engineer interview prep. Each problem forces the four decisions the diagram encodes: batch versus streaming at a quantitative freshness threshold, ETL versus ELT transformation placement, delivery semantics (at-least-once with idempotent writes versus exactly-once), and the failure modes the prompt makes load-bearing. Scenarios span clickstream-to-warehouse, CDC from a production database, near-real-time fraud, sessionization, multi-region failover, and daily revenue close. Rubric-scored verdicts call out the tool reflex that does not survive the constraint.

## What this page covers

Pipeline architecture practice problems isolate the decision rather than the diagram. Most candidates can draw a box-and-arrow architecture; the L5+ rubric scores whether the boxes match the constraint. The four load-bearing decisions, drilled across every problem in the bank, are batch-versus-streaming placement, ETL-versus-ELT transformation placement, delivery semantics, and named failure modes. A passing design states each in terms of the constraint that forces it, not the tool the candidate used last year.

Batch versus streaming is the decision candidates get wrong most often, and it is quantitative, not aesthetic. If the freshness SLA is over five minutes, batch or micro-batch is almost always cheaper and operationally simpler: land raw to S3 partitioned by hour, run a dbt incremental or a Spark Structured Streaming job on a one-to-five-minute trigger into Snowflake or BigQuery, serve from the warehouse. If the SLA is under one minute, streaming is forced: Kafka into Flink or Spark Structured Streaming with watermarking, low-latency sink to Redis, Druid, or ClickHouse. Between one and five minutes is where judgment lives, and where candidates default to streaming when micro-batch satisfies the SLA at a fraction of the cost. The classic losing answer wires Kafka into Flink for a fifteen-minute dashboard; the hire-signal answer names fifteen minutes as a micro-batch SLA and revisits only if product wants sub-minute.
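The thresholds above can be written down as a small decision function. This is a minimal sketch rather than a rubric implementation: the function name `placement_for_sla` and the returned labels are illustrative assumptions; only the one-minute and five-minute cutoffs come from the text.

```python
from datetime import timedelta

# Cutoffs taken from the text: over five minutes is batch or micro-batch,
# under one minute forces streaming, and the band in between is judgment.
BATCH_FLOOR = timedelta(minutes=5)
STREAMING_CEILING = timedelta(minutes=1)


def placement_for_sla(freshness_sla: timedelta) -> str:
    """Map a freshness SLA to the placement the thresholds suggest.

    The labels returned here are hypothetical names chosen for this sketch.
    """
    if freshness_sla <= timedelta(0):
        raise ValueError("freshness SLA must be positive")
    if freshness_sla > BATCH_FLOOR:
        return "batch-or-micro-batch"
    if freshness_sla < STREAMING_CEILING:
        return "streaming"
    return "judgment"  # one to five minutes, inclusive


if __name__ == "__main__":
    for minutes in (15, 3, 0.5):
        print(minutes, placement_for_sla(timedelta(minutes=minutes)))
```

Under these assumptions the fifteen-minute dashboard lands in the batch-or-micro-batch branch, which is the point of the example: the SLA, not the tool, picks the branch.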

ETL versus ELT is transformation placement. ELT is the 2026 default because columnar warehouse compute (Snowflake, BigQuery, Redshift, Databricks) is cheap and the raw bronze layer enables replay: land raw, transform in-warehouse with dbt or Spark, model bronze to silver to gold. ETL still wins when raw data is sensitive and cannot land unmasked, when the warehouse is undersized, or when downstream needs data pre-shaped. The interviewer rarely cares which the candidate picks; they score whether the candidate can name the conditions that flip the answer.
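One way to practice naming the conditions that flip the answer is to encode them explicitly. The sketch below is illustrative only: `TransformContext`, its field names, and `transformation_placement` are hypothetical names, and the three conditions are the ones listed in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformContext:
    """Hypothetical inputs describing the three flip conditions."""
    raw_is_sensitive: bool       # raw data cannot land unmasked
    warehouse_undersized: bool   # warehouse cannot absorb the transform
    needs_preshaped_data: bool   # downstream requires data already shaped


def transformation_placement(ctx: TransformContext) -> tuple[str, list[str]]:
    """Return the placement plus the conditions that forced it.

    ELT is the default; any one condition flips the answer to ETL.
    """
    reasons = []
    if ctx.raw_is_sensitive:
        reasons.append("raw data is sensitive and cannot land unmasked")
    if ctx.warehouse_undersized:
        reasons.append("warehouse is undersized for the transform")
    if ctx.needs_preshaped_data:
        reasons.append("downstream needs data pre-shaped")
    return ("ETL" if reasons else "ELT", reasons)


if __name__ == "__main__":
    print(transformation_placement(TransformContext(False, False, False)))
    print(transformation_placement(TransformContext(True, False, False)))
```

Returning the reasons alongside the answer mirrors what the rubric rewards: the condition, not the label.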

Delivery semantics is the third decision. At-least-once with idempotent writes (run_id baked into output partitions, MERGE INTO on a composite natural key) is the right answer for nearly every internal pipeline, because the business cares about exactly-once effect, not exactly-once message delivery. Exactly-once at the message level is real but expensive (Kafka transactions, Flink two-phase-commit sinks, checkpoint coordination); pull it out only when downstream genuinely cannot deduplicate. At-most-once fits fire-and-forget telemetry, which is rarer than candidates assume.
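The exactly-once-effect idea can be shown with a keyed upsert. The sketch below uses SQLite's `INSERT OR REPLACE` on a composite primary key as a stand-in for a warehouse `MERGE INTO`; it is not Snowflake or BigQuery syntax, and the table `daily_revenue` and its columns are hypothetical.

```python
import sqlite3


def apply_batch(conn: sqlite3.Connection, rows: list[tuple[str, str, int]]) -> None:
    """Upsert rows keyed on (account_id, day).

    INSERT OR REPLACE stands in for MERGE INTO on a composite natural key:
    a redelivered row replaces the existing row instead of duplicating it.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO daily_revenue (account_id, day, amount_cents) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def demo() -> list[tuple[str, str, int]]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_revenue ("
        "account_id TEXT NOT NULL, day TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL, "
        "PRIMARY KEY (account_id, day))"
    )
    batch = [("acct-1", "2026-01-01", 1200), ("acct-2", "2026-01-01", 800)]
    apply_batch(conn, batch)
    apply_batch(conn, batch)  # simulate at-least-once redelivery of the batch
    return conn.execute(
        "SELECT account_id, day, amount_cents FROM daily_revenue "
        "ORDER BY account_id"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The batch is delivered twice, yet the table holds one row per key, which is the effect the business cares about.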

Failure modes are the senior signal. For each component, name what happens when it dies, when it gets backed up, and when the upstream schema changes, before the interviewer prompts. Kafka: broker death (replication factor 3, ISR), partition skew, consumer lag. Spark Structured Streaming: executor OOM, watermark too aggressive dropping late data, checkpoint corruption. Snowflake MERGE: deadlock with a concurrent writer, partition not yet committed, schema drift caught at the registry. Late-arriving data, replay after an upstream outage, and source-side dedup recur across prompts.

Companies that weight architecture decisions heavily in data engineer rounds: Netflix (Spark and Iceberg with late-arriving data), Stripe (idempotent reconciliation, exactly-once effect), Meta (28-day attribution windows on 10B+ events per day), Amazon (AWS-native Kinesis to Firehose to S3 to Glue to Redshift), Google (GCP-native Pub/Sub to Dataflow to BigQuery).

## Frequently asked questions

### What does a pipeline architecture practice problem test that a diagram alone does not?

The decision behind the diagram. Most candidates can draw boxes and arrows; the rubric scores whether the boxes match the constraint. Four decisions are drilled on every problem: batch versus streaming at a quantitative freshness threshold, ETL versus ELT transformation placement, delivery semantics, and the failure modes the prompt makes load-bearing. A correct-looking diagram with the wrong batch-versus-streaming call still fails.

### How is the batch-versus-streaming decision scored?

Quantitatively, against the freshness SLA in the prompt. Over five minutes: batch or micro-batch is almost always the cheaper, simpler answer. Under one minute: streaming is forced. Between one and five minutes is judgment territory. The single most common failed-round pattern is defaulting to streaming (Kafka plus Flink) when a fifteen-minute dashboard SLA is a micro-batch problem that a dbt incremental solves at a fraction of the cost.

### When does ETL beat ELT in these problems?

ELT is the 2026 default because warehouse compute is cheap and a raw bronze layer enables replay. ETL still wins in three cases: raw data is sensitive and cannot land in the warehouse unmasked, the warehouse is undersized for the transform, or downstream consumers need the data already shaped. The rubric rewards naming the condition that flips the answer, not the answer itself.

### What delivery semantics should I default to?

At-least-once with idempotent writes: run_id baked into output partitions plus MERGE INTO on a composite natural key. The business cares about exactly-once effect, not exactly-once message delivery, and idempotency gets you there cheaply. Reach for true exactly-once (Kafka transactions, Flink two-phase-commit sinks) only when downstream genuinely cannot deduplicate. At-most-once fits fire-and-forget telemetry and little else.
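The "run_id baked into output partitions" half of that default can be sketched with plain files. This is an assumption-laden illustration, not a storage-layer recipe: the `run_id=<value>` directory layout, the file name `part-0000.json`, and both helper functions are hypothetical, and a local temp directory stands in for object storage.

```python
import json
import tempfile
from pathlib import Path


def write_partition(root: Path, run_id: str, rows: list[dict]) -> Path:
    """Write one run's output to a path derived only from run_id.

    A retry of the same run targets the same path, so it overwrites the
    earlier attempt instead of appending a second copy.
    """
    partition = root / f"run_id={run_id}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "part-0000.json"
    tmp = partition / "part-0000.json.tmp"
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)  # rename over the old file so readers never see a partial write
    return target


def read_all(root: Path) -> list[dict]:
    """Read every committed partition under root, in run_id order."""
    rows: list[dict] = []
    for part in sorted(root.glob("run_id=*/part-*.json")):
        rows.extend(json.loads(part.read_text()))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        root = Path(tmp_dir)
        rows = [{"order_id": 1}, {"order_id": 2}]
        write_partition(root, "2026-01-01", rows)
        write_partition(root, "2026-01-01", rows)  # retried run
        print(len(read_all(root)))
```

Because the path depends on the logical run and not on the attempt, a retried run leaves the same two rows behind.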

### How do I show senior signal on failure modes?

Name them before the interviewer asks. For each component, state what happens when it dies, when it backs up, and when the upstream schema changes. Kafka: broker death handled by replication factor 3, partition skew, consumer lag. Spark Structured Streaming: executor OOM, over-aggressive watermark dropping late data, checkpoint corruption. Snowflake MERGE: deadlock, uncommitted partition, schema drift. Pick the one or two relevant to the prompt and design for them.
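The over-aggressive watermark failure mode is easier to discuss with a toy model. The sketch below is a deliberately simplified simulation, not Spark's actual implementation (Spark advances its watermark per micro-batch and applies it to window state): here the watermark is the maximum event time seen so far minus an allowed lateness, checked per event. The function name and the sample timestamps are hypothetical.

```python
def split_by_watermark(
    event_times: list[int], allowed_lateness: int
) -> tuple[list[int], list[int]]:
    """Split events into accepted and dropped under a simple watermark.

    event_times are event-time seconds, listed in arrival order.
    watermark = (max event time seen so far) - allowed_lateness.
    An event older than the current watermark is treated as too late.
    """
    max_seen = float("-inf")
    accepted: list[int] = []
    dropped: list[int] = []
    for t in event_times:
        watermark = max_seen - allowed_lateness
        if t < watermark:
            dropped.append(t)
        else:
            accepted.append(t)
        max_seen = max(max_seen, t)
    return accepted, dropped


if __name__ == "__main__":
    arrivals = [100, 105, 110, 98, 120, 104]  # 98 and 104 arrive late
    print(split_by_watermark(arrivals, allowed_lateness=5))
    print(split_by_watermark(arrivals, allowed_lateness=20))
```

With a five-second allowance both late events are dropped; with twenty seconds both survive. The trade-off to state out loud is that a looser watermark keeps more late data at the cost of holding state longer.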

### Which scenarios are in the pipeline architecture bank?

Clickstream into a warehouse (batch-versus-streaming threshold), CDC from a production database (Debezium versus read replica, schema evolution), near-real-time fraud detection (genuine streaming, exactly-once), sessionization at scale (stateful streaming, late events), multi-region failover (active-active versus active-passive, RPO and RTO), daily revenue close (idempotency and reconciliation), embedded analytics (workload isolation, caching), and legacy ETL to dbt migration (dependency graph, diff-test cutover).

### Should I bring up cost in an architecture round?

Yes, briefly, and after the design is sketched. Bringing up cost too early reads as cost-anxiety; never mentioning it reads as inexperience. The senior move is one sentence after the sketch: 'I'd estimate this at low hundreds of dollars a month at the stated volume; if cost is constrained, the next move is X.' Cost is a first-class rubric dimension because an over-provisioned design that meets the SLA still loses to a tight one that also meets it.

### How many pipeline architecture problems should I practice before an onsite?

Eight to twelve well-practiced scenarios across the recurring shapes beats twenty rushed ones. The signal interviewers test is recognizing the prompt shape inside the first minute and reaching for the constraint-matched design, not the famous tool. Volume matters less than transferring the four decisions to a new source-transform-serve combination.

## How a data engineer attacks a pipeline architecture problem

A six-step framework: the four load-bearing architecture decisions first, then cost, then the interviewer's pivot.

### Step 1: Read the SLA as a number

Convert freshness into a threshold: over 5 min is batch or micro-batch, under 1 min forces streaming, 1-5 min is judgment. State the threshold the answer turns on.

### Step 2: Place the transformation

ELT by default (land raw, transform in-warehouse). Pick ETL only when raw is sensitive, the warehouse is undersized, or downstream needs pre-shaped data. Name the condition that flips it.

### Step 3: Choose delivery semantics

At-least-once plus idempotent writes (run_id, MERGE INTO) for nearly everything. Exactly-once only when downstream cannot dedup. State why.

### Step 4: Name the failure modes

For each component: dies, backs up, schema changes. Pick the one or two the prompt makes load-bearing (late data, replay, dedup) and design for them.
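For the "backs up" case, a quick replay calculation shows whether the design recovers after an outage. The sketch below is back-of-envelope arithmetic under a steady-rate assumption; the function name and the example rates are hypothetical, not figures from any prompt.

```python
import math


def drain_time_seconds(
    backlog_events: float, produce_rate: float, consume_rate: float
) -> float:
    """Estimate how long a consumer needs to clear a backlog.

    Rates are events per second and assumed constant. The backlog shrinks
    only by the headroom (consume_rate - produce_rate), because new events
    keep arriving while the consumer catches up.
    """
    if backlog_events < 0 or produce_rate < 0 or consume_rate < 0:
        raise ValueError("inputs must be non-negative")
    if consume_rate <= produce_rate:
        return math.inf  # no headroom: the backlog never shrinks
    return backlog_events / (consume_rate - produce_rate)


if __name__ == "__main__":
    # Hypothetical numbers: a one-hour outage at 1,000 events/s leaves
    # 3.6M events queued; the consumer tops out at 1,500 events/s.
    seconds = drain_time_seconds(3_600_000, produce_rate=1_000, consume_rate=1_500)
    print(seconds / 3600, "hours to catch up")
```

In this hypothetical a one-hour outage takes two hours to replay, which is the kind of number that tells you whether the consumer needs more headroom.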

### Step 5: Price it in one sentence

Back-of-envelope cost band after the sketch. An over-provisioned design that meets the SLA still loses to a tight one that also meets it.
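A one-sentence cost band usually comes from arithmetic like the following. Every number here is a placeholder assumption: the unit prices are hypothetical and do not reflect any vendor's pricing, and the function name, parameters, and volumes are made up for illustration.

```python
def monthly_cost_usd(
    events_per_day: int,
    bytes_per_event: int,
    storage_usd_per_gb_month: float,
    compute_hours_per_day: float,
    compute_usd_per_hour: float,
    days: int = 30,
) -> float:
    """Rough monthly cost: storage for one month of raw data plus compute.

    Assumes raw data is retained for the full month and ignores
    compression, egress, and request charges.
    """
    gb_per_day = events_per_day * bytes_per_event / 1e9
    storage = gb_per_day * days * storage_usd_per_gb_month
    compute = compute_hours_per_day * days * compute_usd_per_hour
    return storage + compute


if __name__ == "__main__":
    # Hypothetical inputs: 50M events/day at 500 bytes each,
    # placeholder prices of $0.02 per GB-month and $3 per compute hour.
    estimate = monthly_cost_usd(
        events_per_day=50_000_000,
        bytes_per_event=500,
        storage_usd_per_gb_month=0.02,
        compute_hours_per_day=2,
        compute_usd_per_hour=3.0,
    )
    print(f"roughly ${estimate:.0f} a month")
```

With these placeholder inputs compute dominates storage by an order of magnitude, so the follow-up sentence about constrained cost would target compute hours first.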

### Step 6: Adapt to the pivot

When the interviewer tightens the SLA or jumps the volume, modify the existing design in place and articulate which decision moved.

## Related practice catalogs

- [Data pipeline practice problems](https://datadriven.io/data-pipeline-practice-problems): End-to-end ingest, transform, serve design across the same scenarios.
- [Data engineer system design interview prep](https://datadriven.io/system-design-interview-prep): Full prep guide for the architecture round rubric.
- [Data engineer system design problems](https://datadriven.io/data-engineer-system-design-problems): Rubric-scored end-to-end design problems for L5+ roles.
- [Streaming system design interview questions](https://datadriven.io/streaming-system-design-interview-questions): When the SLA forces Kafka plus Flink or Spark Structured Streaming.
- [CDC pipeline interview questions](https://datadriven.io/cdc-pipeline-interview-questions): Debezium versus read replica as the ingest decision.
- [Clickstream pipeline interview questions](https://datadriven.io/clickstream-pipeline-interview-questions): Batch-versus-streaming threshold on a real dashboard SLA.
- [Kafka system design interview questions](https://datadriven.io/kafka-system-design-interview-questions): Partition strategy, consumer groups, exactly-once semantics.
- [ETL design interview prep](https://datadriven.io/etl-design-interview-prep): ETL-versus-ELT placement and idempotent MERGE patterns.
- [ELT interview questions](https://datadriven.io/elt-interview-questions): Why ELT is the 2026 default and when ETL still wins.
- [Data pipeline design interview questions](https://datadriven.io/data-pipeline-design-interview-questions): Layer-by-layer design question catalog.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.