System design practice problems for data engineer interview prep. Each problem mirrors the 45-60 minute interview round format: scenario, clarifying questions, high-level architecture, failure-mode drill per component, cost reasoning, and adapt-on-fly when the rubric flips a requirement.

Practice system design problems for data engineer roles differ from practice SQL or Python problems because there is no auto-grader for "is this architecture correct". Multiple valid designs exist for any scenario. The rubric scores the decision process and the failure-mode articulation, not a specific canonical answer.

Five scenario families anchor the practice catalog. Streaming ingestion: design a pipeline that ingests 10B clickstream events per day with a 15-minute dashboard freshness SLA, where the dashboard sits on top of a BI tool that cannot handle table swaps. Daily warehouse load: design a daily ETL from a 50TB Postgres production database to Snowflake with a 4-hour load window and full backfill capability. ML feature pipeline: design a feature store with online (10ms read latency) and offline (training-time joins) paths for a recommendation model, with feature parity between paths. Reconciliation pipeline: design a daily reconciliation between payment processor settlement reports and internal transaction events, with discrepancy alerting and idempotent re-runs. Multi-region warehouse: design a multi-region active-active warehouse for a global product with regional sovereignty requirements (EU data stays in EU).
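For the reconciliation scenario, the "idempotent re-runs" requirement can be illustrated with a minimal sketch. It assumes both sides are already reduced to a mapping from transaction ID to amount, and that results are stored per run date; the function names, the discrepancy labels, and the in-memory `results_store` are hypothetical stand-ins, not part of any scenario spec.

```python
from decimal import Decimal


def reconcile(settlements, internal_events):
    """Compare two {transaction_id: amount} mappings and list discrepancies."""
    discrepancies = []
    for txn_id in sorted(set(settlements) | set(internal_events)):
        settled = settlements.get(txn_id)
        recorded = internal_events.get(txn_id)
        if settled is None:
            discrepancies.append((txn_id, "missing_in_settlement"))
        elif recorded is None:
            discrepancies.append((txn_id, "missing_internally"))
        elif settled != recorded:
            discrepancies.append((txn_id, "amount_mismatch"))
    return discrepancies


def run_daily(results_store, run_date, settlements, internal_events):
    """Overwrite the run_date partition so a re-run replaces, never appends."""
    results_store[run_date] = reconcile(settlements, internal_events)
    return results_store[run_date]


# Hypothetical sample data for one day.
settlements = {"t1": Decimal("10.00"), "t2": Decimal("5.00")}
internal = {"t1": Decimal("10.00"), "t2": Decimal("5.50"), "t3": Decimal("1.00")}

store = {}
first = run_daily(store, "2024-01-01", settlements, internal)
second = run_daily(store, "2024-01-01", settlements, internal)  # re-run
```

The design choice worth defending in the round is the overwrite-by-partition write: running the same day twice leaves one result set, so a retry after a partial failure cannot double-count discrepancies or double-fire alerts.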

Each problem ships with a rubric-scored verdict covering five dimensions. SLA match: does the proposed design meet the freshness, throughput, and latency requirements stated in the scenario. Cost reasoning: back-of-envelope numbers for Kafka partitions (or Kinesis shards), Spark workers, Snowflake credits, and S3 storage class trade-offs. Failure modes: three per component, each with a detection mechanism and a recovery strategy. Tool fit: why this technology (Kafka vs Kinesis vs Pub/Sub) and not the alternative, defended in one sentence. Adapt-on-fly: when the rubric flips a requirement (SLA tightens, volume jumps 100x, downstream constraint added), is the design modified in place or restarted from scratch.

The rubric explicitly penalizes five common failure modes. Skipping clarifying questions (jumping straight to drawing without nailing down throughput, latency, and durability requirements). Naming a tool without defending the choice ("I'd use Spark" without saying why over Flink or Dataflow). Missing failure modes (not naming what happens when Kafka brokers die, when Spark executors OOM, when Snowflake MERGE deadlocks). No cost reasoning at L5+ (a senior data engineer should produce rough numbers, not just architecture). Throwing out the design on a mid-round pivot instead of modifying it in place.

The senior data engineer who has practiced 5 of these scenarios with rubric review usually arrives at the interview with the failure-mode taxonomy internalized. The most common positive signal is the candidate who names a failure mode the interviewer was about to ask about, before being asked. That moves the rubric from "could answer when prompted" to "anticipates the problem".

Data Engineer System Design Problems

Rubric-scored system design practice problems for data engineer interview prep.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).


Common questions

How are data engineer system design problems graded without a single correct answer?
The rubric scores the decision process and failure-mode articulation, not a specific canonical answer. The weights are: SLA match (25 percent), cost reasoning (20 percent), failure modes per component (20 percent), tool fit defense (15 percent), and adapt-on-fly (20 percent). Multiple valid architectures score well if the data engineer can defend each component choice and name the failure modes.
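As a rough illustration of how those weights combine, here is a minimal sketch. The weights come from the answer above; the 0-4 per-dimension scale, the dictionary keys, and the sample scores are assumptions made for the example, not the actual grading scheme.

```python
# Weights as stated in the rubric description; they sum to 1.0.
RUBRIC_WEIGHTS = {
    "sla_match": 0.25,
    "cost_reasoning": 0.20,
    "failure_modes": 0.20,
    "tool_fit": 0.15,
    "adapt_on_fly": 0.20,
}


def weighted_score(dimension_scores):
    """Combine per-dimension scores (assumed 0-4 scale) into one weighted score."""
    missing = set(RUBRIC_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)


# Hypothetical candidate: strong on SLA match and tool fit, weak on cost reasoning.
example = {
    "sla_match": 4,
    "cost_reasoning": 1,
    "failure_modes": 3,
    "tool_fit": 4,
    "adapt_on_fly": 2,
}
print(round(weighted_score(example), 2))
```

Under these assumed scores, a strong architecture with weak cost reasoning and a shaky pivot lands well below the maximum, which is the point of weighting the process dimensions as heavily as the SLA match.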
What is the most common failure mode in system design practice?
Skipping clarifying questions. The candidate jumps straight to drawing without nailing down throughput, latency, durability, replay window, and exactly-once requirements. The rubric explicitly weights the clarifying phase; spending 5 minutes there scores higher than starting to draw immediately.
How many system design practice problems should I solve before a senior data engineer onsite?
Five well-practiced scenarios across five domains beat fifteen rushed ones on similar domains. Aim for: clickstream ingestion, daily ETL with CDC, ML feature store, payment reconciliation, multi-region warehouse. Each takes 60-75 minutes including rubric review. Finish all five over 3 weeks.
What is the cost reasoning expectation at L5+?
Rough back-of-envelope numbers, not exact pricing. For 10B events per day: throughput is about 116k events per second on average, with peak at 5x; at 1KB per event that is 116 MB/s average and 580 MB/s peak; on Kinesis at 1MB/s of write capacity per shard that is roughly 116 shards at average load and 580 at peak, with more needed if a skewed key distribution creates hot shards. For Snowflake: credits per hour by warehouse size and expected runtime (the BigQuery equivalent is cost per TB scanned versus slot reservation). For S3: storage class trade-offs. Aim for order-of-magnitude correctness.
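The arithmetic in that answer can be sketched as a small calculator. It assumes decimal units (1KB = 1,000 bytes, 1MB = 1,000,000 bytes) and the Kinesis provisioned-mode write limits of 1 MB/s and 1,000 records/s per shard; the function and parameter names are illustrative only.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_shards(events_per_day, event_bytes, peak_factor,
                    shard_bytes_per_sec=1_000_000, shard_records_per_sec=1_000):
    """Back-of-envelope shard count at average and peak load."""
    avg_eps = events_per_day / SECONDS_PER_DAY
    peak_eps = avg_eps * peak_factor

    def shards_for(eps):
        # A shard is limited by both bytes/s and records/s; take the tighter one.
        by_bytes = eps * event_bytes / shard_bytes_per_sec
        by_records = eps / shard_records_per_sec
        return math.ceil(max(by_bytes, by_records))

    return {
        "avg_events_per_sec": round(avg_eps),
        "peak_events_per_sec": round(peak_eps),
        "avg_shards": shards_for(avg_eps),
        "peak_shards": shards_for(peak_eps),
    }


result = estimate_shards(events_per_day=10_000_000_000, event_bytes=1_000, peak_factor=5)
print(result)
```

With these inputs the sketch gives 116 shards at average load and 579 at peak; the 580 quoted above is simply 116 times 5 after rounding, which is well within order-of-magnitude tolerance. The estimate assumes evenly distributed partition keys, so a skewed key pushes the real number higher.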
How should I handle the mid-round requirement flip?
Modify the existing design in place. Articulate what changes and what stays. SLA tightens from 15 min to 1 min: move from Spark Structured Streaming micro-batch to Flink streaming. Volume jumps 100x: increase Kafka partitions, repartition Spark jobs, and review broadcast vs sort-merge join decisions. Throwing out the design and restarting is the L4 signal; modifying in place is the L5 signal.
Do I need to know specific cloud providers?
Depends on the company. Amazon expects AWS-native (Kinesis, Glue, EMR, S3, Athena, Redshift). Google expects GCP-native (Pub/Sub, Dataflow, BigQuery, Dataproc). Most other companies stay vendor-neutral. Practice both AWS-native and GCP-native variants of clickstream and CDC designs; you will use one or the other depending on the company.
How long is a typical data engineer system design round?
45 to 60 minutes for one round. Senior+ loops sometimes include a second design round (platform-level meta-question). The 45-minute version: 5 minutes clarifying, 15 minutes high-level architecture, 20 minutes failure-mode drill, 5 minutes adapt-on-fly. Pacing matters: spending 30 minutes on the high-level means no time for the drill.
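The pacing point is simple arithmetic, sketched below. The phase budget is the 45-minute split from the answer above; the helper name and the phase labels are assumptions for illustration.

```python
# The 45-minute split described above, in order.
BUDGET_45 = [
    ("clarifying", 5),
    ("high-level architecture", 15),
    ("failure-mode drill", 20),
    ("adapt-on-fly", 5),
]


def minutes_left_after(budget, overrun_phase, actual_minutes, total_minutes=45):
    """Minutes remaining for all later phases if one phase runs long.

    Assumes every phase before overrun_phase took exactly its planned time.
    """
    used = 0
    for name, planned in budget:
        if name == overrun_phase:
            return total_minutes - used - actual_minutes
        used += planned
    raise KeyError(overrun_phase)


planned_later = 20 + 5  # failure-mode drill plus adapt-on-fly
left = minutes_left_after(BUDGET_45, "high-level architecture", 30)
print(planned_later, left)
```

Spending 30 minutes on the high-level architecture leaves 10 minutes for phases budgeted at 25, so the failure-mode drill, which the rubric weights heavily, is what gets cut.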