System design interview prep for data engineer roles. End-to-end pipeline design rounds last 45 to 60 minutes. Scenarios include 10B-events-per-day clickstream, 15-minute-freshness dashboard, 28-day late-arriving conversion window, multi-region replication. Rubric weights SLA match, cost reasoning, 3 failure modes per component, tool fit, and adapt-on-fly when the interviewer changes a requirement.

The system design round appears in roughly 52 percent of senior-and-above data engineer interview loops. The format is 45 to 60 minutes of end-to-end pipeline design on a whiteboard or canvas, built around a concrete scenario such as "10 million events per day, 15-minute dashboard freshness SLA, downstream BI tool that cannot handle table swaps". The interviewer expects a high-level architecture first, then drills on three failure modes per component, cost reasoning at scale, and the on-call story (what gets paged, who responds, what the runbook says).

Six scenario shapes recur across data engineer system design rounds in 2026.
Clickstream ingestion: 10B events per day, web SDK to local buffer to Kafka to Spark Structured Streaming or Flink to Parquet on S3 partitioned by date and hour to dbt to gold star schemas in Snowflake.
Daily ETL from Postgres to Snowflake: Debezium CDC to Kafka to S3 raw immutable to Spark daily ETL with run_id baked into output partitions to Snowflake MERGE on composite natural key.
ML feature store: real-time path uses Flink to Redis with 10ms reads, batch path uses Spark to S3 Parquet to Feast catalog, training uses as-of joins with feature_ts <= label_ts to prevent leakage.
Daily reconciliation pipeline for payments: Postgres to Debezium to Kafka to S3 raw immutable to idempotent Spark with run_id to Snowflake MERGE on (txn_id, run_id).
Multi-region active-active warehouse: region-local writes with async cross-region CDC replication, conflict resolution via last-writer-wins or CRDT for counters, SLA tiers (real-time within region, eventually-consistent across regions), 2x storage minimum.
Real-time analytics dashboard: micro-batch with Spark Structured Streaming on 1-minute trigger, Materialize or Druid for serving, hourly Spark to Snowflake for historical.
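The as-of join from the ML feature store scenario is easy to describe and easy to get wrong on a whiteboard, so a small sketch may help. The version below is a minimal pure-Python illustration, not how Feast or Spark implement it; the tuple layouts and the `as_of_join` name are assumptions made for this example. For each label it picks the most recent feature row with feature_ts <= label_ts, so a feature written after the label never leaks into training.

```python
from bisect import bisect_right


def as_of_join(labels, features):
    """Attach to each label the latest feature value with feature_ts <= label_ts.

    labels:   list of (entity_id, label_ts, label)      -- hypothetical layout
    features: list of (entity_id, feature_ts, value)    -- hypothetical layout
    """
    # Group feature rows per entity, with timestamps in ascending order.
    by_entity = {}
    for entity_id, feature_ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        ts_list, values = by_entity.setdefault(entity_id, ([], []))
        ts_list.append(feature_ts)
        values.append(value)

    joined = []
    for entity_id, label_ts, label in labels:
        ts_list, values = by_entity.get(entity_id, ([], []))
        # bisect_right returns how many feature rows have feature_ts <= label_ts.
        idx = bisect_right(ts_list, label_ts)
        feature = values[idx - 1] if idx > 0 else None
        joined.append((entity_id, label_ts, label, feature))
    return joined


features = [("u1", 100, 0.2), ("u1", 200, 0.9), ("u2", 150, 0.5)]
labels = [("u1", 150, 1), ("u1", 200, 0), ("u2", 100, 1)]
rows = as_of_join(labels, features)
```

In this toy data the label for u2 at time 100 gets no feature at all, because the only u2 feature was written at time 150. Returning None there, rather than the later value, is the leakage guard.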

The L5+ rubric explicitly weights three failure modes per component. For each box on the whiteboard, name what happens when it dies, when it gets backed up, and when the upstream schema changes.
For Kafka: broker dies (replication factor 3 handles N-1 failures), partition skew (key distribution review, repartitioning), consumer lag (autoscale consumer group, check downstream).
For Spark Structured Streaming: executor OOM (memory tuning, broadcast threshold), watermark too aggressive (late data drops, increase watermark), checkpoint corruption (delete checkpoint and reprocess from earliest acceptable offset).
For Snowflake MERGE: deadlock with concurrent writer (serialize via lock or queue, use insert_overwrite pattern), partition not yet committed (delay merge until processing watermark advances), schema drift (schema registry enforcement at producer side).
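The "watermark too aggressive" failure mode is easier to explain with a concrete trace. The sketch below is a simplified model, assuming a watermark that trails the largest event time seen so far by a fixed allowed lateness and is re-evaluated per event; engines such as Spark Structured Streaming advance the watermark between micro-batches instead, so treat this as an illustration of the idea only. The function name and the sample events are hypothetical.

```python
def apply_watermark(events, allowed_lateness):
    """Split events into (accepted, dropped) using an event-time watermark.

    events: iterable of (event_ts, payload) in arrival order.
    The watermark is max(event_ts seen so far) - allowed_lateness.
    """
    max_event_ts = None
    accepted, dropped = [], []
    for event_ts, payload in events:
        watermark = None if max_event_ts is None else max_event_ts - allowed_lateness
        if watermark is not None and event_ts < watermark:
            # Older than the watermark: this is the "late data drops" case.
            dropped.append((event_ts, payload))
            continue
        accepted.append((event_ts, payload))
        if max_event_ts is None or event_ts > max_event_ts:
            max_event_ts = event_ts
    return accepted, dropped


# Arrival order: the event with event_ts=95 arrives after event_ts=160.
arrivals = [(100, "a"), (160, "b"), (95, "late"), (170, "c")]
tight_accepted, tight_dropped = apply_watermark(arrivals, allowed_lateness=30)
loose_accepted, loose_dropped = apply_watermark(arrivals, allowed_lateness=120)
```

With a lateness of 30 the late event is dropped; with 120 it is kept. The trade-off to name in the interview is that a larger lateness keeps more data but holds state longer.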

Companies that emphasize system design heavily in data engineer loops: Netflix (Spark and Iceberg with streaming and late-arriving data), Stripe (idempotent reconciliation, financial-data audit, multi-region for global payments), Meta (ads attribution with 28-day windows, feed-ranking signals pipeline at 10B+ events per day), Amazon (AWS-native architectures: Kinesis to Firehose to S3 to Glue and Athena and Redshift), Google (GCP-native: Pub/Sub to Dataflow to BigQuery), Databricks (Spark expertise, AQE, Delta MERGE INTO with optimize). The senior data engineer who has practiced 5 of these architectures end-to-end with explicit failure-mode articulation usually clears the system design round at any of them.

Data Engineer System Design Interview Prep

Prep for the system design round of a data engineer interview loop with rubric-scored practice scenarios.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).

Common questions

What percentage of data engineer interviews include a system design round?
Roughly 52 percent of senior-and-above data engineer interview loops include an explicit system design round. The share rises with seniority: nearly all L5+ data engineer loops include system design, and L6+ loops often include two design rounds (one pipeline-specific, one platform-level). Format is 45 to 60 minutes on a whiteboard or canvas.
What does the system design rubric score for data engineer interviews?
Five dimensions in most companies' rubrics. SLA match (25 percent): does the design meet the freshness, throughput, and latency requirements. Cost reasoning (20 percent): back-of-envelope numbers for Kafka partitions, Spark workers, Snowflake credits. Failure modes (20 percent): 3 per component, with detection and recovery. Tool fit (15 percent): why this technology and not the alternative. Adapt-on-fly (20 percent): when the interviewer changes a requirement mid-round, does the candidate modify the design in place or restart.
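For self-scoring a practice round, the weights above can be turned into a single number. This is a minimal sketch: the weights come from the answer above, while the 0-to-4 scale per dimension, the dictionary keys, and the `weighted_score` name are assumptions made for illustration, not any company's actual scoring sheet.

```python
# Rubric weights in percent, as listed in the answer above.
RUBRIC_WEIGHTS = {
    "sla_match": 25,
    "cost_reasoning": 20,
    "failure_modes": 20,
    "tool_fit": 15,
    "adapt_on_fly": 20,
}


def weighted_score(scores):
    """Combine per-dimension scores (assumed 0-4 scale) into one weighted score."""
    missing = set(RUBRIC_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[dim] * scores[dim] for dim in RUBRIC_WEIGHTS) / 100


# Hypothetical self-assessment after one mock round.
practice_round = {
    "sla_match": 4,
    "cost_reasoning": 2,
    "failure_modes": 3,
    "tool_fit": 4,
    "adapt_on_fly": 2,
}
overall = weighted_score(practice_round)
```

In this made-up round the two 20-percent dimensions scored at 2 pull the overall result down more than a weak tool-fit answer would, which suggests where to spend the next practice session.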
What scenarios are most common in data engineer system design rounds?
Clickstream ingestion (10B events per day), daily ETL Postgres to Snowflake, ML feature store with online and offline paths, daily payment reconciliation, multi-region active-active warehouse, real-time analytics dashboard with micro-batch trigger. Each has a canonical architecture with company-specific variations: AWS-native at Amazon, GCP-native at Google, Spark-and-Iceberg at Netflix, idempotent-reconciliation at Stripe.
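The idempotent reconciliation pattern is the part of these scenarios that interviewers tend to probe, so a runnable sketch may be useful. The example below uses SQLite's INSERT OR REPLACE purely as a local stand-in for a Snowflake MERGE on (txn_id, run_id); the table name, columns, and amounts are hypothetical, and it assumes run_id is deterministic for a given logical run (for example, the run date), so a retry reuses the same run_id.

```python
import sqlite3


def merge_batch(conn, rows):
    """Upsert rows keyed on (txn_id, run_id); replaying a batch adds no rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO payments (txn_id, run_id, amount_cents) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " txn_id TEXT NOT NULL,"
    " run_id TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL,"
    " PRIMARY KEY (txn_id, run_id))"
)

batch = [("t1", "2026-01-01", 500), ("t2", "2026-01-01", 1250)]
merge_batch(conn, batch)
merge_batch(conn, batch)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
total_cents = conn.execute("SELECT SUM(amount_cents) FROM payments").fetchone()[0]
```

After the retry the table still holds two rows and the same total, which is the property to state out loud: rerunning a failed job must not double-count payments.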
How does the cost reasoning part of the rubric work?
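The L5+ rubric expects rough back-of-envelope numbers. For 10B events per day on Kafka: average throughput is about 116k events per second, with peak at roughly 5x; at 1 KB per event that is about 116 MB/s average and 580 MB/s peak. On a hypothetical ingest unit rated at 100 MB/s, that works out to 2 units at average load and 6 at peak. Real units are smaller: a provisioned Kinesis Data Streams shard accepts 1 MB/s or 1,000 records per second of writes, so the same load needs on the order of 116 shards at average and 580 at peak, while Kafka partition counts depend on measured per-partition throughput. For Snowflake: credits billed by warehouse size and running time; per-TB-scanned on-demand pricing versus slot reservations is the equivalent comparison for BigQuery. For S3: storage class trade-offs (Standard vs Infrequent Access vs Glacier).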
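The arithmetic in this answer can be checked with a few lines of code. The sketch below assumes the figures stated above (10B events per day, 1 KB per event, a 5x peak factor) and treats the 100 MB/s per-unit capacity as a placeholder to swap for whatever the chosen system actually sustains; the function and field names are made up for this example.

```python
import math

SECONDS_PER_DAY = 86_400


def throughput_estimate(events_per_day, bytes_per_event, peak_factor, unit_mb_per_s):
    """Back-of-envelope ingest sizing; unit_mb_per_s is an assumed per-unit capacity."""
    avg_events_per_s = events_per_day / SECONDS_PER_DAY
    avg_mb_per_s = avg_events_per_s * bytes_per_event / 1_000_000
    peak_mb_per_s = avg_mb_per_s * peak_factor
    return {
        "avg_events_per_s": round(avg_events_per_s),
        "avg_mb_per_s": round(avg_mb_per_s),
        "peak_mb_per_s": round(peak_mb_per_s),
        # Round up: a fractional unit still has to be provisioned.
        "units_at_avg": math.ceil(avg_mb_per_s / unit_mb_per_s),
        "units_at_peak": math.ceil(peak_mb_per_s / unit_mb_per_s),
    }


estimate = throughput_estimate(
    events_per_day=10_000_000_000,
    bytes_per_event=1_000,
    peak_factor=5,
    unit_mb_per_s=100,
)
```

Calling the same function with unit_mb_per_s=1 shows how sensitive the unit count is to the per-unit capacity assumption, which is usually the follow-up question.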
What is the 3-failure-modes-per-component expectation?
For each box on the whiteboard, name what happens when it dies (replication, failover), when it gets backed up (autoscale, backpressure), and when the upstream schema changes (schema registry, schema-on-read, dead-letter queue). Senior data engineer candidates do this proactively; junior candidates wait to be asked.
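The schema-change branch of that answer can be illustrated with a dead-letter queue. This is a minimal in-memory sketch, assuming a hypothetical three-field event schema and plain Python lists in place of real topics; a production pipeline would typically validate against a schema registry instead of a hard-coded dictionary.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"event_id": str, "user_id": str, "ts": int}


def route(records, schema=EXPECTED_SCHEMA):
    """Split records into (valid, dead_letter) without failing the whole batch."""
    valid, dead_letter = [], []
    for record in records:
        problems = [
            f"{field}: expected {expected.__name__}"
            for field, expected in schema.items()
            if not isinstance(record.get(field), expected)
        ]
        if problems:
            # Keep the original record with the reasons, so it can be replayed later.
            dead_letter.append({"record": record, "errors": problems})
        else:
            valid.append(record)
    return valid, dead_letter


incoming = [
    {"event_id": "e1", "user_id": "u1", "ts": 1700000000},
    {"event_id": "e2", "user_id": "u2", "ts": "1700000001"},  # ts became a string
    {"event_id": "e3", "ts": 1700000002},                     # user_id was dropped
    {"event_id": "e4", "user_id": "u4", "ts": 1700000003, "country": "DE"},
]
valid, dead_letter = route(incoming)
```

Extra fields pass through here, so an additive upstream change does not page anyone, while a type change or a removed field lands in the dead-letter list with a reason attached.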
How does a data engineer prep for the mid-round pivot?
Practice with a peer or AI mock interviewer that explicitly changes the requirements halfway through. Common pivots: SLA tightens from 15 minutes to 1 minute (requires moving from micro-batch to streaming), data volume jumps 100x (requires partitioning strategy review, broadcast vs sort-merge join decision flip), the BI tool cannot handle table swaps (requires insert_overwrite or materialized view pattern instead of CTAS). The L5 signal is articulating what changes and what stays in the existing design without throwing it out.
How long is a system design round?
45 to 60 minutes for one round at most companies. Senior+ loops sometimes add a second design round (a 'design the platform' meta-question). The 45-minute version expects high-level architecture in 15 minutes, drill on 2-3 components in 20 minutes, and adapt-on-fly plus questions in the final 10. Pacing matters: spending 30 minutes on the high-level architecture means no time for the drill.
What stack should a data engineer assume in design rounds?
Depends on the company. AWS at Amazon (Kinesis, Glue, EMR, S3, Athena, Redshift). GCP at Google (Pub/Sub, Dataflow, BigQuery, Dataproc). Spark+Iceberg at Netflix. Presto+Hive+Spark+internal tools at Meta. Smaller companies and most non-FAANG companies are stack-neutral: pick a stack you can defend and use it. Mention alternatives when they would be more appropriate.