Data engineering system design questions with rubric-scored verdicts. Scenarios pulled from real data engineer interview reports: 10B-event-per-day clickstream, daily Postgres-to-Snowflake CDC, ML feature store with online and offline paths, payment reconciliation with idempotency, multi-region active-active warehouse. Each problem ships with the 3-failure-modes-per-component drill the L5+ rubric expects.

Data engineering system design questions in 2026 cover six canonical scenarios. Each appears in interview reports across multiple companies with consistent rubric weights. The data engineer who has practiced each scenario end-to-end with explicit failure-mode articulation arrives at the round with the architecture pattern ready.

Clickstream ingestion at 10B events per day. Web SDK ships events to a local buffer (avoids blocking on network). Local buffer flushes to a CDN-fronted ingest endpoint that writes to Kafka (24 partitions per topic for 100k events-per-second throughput). Spark Structured Streaming consumer with 1-minute trigger writes Parquet to S3 partitioned by date and hour. dbt micro-batch on the silver bronze sweep at hourly cadence promotes to gold star schemas in Snowflake. Failure modes: Kafka broker dies (replication factor 3, ISR replication), Spark executor OOM (memory tuning, broadcast threshold), S3 partition late (watermark-driven late-data handling, MERGE for backfill).

Daily ETL from Postgres to Snowflake via CDC. Debezium connector watches the Postgres write-ahead log, emits change events to Kafka. Kafka Connect S3 Sink writes raw immutable change events to S3 partitioned by source table and date. Spark daily ETL reads the partitions, applies dedup on the composite natural key (table_pk, op_ts), and writes to a staging schema in Snowflake. Snowflake MERGE INTO on the staging schema upserts to the production gold tables, with run_id baked into the output partition for backfill safety. Failure modes: Debezium falls behind (lag monitoring, restart from snapshot), Spark dedup misses a late event (MERGE on composite key is idempotent, late events still update), Snowflake MERGE deadlocks with concurrent writer (serialize via lock or queue).

ML feature store with online and offline paths. Online path: Flink streaming job consumes Kafka events, computes features on event-time windows (tumbling 5-minute, sliding 1-hour, session 30-minute gap), writes to Redis with 10-millisecond read latency for serving. Offline path: Spark batch reads the same Kafka topic from S3 Parquet, computes the same features, writes to S3 Parquet feature store with Feast catalog. Training: as-of joins where feature_ts less-than-or-equal-to label_ts to prevent label leakage. Failure modes: Flink and Spark compute features differently (centralized feature library shared between both paths), Redis OOM (eviction policy, separate hot and cold features), training-serving skew (monitor feature distributions, alert on drift).

Payment reconciliation with idempotency. Postgres transactional events stream via Debezium to Kafka. Kafka Connect S3 Sink writes raw immutable events to S3. Spark daily reconciliation job reads the S3 partitions, joins to the payment processor's settlement reports (also landed in S3), and produces a reconciled fact table with txn_id, settlement_id, status, and any discrepancy. Snowflake MERGE on (txn_id, run_id) upserts to the reconciliation table. Idempotency comes from the run_id baked into the output: re-running yesterday's reconciliation today produces the same result, which is the on-call retry story.

The 3-failure-modes-per-component drill at L5+. For Kafka: broker dies (ISR replication handles), partition skew (key distribution review), consumer lag (autoscale). For Spark Structured Streaming: executor OOM, watermark too aggressive, checkpoint corruption. For Snowflake MERGE: deadlock, partition not yet committed, schema drift. For Redis online feature store: OOM eviction, slow replication, hot-key skew. The data engineer who names these proactively scores above the L4 bar.

Data Engineering System Design Questions

End-to-end system design problems for data engineer interview prep.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).

Pipeline Architecture (123)

Common questions

What scenarios do data engineering system design rounds use?
Six recur most often: 10B-event-per-day clickstream ingestion, daily Postgres-to-Snowflake CDC via Debezium, ML feature store with online and offline paths, daily payment reconciliation with idempotency, multi-region active-active warehouse, real-time analytics dashboard with micro-batch trigger. Each has a canonical architecture; company-specific variations swap AWS-native, GCP-native, or Spark-and-Iceberg.
How does Debezium fit into a CDC pipeline?
Debezium watches a source database's write-ahead log (Postgres WAL, MySQL binlog, MongoDB oplog) and emits change events (insert, update, delete) to Kafka. Each event includes the before-state, after-state, and operation type. Downstream consumers read from Kafka and apply the changes to a destination (data lake S3, warehouse Snowflake, OLAP store ClickHouse). Debezium provides at-least-once delivery; downstream consumers must dedup on a composite key (pk, op_ts) for idempotency.
How does a data engineer design exactly-once semantics in a streaming pipeline?
Exactly-once is a combination of at-least-once delivery (Kafka with replication, Debezium with snapshot recovery) plus idempotent consumer logic (dedup on composite key, MERGE INTO with run_id, transactional writes with two-phase commit). Pure exactly-once at the message level is rarely needed; what business cares about is exactly-once effect, which is achieved with idempotency. Flink and Spark Structured Streaming offer end-to-end exactly-once with checkpoints and transactional sinks.
What is the as-of join in an ML feature store?
An as-of join matches a training label to feature values that existed at the time of the label, preventing future leakage. For each label (label_id, label_ts), retrieve the feature value where feature_ts less-than-or-equal-to label_ts and feature_ts is the maximum such value. SQL pattern: ROW_NUMBER OVER (PARTITION BY label_id ORDER BY feature_ts DESC) within a join on feature_ts less-than-or-equal-to label_ts, filter rn equals 1. Feast and Tecton handle this natively; custom feature stores implement the SQL.
How does a data engineer ensure idempotent batch ETL?
Three patterns. Run_id baked into output partition: re-running with the same run_id overwrites the same partition. MERGE INTO with composite natural key: re-running upserts to the same rows. Append-only with version column: re-running adds new versions, downstream queries filter on max(version). Stripe and Netflix data engineering teams use combinations of all three depending on the table.
What is the late-arriving data problem in streaming?
Events arrive at the processing system after their event_time has passed: a phone offline for 3 hours, a retry of a failed network call, an out-of-order Kafka message due to partition rebalance. Solutions: watermarks (Flink and Spark Structured Streaming have explicit watermark configuration that bounds lateness), allowed lateness (events more than N hours late are dropped or sent to a dead-letter queue), idempotent MERGE downstream (late events update past aggregates without overwriting). Netflix data engineering interviews emphasize this.
What is the difference between micro-batch and true streaming?
Micro-batch (Spark Structured Streaming default) processes events in fixed time windows (1 second to 5 minutes). Latency is the window duration plus processing time, typically seconds. True streaming (Flink, Kafka Streams) processes events one at a time as they arrive, with latency in milliseconds. Most analytics dashboards do not need true streaming; micro-batch at 1-minute trigger meets a 15-minute SLA with operational simplicity. Use true streaming for sub-second latency (fraud detection, ops monitoring, ad serving).