Streaming system design interview questions for data engineer roles. Kafka + Flink + Spark Structured Streaming architectures. Exactly-once via at-least-once plus idempotency. Watermarks and allowed lateness. Event-time versus processing-time. Stateful streaming with RocksDB. The patterns that compose 80 percent of streaming data engineer interviews in 2026.

Streaming system design rounds for data engineer roles test six recurring concerns. Choosing the right streaming engine: Kafka Streams for in-Kafka transformations and lightweight processing; Flink for stateful streaming at scale with exactly-once and event-time windowing; Spark Structured Streaming for unified batch and streaming with the familiar Spark API; Beam on Dataflow for GCP-native; Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) for AWS-native. Tradeoffs: Flink has the strongest stateful and event-time story; Spark has the strongest batch-integration story; Kafka Streams has the lowest operational overhead but is limited to Kafka sources and sinks.

Exactly-once semantics in streaming pipelines. The hard truth is that exactly-once at the message-delivery level is impossible without coordination across producer, broker, and consumer; what the business cares about is exactly-once effect, which is achievable with at-least-once delivery plus idempotent processing. Kafka offers transactional writes, read by consumers configured with isolation.level=read_committed, for end-to-end exactly-once within the Kafka ecosystem. Flink provides exactly-once with checkpoint snapshots and two-phase commit to transactional sinks. Spark Structured Streaming provides exactly-once with checkpoint-based fault tolerance and idempotent sinks (Delta, Iceberg with MERGE).
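The "at-least-once plus idempotency" idea can be illustrated with a minimal Python sketch. It assumes an in-memory dict standing in for a keyed sink table and a hypothetical `event_id` field that uniquely identifies each event; neither comes from a real connector. Replaying the same batch, as at-least-once delivery would after a failure, leaves the result unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    event_id: str      # hypothetical unique key assigned by the producer
    account: str
    amount_cents: int


class IdempotentSink:
    """Toy keyed sink: writing the same event_id twice has no extra effect."""

    def __init__(self):
        self._rows = {}  # event_id -> Payment

    def upsert(self, event):
        """Store the event if its key is new; return True if a row was written."""
        if event.event_id in self._rows:
            return False  # duplicate delivery, ignored
        self._rows[event.event_id] = event
        return True

    def balance(self, account):
        return sum(r.amount_cents for r in self._rows.values() if r.account == account)


batch = [Payment("e1", "acct-1", 500), Payment("e2", "acct-1", 250)]
sink = IdempotentSink()

for event in batch:
    sink.upsert(event)

# Simulate at-least-once redelivery of the whole batch after a crash.
for event in batch:
    sink.upsert(event)

total = sink.balance("acct-1")  # unchanged by the replay
```

In a real pipeline the same role would be played by a keyed MERGE or a primary-key constraint in the sink rather than a Python dict.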

Event-time versus processing-time. Event-time is when the event happened at the source (the user clicked at 14:23:05). Processing-time is when the streaming engine sees the event (14:25:30 due to network latency, retries, batching). Streaming windows can be defined in either dimension. Event-time windows handle out-of-order and late-arriving events correctly but require watermarks to bound how long to wait. Processing-time windows are simpler but produce wrong results when events arrive late. Most data engineer interview rounds expect event-time windows with watermarks; processing-time is acceptable for monitoring and ops dashboards where staleness is tolerated.
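To make the distinction concrete, here is a small Python sketch that assigns the click from the example above to one-minute tumbling windows in each time dimension. The calendar date and the one-minute window size are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=1)  # assumed window size
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, size=WINDOW):
    """Floor a timestamp to the start of its tumbling window."""
    return EPOCH + ((ts - EPOCH) // size) * size


# The click happened at 14:23:05 but reached the engine at 14:25:30.
event_time = datetime(2026, 1, 1, 14, 23, 5, tzinfo=timezone.utc)
processing_time = datetime(2026, 1, 1, 14, 25, 30, tzinfo=timezone.utc)

event_time_window = window_start(event_time)            # the 14:23 window
processing_time_window = window_start(processing_time)  # the 14:25 window
```

The same click is counted in the 14:23 window under event-time and in the 14:25 window under processing-time, which is why per-minute metrics shift when events are delayed.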

Watermarks and allowed lateness. A watermark is the streaming engine's assertion that it does not expect any more events with event_time earlier than the watermark; it is a heuristic, so events can still arrive behind it, and those are the late events. Flink and Spark Structured Streaming both have explicit watermark configuration. Setting the watermark too aggressively causes late events to be dropped (or sent to a side-output); setting it too conservatively delays results and increases state size. Typical configuration: watermark = maximum event_time seen so far minus 5 minutes for most workloads, minus 1 hour for higher-latency sources. Allowed lateness keeps a window open to updates after the watermark has passed it: events arriving within allowed lateness update past results. Flink exposes this as a separate allowedLateness setting on windows; in Spark Structured Streaming the watermark delay threshold itself plays this role.
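The interaction between watermark and allowed lateness can be sketched in plain Python. This is a simplified model, not any engine's implementation: the class name, the 60-second window, and the 120-second allowed lateness are assumptions, and timestamps are plain integer seconds.

```python
from dataclasses import dataclass


@dataclass
class WatermarkTracker:
    """Watermark = max event time seen so far minus a fixed delay."""
    max_delay_s: int            # 300 matches the 5-minute setting in the text
    window_size_s: int = 60     # assumed tumbling window size
    allowed_lateness_s: int = 0
    max_event_time_s: int = 0

    @property
    def watermark_s(self):
        return self.max_event_time_s - self.max_delay_s

    def classify(self, event_time_s):
        """Advance the watermark, then decide what happens to this event."""
        self.max_event_time_s = max(self.max_event_time_s, event_time_s)
        window_end = (event_time_s // self.window_size_s + 1) * self.window_size_s
        if self.watermark_s < window_end:
            return "on_time"      # window has not fired yet
        if self.watermark_s < window_end + self.allowed_lateness_s:
            return "late_update"  # window fired; result is updated
        return "dropped"          # or routed to a side output


tracker = WatermarkTracker(max_delay_s=300, allowed_lateness_s=120)
results = [
    tracker.classify(1000),  # watermark 700, window ends 1020
    tracker.classify(1400),  # watermark 1100, window ends 1440
    tracker.classify(1000),  # watermark 1100 passed 1020, within lateness
    tracker.classify(900),   # window ended 960; 960 + 120 <= 1100
]
```

Raising `max_delay_s` would turn the last two events back into on-time events, at the cost of holding every window's state longer before emitting.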

Stateful streaming with RocksDB. Flink and Spark Structured Streaming both support RocksDB as an embedded state backend for large state (millions to billions of keys per executor); in both engines it is a configuration choice rather than the only option. State is checkpointed to durable storage (S3, GCS) for fault tolerance. Sessionization, deduplication, joins, and aggregations all build state. The state size grows with the cardinality of the partitioning key and the watermark configuration; oversized state causes executor OOM and checkpoint timeouts. Senior data engineer system design rounds test whether the candidate sizes state correctly and discusses RocksDB tuning (block cache size, compaction).
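A back-of-envelope way to size state is sketched below for a deduplication operator that keeps one entry per key until the watermark passes it. The input figures (100k new keys per second, 100 bytes per entry) are illustrative assumptions, and the estimate ignores RocksDB overhead and compression.

```python
def dedup_state_bytes(new_keys_per_s, retention_s, bytes_per_entry):
    """Rough state size: every key seen within the retention period is held."""
    return new_keys_per_s * retention_s * bytes_per_entry


# Same workload under the two watermark settings mentioned in the text.
five_minute = dedup_state_bytes(100_000, 5 * 60, 100)  # 3 GB
one_hour = dedup_state_bytes(100_000, 60 * 60, 100)    # 36 GB

growth_factor = one_hour // five_minute  # state grows 12x with the longer watermark
```

The point of the exercise is the ratio: moving from a 5-minute to a 1-hour watermark multiplies retained state by 12 for the same key arrival rate.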

Companies whose data engineer interviews emphasize streaming heavily: Netflix (Mantis platform, Flink for ops monitoring, Spark Structured Streaming for analytics), Uber (Flink for ride dispatching analytics, Kafka Streams for some flows), Stripe (Kafka with idempotent consumers for financial-data exactly-once), Meta (internal stream-processing tools, late-arriving conversion handling), Pinterest (Kafka and Flink for real-time recommendations).

Streaming System Design Interview Questions

Streaming pipeline design problems for data engineer interview prep.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).


Common questions

Which streaming engine should a data engineer pick: Flink, Spark Structured Streaming, or Kafka Streams?
Flink for stateful streaming at scale with exactly-once and event-time windowing as primary requirements (Netflix, Uber, Stripe). Spark Structured Streaming for unified batch and streaming with the familiar Spark API and Spark-team operational expertise (Databricks, generic Spark shops). Kafka Streams for in-Kafka transformations with lowest operational overhead and Kafka-only data (logging, metrics enrichment). Beam on Dataflow if you are on GCP.
How does a streaming pipeline achieve exactly-once semantics?
Through at-least-once delivery (Kafka with replication, source connector with snapshot recovery) plus idempotent processing (dedup on composite key, MERGE INTO with run_id, transactional writes). Pure exactly-once at the message level is impossible without producer-broker-consumer coordination; what matters is exactly-once effect. Flink + Kafka transactional commits provide end-to-end exactly-once within Flink-managed sinks. Spark Structured Streaming provides exactly-once with checkpoint-based fault tolerance and idempotent sinks (Delta, Iceberg).
What is the difference between event-time and processing-time?
Event-time is when the event happened at the source (user clicked at 14:23:05). Processing-time is when the streaming engine sees the event (14:25:30 due to network delay or retry). Event-time windows produce correct results for out-of-order events but require watermarks to bound waiting. Processing-time windows are simpler but produce wrong results when events arrive late. Most data engineer interview rounds expect event-time windows.
What is a watermark in streaming?
A watermark is the streaming engine's assertion that it does not expect any more events with event_time earlier than the watermark. Flink and Spark Structured Streaming both have explicit watermark configuration. Typical setting: watermark = maximum event_time seen so far minus 5 minutes for most workloads. Setting it too aggressively drops late events; setting it too conservatively delays results and grows state size. The watermark configuration is a senior data engineer system design rubric item.
What is allowed lateness in Flink and Spark Structured Streaming?
Allowed lateness keeps a window open to updates after the watermark has passed it. Events arriving after the watermark passed but within allowed lateness update past results; events arriving after allowed lateness are dropped or sent to a side-output. Flink exposes this as a separate allowedLateness setting on windows; in Spark Structured Streaming the watermark delay threshold itself plays this role. Useful for handling late data from phones that were offline without growing state indefinitely. Configure based on the longest legitimate lateness observed in production.
How does stateful streaming work with RocksDB?
Flink and Spark Structured Streaming both support RocksDB as an embedded state backend for large state (millions to billions of keys per executor). State is checkpointed to durable storage (S3, GCS) for fault tolerance. Sessionization, deduplication, joins, and aggregations all build state. State size grows with the partitioning-key cardinality and the watermark configuration; oversized state causes executor OOM and checkpoint timeouts.
What is the difference between Kafka and Kinesis from a data engineer interview perspective?
Functionally similar: distributed, partitioned, durable, append-only log. Kafka is the open-source default with the largest ecosystem (Kafka Connect, Schema Registry, ksqlDB, Confluent). Kinesis is AWS-managed with simpler ops and tighter AWS integration (Firehose, Kinesis Data Analytics, IAM). Per-unit throughput differs: a Kinesis shard is capped at 1 MB/s of writes, while a Kafka partition's throughput depends on broker and producer configuration and is typically several times higher. Kafka generally has lower latency and more flexibility. Kinesis has lower operational overhead. Choose based on the company's existing stack.
How does a data engineer size a streaming pipeline?
Start with throughput: average and peak events per second, and the size of each event in bytes. Convert to bytes per second. Divide by per-shard throughput (1 MB/s on Kinesis; configurable on Kafka, where 10 MB/s per partition is a typical planning figure). Add 2-3x headroom for rebalances and growth. Sample math, assuming peak is 5x average: 10B events/day = 116k events/sec average, 580k peak; with 1 KB events that is 580 MB/s peak = 58 Kafka partitions or 580 Kinesis shards. These figures cover peak only; multiply by the headroom factor for the final count.
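The sample math above can be checked with a short Python sketch. The function name and the 5x peak factor are assumptions for illustration; 1 KB is taken as 1,000 bytes. Without rounding the intermediate values, the Kinesis figure comes out at 579 shards rather than the rounded 580.

```python
import math

SECONDS_PER_DAY = 86_400


def partitions_needed(events_per_day, peak_factor, event_bytes,
                      per_partition_bytes_per_s, headroom=1.0):
    """Partitions (or shards) needed to absorb peak throughput times headroom."""
    avg_events_per_s = events_per_day / SECONDS_PER_DAY
    peak_bytes_per_s = avg_events_per_s * peak_factor * event_bytes
    return math.ceil(peak_bytes_per_s * headroom / per_partition_bytes_per_s)


# 10B events/day, peak assumed at 5x average, 1 KB events.
kafka_partitions = partitions_needed(10_000_000_000, 5, 1_000, 10_000_000)
kinesis_shards = partitions_needed(10_000_000_000, 5, 1_000, 1_000_000)

# Same inputs with 2x headroom on top of peak.
kafka_with_headroom = partitions_needed(10_000_000_000, 5, 1_000, 10_000_000, headroom=2.0)
```

Keeping the peak factor and headroom as explicit parameters makes it easy to show an interviewer how the partition count moves when either assumption changes.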