A streaming pipeline computes rolling 5-minute click counts per page from a Kafka topic
A medium Pipeline Design interview practice problem on DataDriven. Write and execute real pipeline design code with instant grading.
- Domain: Pipeline Design
- Difficulty: medium
Problem
A streaming pipeline computes rolling 5-minute click counts per page from a Kafka topic. The transform on the canvas (Spark Structured Streaming with a windowed GROUP BY) is stateful: its output for the current 5-minute window depends on every event for that page seen so far in that window. The pipeline is missing the state store that, as this section just taught, every stateful transform requires. Without a checkpointed state store, the engine cannot recover its windowed state after a restart and cannot keep watermark-driven state cleanup bounded.

Apply the stateful-vs-stateless classification from this section and add a checkpointed state store node (RocksDB on local disk, an S3-backed checkpoint location, or HDFS) co-located with the streaming transform, so the engine can persist windowed state and recover after a failure. Do not change the transform itself or the warehouse mart's slaFreshness; the only architectural delta is the state store that the stateful transform requires.
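To make the restart problem concrete, here is a minimal, engine-free sketch of a windowed count backed by a checkpoint. It is not Spark code and not the graded solution: the class name `CheckpointedWindowCounter`, the `allowed_lateness` parameter, integer-second event times, and the JSON file standing in for RocksDB, S3, or HDFS are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute windows, matching the problem statement


class CheckpointedWindowCounter:
    """Toy stand-in (hypothetical) for a stateful windowed count with a durable store."""

    def __init__(self, checkpoint_path, allowed_lateness=0):
        self.checkpoint_path = checkpoint_path
        self.allowed_lateness = allowed_lateness  # seconds the watermark trails behind
        self.counts = defaultdict(int)            # (page, window_start) -> click count
        self.max_event_time = 0
        self._restore()

    def _restore(self):
        # On startup, reload whatever the previous run persisted, if anything.
        if not os.path.exists(self.checkpoint_path):
            return
        with open(self.checkpoint_path) as f:
            saved = json.load(f)
        self.max_event_time = saved["max_event_time"]
        for page, window_start, count in saved["counts"]:
            self.counts[(page, window_start)] = count

    def checkpoint(self):
        # Write to a temp file, then rename, so a crash never leaves a half-written file.
        payload = {
            "max_event_time": self.max_event_time,
            "counts": [[p, w, c] for (p, w), c in self.counts.items()],
        }
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(payload, f)
        os.replace(tmp_path, self.checkpoint_path)

    def process(self, page, event_time):
        # Stateful step: the new count depends on every earlier event in the window.
        window_start = event_time - event_time % WINDOW_SECONDS
        self.counts[(page, window_start)] += 1
        self.max_event_time = max(self.max_event_time, event_time)

    def evict_expired(self):
        # Watermark-driven cleanup: emit and drop windows that ended at or before the watermark.
        watermark = self.max_event_time - self.allowed_lateness
        expired = [k for k in self.counts if k[1] + WINDOW_SECONDS <= watermark]
        return {k: self.counts.pop(k) for k in expired}


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    before_crash = CheckpointedWindowCounter(path)
    for t in (10, 20, 30):
        before_crash.process("/home", t)
    before_crash.checkpoint()
    del before_crash  # simulate a process crash

    after_restart = CheckpointedWindowCounter(path)
    after_restart.process("/home", 40)
    print(after_restart.counts[("/home", 0)])  # 4: three restored events plus one new
```

Without the `checkpoint()` call, the restarted counter would begin at zero and report 1 click for the window instead of 4. In Spark Structured Streaming the analogous durable location is set with the `checkpointLocation` option on the streaming write; on the canvas, that role is played by the state store node the problem asks you to add.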