A streaming pipeline computes rolling 5-minute click counts per page from a Kafka topic
A medium Pipeline Design mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.
- Domain: Pipeline Design
- Difficulty: medium
Interview Prompt
A streaming pipeline computes rolling 5-minute click counts per page from a Kafka topic. The transform on the canvas (Spark Structured Streaming with a windowed GROUP BY) is stateful: its output for the current 5-minute window depends on every event for that page seen so far in that window. The pipeline is missing the state store that, as this section just taught, stateful transforms require. Without a checkpointed state store, the engine cannot recover windowed state after a restart, and watermark-driven cleanup cannot keep that state bounded. Apply the stateful-vs-stateless classification from this section and add a checkpointed state store node (RocksDB on local disk, an S3-backed checkpoint location, or HDFS) co-located with the streaming transform so the engine can persist windowed state and recover after a failure. Do not change the transform itself or the warehouse mart's slaFreshness; the only architectural delta is the state store the stateful transform requires.
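To make the role of the state store concrete, here is a minimal pure-Python sketch of a stateful windowed count whose state is checkpointed to disk. It is a toy model, not Spark's implementation: the class name `CheckpointedWindowCounter`, the JSON file format, the 60-second allowed lateness, and the use of non-overlapping 5-minute windows are all assumptions made for illustration. It shows the two jobs the prompt assigns to the state store: restoring per-window counts after a restart, and evicting windows once the watermark has passed them.

```python
import json
import os
import tempfile

WINDOW_SECONDS = 300  # 5-minute windows, as in the prompt


class CheckpointedWindowCounter:
    """Toy stand-in for a stateful windowed count backed by a durable state store."""

    def __init__(self, checkpoint_path, allowed_lateness=60):
        self.checkpoint_path = checkpoint_path    # hypothetical checkpoint file
        self.allowed_lateness = allowed_lateness  # assumed watermark delay, in seconds
        self.counts = {}                          # (window_start, page) -> count
        self.max_event_time = 0
        self._restore()

    def _watermark(self):
        return self.max_event_time - self.allowed_lateness

    def _restore(self):
        # On startup, reload whatever state the previous run persisted.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                saved = json.load(f)
            self.max_event_time = saved["max_event_time"]
            for window_start, page, count in saved["counts"]:
                self.counts[(window_start, page)] = count

    def _checkpoint(self):
        # Write to a temp file, then rename, so a crash mid-write
        # does not leave a half-written checkpoint behind.
        rows = [[w, p, c] for (w, p), c in self.counts.items()]
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"max_event_time": self.max_event_time, "counts": rows}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def process_batch(self, events):
        """Process one micro-batch of (event_time_seconds, page) pairs."""
        watermark = self._watermark()  # fixed for the duration of the batch
        for event_time, page in events:
            window_start = event_time - event_time % WINDOW_SECONDS
            if window_start + WINDOW_SECONDS <= watermark:
                continue  # too late: this window has already been finalized
            key = (window_start, page)
            self.counts[key] = self.counts.get(key, 0) + 1
            self.max_event_time = max(self.max_event_time, event_time)
        # Evict windows the new watermark has passed, keeping state bounded.
        watermark = self._watermark()
        self.counts = {
            (w, p): c for (w, p), c in self.counts.items()
            if w + WINDOW_SECONDS > watermark
        }
        self._checkpoint()


# Usage: simulate a crash and restart between two micro-batches.
path = os.path.join(tempfile.mkdtemp(), "state.json")
before_crash = CheckpointedWindowCounter(path)
before_crash.process_batch([(10, "/home"), (20, "/home")])

after_restart = CheckpointedWindowCounter(path)  # new process, same checkpoint
after_restart.process_batch([(40, "/home")])
print(after_restart.counts[(0, "/home")])  # 3, not 1: the first two clicks survived
```

Without the checkpoint file, the restarted counter would begin from empty state and report one click for the window instead of three, which is the failure mode the prompt describes. In a real Spark Structured Streaming job the engine manages this through its checkpoint location and state store provider rather than through hand-written code like this.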
How This Interview Works
- Read the vague prompt (just like a real interview)
- Ask clarifying questions to the AI interviewer
- Write your pipeline design solution with real code execution
- Get instant feedback and a hire/no-hire decision