# A streaming pipeline computes rolling 5-minute click counts per page from a Kafka topic

Canonical URL: <https://datadriven.io/problems/a-streaming-pipeline-computes-rolling-5-minute-click-counts-51d9495a>

Domain: Pipeline Design · Difficulty: medium

## Problem

A streaming pipeline computes rolling 5-minute click counts per page from a Kafka topic. The transform on the canvas (Spark Structured Streaming with a windowed GROUP BY) is stateful: its output for the current 5-minute window depends on every event seen so far for that page in that window.

The pipeline is missing the state store that, as this section just taught, every stateful transform requires. Without a checkpointed state store, the engine loses its in-progress window counts on restart and cannot manage the watermark-driven cleanup that keeps window state bounded.

Apply the stateful-vs-stateless classification from this section and add a checkpointed state store node (RocksDB on local disk, an S3-backed checkpoint location, or HDFS) co-located with the streaming transform, so the engine can persist windowed state and recover after a failure. Do not change the transform itself or the warehouse mart's slaFreshness; the only architectural delta is the state store that the stateful transform requires.
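To make the role of the state store concrete, here is a minimal pure-Python sketch of the idea, not Spark's actual implementation. Everything in it is an assumption made for illustration: the `CheckpointedWindowCounter` class, the JSON checkpoint file, the `(offset, event_time, page)` event shape, and the use of fixed (tumbling) 5-minute windows in place of the rolling windows the problem describes. It shows the two things the missing node provides: window counts that survive a restart, and a place where the watermark can expire old windows so state stays bounded.

```python
import json
import os
import tempfile
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute windows, matching the problem statement


class CheckpointedWindowCounter:
    """Toy stateful transform (hypothetical): click counts per page per window."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.counts = defaultdict(int)  # key: "<window_start>|<page>"
        self.next_offset = 0            # first input offset not yet counted
        # Recovery: reload counts and offset if a checkpoint already exists.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                saved = json.load(f)
            self.counts.update(saved["counts"])
            self.next_offset = saved["next_offset"]

    def process(self, offset, event_time, page):
        if offset < self.next_offset:
            return  # counted before the restart; skip it on replay
        window_start = event_time - event_time % WINDOW_SECONDS
        self.counts[f"{window_start}|{page}"] += 1
        self.next_offset = offset + 1

    def checkpoint(self):
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"counts": dict(self.counts),
                       "next_offset": self.next_offset}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def expire(self, watermark):
        # Drop every window that ended at or before the watermark.
        for key in list(self.counts):
            window_start = int(key.split("|", 1)[0])
            if window_start + WINDOW_SECONDS <= watermark:
                del self.counts[key]


# Hypothetical events: (offset, event_time in seconds, page)
EVENTS = [(0, 10, "/home"), (1, 20, "/home"), (2, 40, "/pricing"), (3, 310, "/home")]


def run_demo():
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "state.json")

        first = CheckpointedWindowCounter(path)
        for event in EVENTS[:2]:
            first.process(*event)
        first.checkpoint()
        # Simulated crash: `first` is abandoned and a new instance starts up.

        second = CheckpointedWindowCounter(path)
        for event in EVENTS:  # replay from the beginning of the topic
            second.process(*event)
        before_expiry = dict(second.counts)

        second.expire(watermark=300)
        after_expiry = dict(second.counts)
        return before_expiry, after_expiry


if __name__ == "__main__":
    print(run_demo())
```

In the demo, the restarted instance reloads the two `/home` clicks from the checkpoint and skips them on replay, so the first window ends with a count of 2 for `/home` rather than 0 or 4. Calling `expire(watermark=300)` then removes the closed window and leaves only the one starting at 300. In Spark Structured Streaming the engine handles the equivalent bookkeeping itself once the query is given a `checkpointLocation`, which is why the fix in this problem is an added node and not a change to the transform.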

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/a-streaming-pipeline-computes-rolling-5-minute-click-counts-51d9495a)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io).