# Deduplicate the Stream

> Kafka replayed 15 minutes. Your dashboard doubled.

Canonical URL: <https://datadriven.io/problems/spark_streaming_kafka_dedup>

Domain: PySpark · Difficulty: hard · Seniority: L5

## Problem

A Structured Streaming job reads click events from Kafka, joins against a user dimension, and writes aggregated metrics to Delta Lake every 2 minutes. After a Kafka broker restart last week, the consumer group replayed 15 minutes of events, creating duplicate click counts in the output. The business team noticed inflated metrics for that window. Add watermark-based deduplication so that late or replayed events within a 30-minute window are dropped.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/spark_streaming_kafka_dedup)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.