Deduplicate the Stream

A hard Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.

Domain: spark
Difficulty: hard
Seniority: senior

Problem

A Structured Streaming job reads click events from Kafka, joins against a user dimension, and writes aggregated metrics to Delta Lake every 2 minutes. After a Kafka broker restart last week, the consumer group replayed 15 minutes of events, creating duplicate click counts in the output. The business team noticed inflated metrics for that window. Add watermark-based deduplication so that late or replayed events within a 30-minute window are dropped.
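In Spark, this kind of fix is typically a call to `withWatermark("eventTime", "30 minutes")` followed by `dropDuplicates` on the event key, which keeps deduplication state only for events newer than the watermark (max observed event time minus the delay). Since a real streaming job needs a Kafka source to run, the pure-Python sketch below (all names hypothetical, not the graded solution) only simulates those state semantics: duplicates within the window are dropped, events older than the watermark are dropped, and old state is purged.

```python
from datetime import datetime, timedelta

# Assumed 30-minute watermark delay, mirroring withWatermark("eventTime", "30 minutes")
WATERMARK_DELAY = timedelta(minutes=30)

def dedup_with_watermark(events):
    """Drop replayed duplicates and events older than the watermark.

    `events` is an iterable of (event_id, event_time) pairs in arrival
    order; returns the events that survive deduplication.
    """
    seen = {}               # event_id -> event_time held in dedup state
    max_event_time = None   # watermark = max_event_time - WATERMARK_DELAY
    kept = []
    for event_id, event_time in events:
        watermark = max_event_time - WATERMARK_DELAY if max_event_time else None
        if watermark and event_time < watermark:
            continue        # too late: its dedup state was already purged
        if event_id in seen:
            continue        # replayed duplicate within the 30-minute window
        seen[event_id] = event_time
        kept.append((event_id, event_time))
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # purge state older than the advanced watermark, as Spark does per trigger
        watermark = max_event_time - WATERMARK_DELAY
        seen = {k: t for k, t in seen.items() if t >= watermark}
    return kept
```

One consequence worth noting: once an event's state ages past the watermark it is purged, so a replay arriving more than 30 minutes late is dropped by the watermark check rather than the duplicate check. The same trade-off applies to Spark's `dropDuplicates` with a watermark, which is why the window must cover the longest expected replay (the 15-minute Kafka replay fits comfortably inside 30 minutes).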

Practice This Problem

Solve this Spark problem with real code execution. DataDriven runs your solution and grades it automatically.