Deduplicate the Stream
A hard Spark mock interview question on DataDriven. Practice with AI-powered feedback, real code execution, and a hire/no-hire decision.
- Domain: Spark
- Difficulty: Hard
- Seniority: Senior
Interview Prompt
A Structured Streaming job reads click events from Kafka, joins against a user dimension, and writes aggregated metrics to Delta Lake every 2 minutes. After a Kafka broker restart last week, the consumer group replayed 15 minutes of events, creating duplicate click counts in the output. The business team noticed inflated metrics for that window. Add watermark-based deduplication so that late or replayed events within a 30-minute window are dropped.
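One way to approach the prompt: parse an event-time column and a unique click identifier from the Kafka payload, then apply `withWatermark("event_time", "30 minutes")` followed by `dropDuplicates` on the ID and event-time columns, so Spark can bound the state it keeps for deduplication (Spark 3.5+ also offers `dropDuplicatesWithinWatermark` for this pattern). To illustrate the watermark semantics without a cluster, here is a minimal pure-Python simulation; the class, column, and field names are illustrative, not Spark APIs:

```python
from datetime import datetime, timedelta


class WatermarkDeduper:
    """Simulates withWatermark + dropDuplicates semantics:
    events are deduplicated by click_id, state older than the watermark
    (max event time seen minus the window) is evicted, and events that
    arrive later than the watermark are dropped outright."""

    def __init__(self, window=timedelta(minutes=30)):
        self.window = window
        self.max_event_time = None
        self.seen = {}  # click_id -> event_time

    @property
    def watermark(self):
        # Before any event arrives, the watermark is effectively -infinity.
        if self.max_event_time is None:
            return datetime.min
        return self.max_event_time - self.window

    def process(self, click_id, event_time):
        """Return True if the event is kept, False if it is dropped."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Evict dedup state older than the watermark (bounded memory).
        self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        if event_time < self.watermark:
            return False  # too late: outside the 30-minute window
        if click_id in self.seen:
            return False  # replayed duplicate within the window
        self.seen[click_id] = event_time
        return True


# Example: a replayed event within the window is dropped.
dedup = WatermarkDeduper()
t0 = datetime(2024, 1, 1, 12, 0)
print(dedup.process("click-1", t0))  # True  (first occurrence kept)
print(dedup.process("click-1", t0))  # False (replayed duplicate dropped)
```

This mirrors why the watermark matters in the real job: without it, the dedup state (`seen` here, a state store in Spark) grows without bound, while with it, a 15-minute Kafka replay falls inside the 30-minute window and is filtered before the aggregation inflates the counts.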
How This Interview Works
- Read the vague prompt (just like a real interview)
- Ask clarifying questions to the AI interviewer
- Write your Spark solution with real code execution
- Get instant feedback and a hire/no-hire decision