DataDriven

© 2026 DataDriven

Late Data and Watermarks

Event time and processing time diverge; watermarks tell the system when to stop waiting

Category: Pipeline Architecture
Difficulty: advanced
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: "What If Events Arrive Out of Order?", Event Time vs Processing Time, Watermarks: The Completeness Heuristic, Allowed Lateness and Side Outputs, Watermark Strategies for Real Pipelines

Lesson Sections

  1. "What If Events Arrive Out of Order?"

    What They're Really Testing

    The Unlock

    The mental model: event time tells you when it happened; processing time tells you when you learned about it. A click at 2:00 PM that arrives at 5:00 PM should be counted in the 2:00 PM window, not the 5:00 PM window. But how does the system know the 2:00 PM window is complete if events are still arriving for it at 5:00 PM? That is the watermark problem.

    The 60-Second Framework

    Step 4 is the strong-hire signal. The latency-correctness tradeoff is the ce

  2. Event Time vs Processing Time

    This is the fundamental distinction. Get it wrong and every time-windowed aggregation in your pipeline is unreliable.

    Why Processing Time Fails

    Red Flag Phrases

  3. Watermarks: The Completeness Heuristic

    A watermark is the system's assertion: 'I believe all events with event_time <= W have been received.' When the watermark advances past the end of a window, the window closes and results are emitted. Events arriving after the watermark with event_time inside a closed window are 'late.'

    How Watermarks Work

    The Latency-Correctness Tradeoff

    The strongest insight: 'A watermark is not a guarantee. It is a heuristic. Events can arrive after the watermark, and the system must have a plan for them. That

  4. Allowed Lateness and Side Outputs

    Watermarks close windows, but late events still arrive. Allowed lateness keeps windows open for a grace period after the watermark passes. Events arriving within the grace period update the window result. Events arriving after the grace period are routed to a side output.

    The Three-Tier Late Data Strategy

    Side Output Architecture

    Batch Reconciliation

    Side-output events land in cold storage. A daily batch job reads the side output, groups by the original window, and merges corrections into the aggreg

  5. Watermark Strategies for Real Pipelines

    Production watermarks are more nuanced than the textbook version. Different sources have different lateness profiles. Kafka timestamps behave differently from custom event timestamps. Multi-source pipelines need per-source watermarks merged at the join point.

    Watermark Strategies by Source

    Multi-Source Watermarks

    When joining two streams with different lateness profiles, the system watermark is the MINIMUM of the individual source watermarks. A fast source (Kafka, seconds late) joined with a slo
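
The ideas in sections 3 to 5 can be sketched in a few lines of plain Python. This is an illustrative toy, not any particular engine's API: the names `WindowedCounter`, `merged_watermark`, `WINDOW`, `MAX_DELAY`, and `ALLOWED_LATENESS` are hypothetical, the watermark is assumed to be "largest event time seen minus a fixed delay" per source, and window state is never cleaned up. It shows a window firing when the watermark passes its end, a late event inside the grace period producing an updated result, an event past the grace period going to a side output, and the system watermark taken as the minimum across sources.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (assumed value)
MAX_DELAY = 10         # assumed bound on out-of-orderness used for the watermark
ALLOWED_LATENESS = 30  # assumed grace period after the watermark passes a window end


def merged_watermark(source_watermarks):
    """System watermark at a join point: the minimum of the source watermarks."""
    return min(source_watermarks.values())


class WindowedCounter:
    """Counts events per event-time window (hypothetical helper for illustration)."""

    def __init__(self, sources):
        # A source that has produced nothing yet holds the watermark at -inf.
        self.max_seen = {s: float("-inf") for s in sources}
        self.counts = defaultdict(int)   # window_start -> event count
        self.fired = set()               # windows whose first result was emitted
        self.emitted = []                # (window_start, count, kind)
        self.side_output = []            # events that arrived past the grace period

    def watermark(self):
        per_source = {s: t - MAX_DELAY for s, t in self.max_seen.items()}
        return merged_watermark(per_source)

    def on_event(self, source, event_time):
        start = event_time - event_time % WINDOW
        end = start + WINDOW
        if self.watermark() >= end + ALLOWED_LATENESS:
            # Too late even for the grace period: keep it for batch reconciliation.
            self.side_output.append((source, event_time))
        else:
            self.counts[start] += 1
            if start in self.fired:
                # Late but within the grace period: emit a corrected result.
                self.emitted.append((start, self.counts[start], "update"))
        self.max_seen[source] = max(self.max_seen[source], event_time)
        self._fire_ready_windows()

    def _fire_ready_windows(self):
        wm = self.watermark()
        for start in sorted(self.counts):
            if start not in self.fired and wm >= start + WINDOW:
                self.fired.add(start)
                self.emitted.append((start, self.counts[start], "on_time"))


counter = WindowedCounter(sources=["clicks"])
for t in [5, 20, 58, 75, 50, 130, 10]:
    counter.on_event("clicks", t)

print(counter.emitted)      # [(0, 3, 'on_time'), (0, 4, 'update'), (60, 1, 'on_time')]
print(counter.side_output)  # [('clicks', 10)]
```

In this run the event at 75 moves the watermark to 65, which closes the 0 to 60 window with a count of 3. The event at 50 then arrives inside the 30-second grace period and triggers an update to 4. The event at 10 arrives after the watermark has reached 120, past the grace period, so it is routed to the side output. Raising `MAX_DELAY` would catch more stragglers in the first result at the cost of emitting every window later, which is the latency-correctness tradeoff described in section 3.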
