Event time and processing time diverge; watermarks tell the system when to stop waiting
Topics covered: "What If Events Arrive Out of Order?", Event Time vs Processing Time, Watermarks: The Completeness Heuristic, Allowed Lateness and Side Outputs, Watermark Strategies for Real Pipelines
What They're Really Testing

The Unlock
The mental model: event time tells you WHEN it happened. Processing time tells you WHEN YOU LEARNED about it. A click at 2:00 PM that arrives at 5:00 PM should be counted in the 2:00 PM window, not the 5:00 PM window. But how does the system know the 2:00 PM window is complete if events are still arriving for it at 5:00 PM? That is the watermark problem.

The 60-Second Framework
Step 4 is the strong-hire signal. The latency-correctness tradeoff is the ce
The distinction between event time and processing time is fundamental. Get it wrong and every time-windowed aggregation in your pipeline is unreliable.

Why Processing Time Fails

Red Flag Phrases
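As a concrete illustration of why processing time fails, the sketch below assigns the 2:00 PM click from the mental model to a tumbling window twice: once by event time and once by arrival time. The one-hour window size, the calendar date, and the helper name `window_start` are assumptions made for the example, not part of any particular framework.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed tumbling window size
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Return the start of the tumbling window that contains ts."""
    return ts - (ts - EPOCH) % size


# A click that happened at 2:00 PM but reached the pipeline at 5:00 PM.
event_time = datetime(2024, 1, 1, 14, 0)    # when it happened
arrival_time = datetime(2024, 1, 1, 17, 0)  # when the system learned about it

event_time_window = window_start(event_time)
processing_time_window = window_start(arrival_time)

print("event-time window:     ", event_time_window)
print("processing-time window:", processing_time_window)
```

Windowing by arrival time places the click in the 5:00 PM bucket, three hours away from where it belongs, which is the failure the mental model describes.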
A watermark is the system's assertion: 'I believe all events with event_time <= W have been received.' When the watermark advances past the end of a window, the window closes and results are emitted. Events arriving after the watermark with event_time inside a closed window are 'late.'

How Watermarks Work

The Latency-Correctness Tradeoff
The strongest insight: 'A watermark is not a guarantee. It is a heuristic. Events can arrive after the watermark, and the system must have a plan for them. That
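The watermark mechanics described here can be sketched in a few lines of plain Python. This is a minimal model, not any engine's implementation: it assumes a bounded-out-of-orderness watermark (highest event time seen minus a fixed bound), 60-second tumbling windows, and integer timestamps in seconds. The class name `WatermarkedCounter` and both constants are hypothetical.

```python
from collections import defaultdict

WINDOW_SIZE = 60       # seconds; assumed tumbling window size
MAX_OUT_OF_ORDER = 10  # seconds; assumed bound on out-of-orderness


class WatermarkedCounter:
    """Counts events per event-time window and closes windows by watermark."""

    def __init__(self, window_size=WINDOW_SIZE, max_out_of_order=MAX_OUT_OF_ORDER):
        self.window_size = window_size
        self.max_out_of_order = max_out_of_order
        self.max_event_time = None
        self.open_windows = defaultdict(int)  # window start -> running count
        self.emitted = {}                     # window start -> emitted count
        self.late = []                        # events whose window had closed

    @property
    def watermark(self):
        # Heuristic: assume nothing older than (max seen - bound) is still coming.
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_out_of_order

    def on_event(self, event_time):
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark:
            self.late.append(event_time)  # window already closed: event is late
            return
        self.open_windows[start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Close every window whose end the watermark has reached.
        for s in sorted(self.open_windows):
            if s + self.window_size <= self.watermark:
                self.emitted[s] = self.open_windows.pop(s)


counter = WatermarkedCounter()
for t in [5, 20, 50, 62, 58, 75, 30]:  # event times in arrival order
    counter.on_event(t)

print(counter.emitted)             # closed windows and their counts
print(dict(counter.open_windows))  # windows still waiting
print(counter.late)                # events that missed their window
```

In this run the out-of-order event at 58 still lands in its window because the watermark had not yet reached 60, while the event at 30 arrives after the window closed and is flagged as late. Raising `MAX_OUT_OF_ORDER` would catch more stragglers at the cost of emitting results later.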
Watermarks close windows, but late events still arrive. Allowed lateness keeps windows open for a grace period after the watermark passes. Events arriving within the grace period update the window result. Events arriving after the grace period are routed to a side output.

The Three-Tier Late Data Strategy

Side Output Architecture

Batch Reconciliation
Side output events are cold storage. A daily batch job reads the side output, groups by the original window, and merges corrections into the aggreg
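The three tiers and the reconciliation step can be sketched as follows. This is an illustrative model under stated assumptions: 60-second windows, a 30-second allowed lateness, and each arrival supplied as an `(event_time, watermark_at_arrival)` pair. The function names `classify`, `run_stream`, and `reconcile` are hypothetical, and the in-memory list stands in for whatever storage backs the side output.

```python
from collections import Counter

WINDOW_SIZE = 60       # seconds; assumed tumbling window size
ALLOWED_LATENESS = 30  # seconds; assumed grace period after the watermark


def window_start(event_time, window_size=WINDOW_SIZE):
    return event_time - event_time % window_size


def classify(event_time, watermark):
    """Return which of the three tiers an arriving event falls into."""
    window_end = window_start(event_time) + WINDOW_SIZE
    if watermark < window_end:
        return "on_time"      # window still open
    if watermark < window_end + ALLOWED_LATENESS:
        return "late_update"  # within the grace period: update the result
    return "side_output"      # past the grace period: divert


def run_stream(arrivals):
    """Streaming path: count on-time and in-grace events, divert the rest."""
    counts, side_output = Counter(), []
    for event_time, watermark in arrivals:
        if classify(event_time, watermark) == "side_output":
            side_output.append(event_time)
        else:
            counts[window_start(event_time)] += 1
    return counts, side_output


def reconcile(counts, side_output):
    """Batch path: group diverted events by original window and merge."""
    corrections = Counter(window_start(t) for t in side_output)
    return counts + corrections


# (event_time, watermark when the event arrived); all belong to window [0, 60)
arrivals = [(10, 0), (50, 45), (40, 70), (20, 85), (30, 95), (15, 200)]
streaming_counts, side_output = run_stream(arrivals)
corrected = reconcile(streaming_counts, side_output)

print(dict(streaming_counts), side_output, dict(corrected))
```

The streaming result stays slightly low until the batch job runs, which is the trade the side output makes: windows are not held open indefinitely, and correctness is recovered later.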
Production watermarks are more nuanced than the textbook version. Different sources have different lateness profiles. Kafka timestamps behave differently from custom event timestamps. Multi-source pipelines need per-source watermarks merged at the join point.

Watermark Strategies by Source

Multi-Source Watermarks
When joining two streams with different lateness profiles, the system watermark is the MINIMUM of the individual source watermarks. A fast source (Kafka, seconds late) joined with a slo
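The minimum rule for multi-source watermarks can be sketched as below. The per-source bounds (5 seconds for the Kafka source, one hour for a generic slow source) are assumed values chosen for illustration, and `SourceWatermark` and `combined_watermark` are hypothetical names rather than a real library API.

```python
class SourceWatermark:
    """Tracks one source's watermark: max event time seen minus its bound."""

    def __init__(self, max_out_of_order):
        self.max_out_of_order = max_out_of_order
        self.max_event_time = None

    def observe(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def value(self):
        if self.max_event_time is None:
            return float("-inf")  # nothing seen yet: claim no progress
        return self.max_event_time - self.max_out_of_order


def combined_watermark(sources):
    """The join may only advance as far as its slowest input allows."""
    return min(source.value for source in sources.values())


sources = {
    "kafka": SourceWatermark(max_out_of_order=5),            # assumed: seconds late
    "slow_source": SourceWatermark(max_out_of_order=3600),   # assumed: up to an hour late
}

# Both sources have seen events up to event time 10_000 (seconds).
sources["kafka"].observe(10_000)
sources["slow_source"].observe(10_000)

print(sources["kafka"].value)        # fast source's own watermark
print(sources["slow_source"].value)  # slow source's own watermark
print(combined_watermark(sources))   # watermark at the join point
```

Because the join takes the minimum, the fast source's data waits in state until the slow source's watermark catches up, so the lateness bound of the slowest input sets the latency of the joined output.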