Loading lesson...
Schema Evolution and Late Data: Intermediate
Schema registries and watermarks turn ad-hoc tolerance into engineered guarantees
Schema registries and watermarks turn ad-hoc tolerance into engineered guarantees
- Category
- Pipeline Architecture
- Difficulty
- intermediate
- Duration
- 32 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Schema Registries: Where They Live, The Expand-Contract Pattern, Event Time Versus Processing Time, Watermarks: The Engine's Promise, 1-Hour Allowed Lateness
Lesson Sections
- Schema Registries: Where They Live (concepts: paSchemaRegistry)
The beginner tier treated schemas as an implicit contract. The intermediate tier turns that contract into a system of record. A schema registry is a service that stores schema definitions, assigns each one an immutable version, and runs compatibility checks before accepting a new version. Producers register the schema before publishing data under it. Consumers fetch the schema by version when they read. The registry is the single source of truth for what shape data should have at each moment in
- The Expand-Contract Pattern (concepts: paExpandContract)
The additive default works for most schema changes. It does not work for the breaking ones. When a column must be renamed, dropped, or restructured, every consumer downstream has to adapt. Doing this in a single deploy is impossible at any reasonable scale. The expand-contract pattern is the production technique for rolling out a breaking change without a flag day. It splits the change into four phases that allow producers and consumers to migrate independently. The Four Phases Each phase is a d
- Event Time Versus Processing Time (concepts: paEventTimeVsProcessingTime)
The beginner tier introduced event time and processing time as a pair. The intermediate tier turns the distinction into the foundation of every streaming aggregation. The choice of which time domain to use is not a stylistic preference. It is the single most consequential decision in a streaming pipeline. It determines what numbers the consumer sees, how the system handles late events, and how much state the engine has to keep around. The Two Domains, Restated Most production streaming systems h
- Watermarks: The Engine's Promise (concepts: paWatermarks)
If a streaming engine waits forever for late events, no window ever closes and no result is ever produced. If it does not wait at all, every late event is dropped and every aggregation is wrong. The watermark is the compromise. A watermark is a timestamp that the engine emits, periodically, declaring that no events with event_time earlier than the watermark will be processed against an open window. The watermark is the engine's commitment to a closing rule. What a Watermark Actually Is A waterma
- 1-Hour Allowed Lateness (concepts: paAllowedLateness, paWatermarks)
The pieces from the prior sections combine into a single concrete configuration. A streaming aggregation tumbles in 5-minute event-time windows, with a watermark using bounded out-of-orderness of 60 seconds, plus an allowed lateness of 60 minutes. This configuration shows up in production at scale; the numbers are tuned per workload. The walkthrough below traces what happens to events that arrive on time, slightly late, and very late. The Configuration The configuration says four things. Windows