Loading lesson...

Schema Evolution and Late Data: Advanced

At scale, schema policy and lateness budget become first-class design choices, not afterthoughts

At scale, schema policy and lateness budget become first-class design choices, not afterthoughts

Category
Pipeline Architecture
Difficulty
advanced
Duration
40 minutes
Challenges
0 hands-on challenges

Topics covered: Schema Enforcement in CDC Pipelines, Watermarks for Irregular Streams, Allowed Lateness Versus Accuracy, Reconciliation Passes, Late Trade Settlements

Lesson Sections

  1. Schema Enforcement in CDC Pipelines (concepts: paCdcSchemaPolicy, paSchemaQuarantine)

    Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker. What Mak

  2. Watermarks for Irregular Streams (concepts: paWatermarkStrategies)

    The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default. The Strategy Catal

  3. Allowed Lateness Versus Accuracy (concepts: paAllowedLatenessTradeoff)

    Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary. The Tradeoff Curve The righ

  4. Reconciliation Passes (concepts: paReconciliation)

    A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail

  5. Late Trade Settlements (concepts: paReconciliation, paStreamingPlusBatch, paLateData)

    A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both