A capital-markets data platform at a top-five investment bank ran a streaming pipeline that aggregated trade fills into per-symbol minute bars used by the firm's algorithmic desks. The streaming engine ran with a 30-second watermark and 5-minute allowed lateness. Trades that settled across the international date line, however, sometimes booked back to T-1 a full business day later. Those trades were dropped by the streaming engine and silently missing from the bars consumed by the algos. The discrepancy was caught only when a regulator's audit reconciled trade blotters to position reports and found a 0.03 percent gap. The fix was a nightly reconciliation pass: a batch job that re-aggregated the prior day's trades from the canonical settlement system and overwrote the corresponding minute bars in the streaming sink. The streaming pipeline serves real-time consumers; the batch pass serves the audit-grade truth. Both run, both write to the same table, and the contract is that the latest writer wins per minute bar. This lesson covers the production tradeoffs that drive these designs: how schemas evolve in CDC pipelines where DDL upstream becomes data downstream, how watermarks behave on irregular event streams, how lateness budgets trade memory for correctness, and how reconciliation jobs close the gap that streaming engines structurally cannot close.
Schema Enforcement in CDC Pipelines
Daily Life
Interviews
Apply additive, block-breaking, or quarantine policies to a CDC pipeline and reason about the operational tradeoffs of each.
Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker.
What Makes CDC Schema-Fragile
Property
Implication for Schema
DDL is data
Each schema change becomes an event in the change stream that downstream must process
Producer cannot pre-validate
The DBA does not consult a registry before running ALTER TABLE; the change is already in the WAL
Mixed compatibility shapes
Old rows in the stream still carry the old shape; new rows carry the new shape
Cascading downstream
Lakehouse tables, search indexes, ML feature pipelines all read from the same change stream
Operational urgency
Production DBs change fast; CDC pipelines must absorb or block within minutes
Debezium, AWS DMS, Maxwell, and similar tools all surface schema-change events as part of the stream. Each tool encodes the change differently, but the pattern is the same: a record arrives in the topic indicating that the upstream table now has a new column, a renamed column, or a dropped column. Downstream consumers see the schema-change event before they see the next data event under the new shape. The window of time between the two events is the policy decision space.
New columns flow through. Drops and renames are blocked or quarantined. The CDC pipeline keeps running for the additive case, which is more than 80 percent of real-world changes.
Block Breaking
The strict policy
Any non-additive change halts the pipeline. An on-call engineer is paged to evaluate before any data flows under the new schema. Used for audit-grade systems where silent drift is unacceptable.
Quarantine Incompatible
The middle ground
Incompatible rows are routed to a quarantine topic. The main pipeline continues with the additive subset. A human reviews quarantine before merging. Common in lakehouse ingestion.
Policy choice depends on consumer tolerance. A search index ingestion pipeline can run under allow-additive: a missing column degrades search relevance gracefully. A regulatory reporting pipeline cannot: a missing column corrupts a filing. A lakehouse ingestion pipeline often runs under quarantine: incompatible rows are not lost, but they do not flow into the canonical curated layer until the schema decision is made.
▸Which DDL classes are allowed without a human in the loop
▸Where blocked events go (DLQ, schema-quarantine topic, halt-and-page)
▸Who owns the on-call response and within what SLA
▸How the policy interacts with the schema registry's compatibility setting
•No Schema Policy
Every DDL change is a potential outage
Drift is silent until consumer dashboards break
On-call response is improvised under pressure
Recovery requires reconstructing the change history from logs
✓Explicit Policy
Additive changes flow through automatically; the team handles only the rare destructive case
Drift is impossible: the policy either accepts, blocks, or quarantines explicitly
On-call response is rehearsed: the policy names the action
Schema-change events are an audit trail by construction
1
/* Classify a sample stream of CDC schema-change events */
2
WITHeventsAS(
3
SELECT
4
1ASevt_id,
5
'transactions'AStbl,
6
'ADD_COLUMN'ASkind,
7
'merchant_country'ASfield
8
9
UNIONALL
10
11
SELECT
12
2,
13
'transactions',
14
'ADD_COLUMN',
15
'risk_score'
16
17
UNIONALL
18
19
SELECT
20
3,
21
'transactions',
22
'RENAME_COLUMN',
23
'amount_cents -> amount_minor'
24
25
UNIONALL
26
27
SELECT
28
4,
29
'transactions',
30
'DROP_COLUMN',
31
'legacy_flag'
32
33
UNIONALL
34
35
SELECT
36
5,
37
'transactions',
38
'TYPE_CHANGE',
39
'amount: int -> bigint'
40
)
41
42
SELECT
43
evt_id,
44
tbl,
45
kind,
46
field,
47
CASE
48
WHENkind='ADD_COLUMN'THEN'apply'
49
WHENkind='TYPE_CHANGE'
50
ANDfieldLIKE'%int -> bigint%'THEN'apply_widening'
51
WHENkindIN(
52
'RENAME_COLUMN',
53
'DROP_COLUMN'
54
)THEN'quarantine_and_page'
55
ELSE'quarantine'
56
ENDASpolicy_action
57
FROMevents
58
ORDERBYevt_id
The most expensive CDC outages happen when the upstream DBA assumes the data team will absorb a destructive change overnight. With a quarantine policy, the change does not absorb silently. It surfaces as a halted topic and a page within minutes.
The cost-benefit calculation for each policy depends on how often destructive changes happen and what the consumer side can tolerate. A platform that ingests from twenty operational databases owned by ten different teams will see a destructive change every few weeks. With allow-additive plus quarantine, the operations cost is bounded: the on-call engineer reviews the quarantine queue, validates that the change was intentional, and either applies it manually or works with the upstream team to revert. With block-breaking, the same change halts ingestion entirely, which protects audit consumers but disrupts every consumer downstream. With no policy at all, the change flows through and silently corrupts downstream state. The right choice depends on which kind of pain the team can absorb most cheaply, and that answer is rarely the same across all subjects in the platform.
The schema-history topic deserves its own attention. Debezium emits a parallel topic that records every DDL statement observed on the source database, in order. The reason is that to replay the change stream from any historical offset, the engine must know what schema each record was written under. The history topic is what makes replay deterministic. A team that disables schema history saves a small amount of broker storage and gives up the ability to reprocess history correctly; the trade is rarely worth it. The history topic is the same idea as a schema registry, applied at the connector level rather than the topic level.
✓Do
Separate schema-change events from data events in the topic layout
Pair the registry's compatibility mode with a downstream sink policy that matches
Put a runbook entry next to every quarantine action: who responds, in what SLA, with what tools
✗Don't
Treat ALTER TABLE upstream as a data team problem; it is a contract violation
Allow rename and drop to flow through silently; the cost shows up in audit
Skip the schema-history topic; without it, replays cannot reconstruct the right shape per offset
TIP
When designing CDC enforcement, write the runbook for the destructive case before writing the connector configuration. The connector is the easy part; the operational response to a blocked change is what the policy actually buys.
Watermarks for Irregular Streams
Daily Life
Interviews
Choose a watermark strategy appropriate to a source's lateness and idleness profile, and reason about the failure modes of each.
The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default.
The Strategy Catalog
Strategy
Watermark Computation
Failure Mode If Misapplied
Ascending timestamps
watermark = max event_time observed
Any out-of-order event is dropped; brittle in real systems
Bounded out-of-orderness
watermark = max event_time - fixed_lag
Long-tail late events past the lag are dropped silently
Punctuated
watermark advances on explicit marker events from producer
Stalls forever if producer fails to emit markers
Periodic with idle detection
Bounded out-of-orderness plus advance-on-idle when no events arrive
Idle-advance can leak past genuine slow producers
Per-key with global aggregation
Each key tracks its own watermark; global = min across keys
One slow key can stall every output globally
Each strategy makes a different bet about what the stream looks like. Picking the wrong strategy means either dropping legitimate events or stalling the pipeline. The choice should follow from a measured profile of the source: lateness percentiles, idle-period distribution, presence or absence of explicit markers.
When Each Strategy Wins
Ascending timestamps: a single ordered Kafka topic where the producer is the database WAL and offsets imply order.
Bounded out-of-orderness: web-tier event streams with a stable, narrow lateness distribution (clickstream, server logs).
Punctuated: financial market data with end-of-session markers; advertising auctions with end-of-window beacons.
Periodic with idle detection: IoT telemetry where some sensors go quiet for hours but the pipeline must keep producing output.
Per-key watermarks: a tempting trap; correctness wins per key, but global aggregations stall on the slowest key.
The Idle Source Problem
A stream from a single Kafka partition that goes quiet does not advance the watermark. If a downstream aggregation depends on this partition, it stalls until the partition produces an event. In a multi-source pipeline, the global watermark is the minimum across sources, so one idle source freezes the entire output. The fix is idle-detection: after a configurable timeout with no events, the engine treats the partition as caught up and advances its watermark on a wall-clock heartbeat.
1
# Flink watermark strategy with idle detection on a multi-source stream
Engine trusts producer to declare segment boundaries
Late tail caught precisely when producer says session is done
Tunable only by changing producer logic
Watermark profiling checklist before choosing a strategy:
▸Measure lateness percentiles: 50th, 95th, 99th, 99.9th from production data
▸Identify idle-period distribution: how often do partitions go quiet, for how long
▸Check producer capability: can it emit explicit markers, or is the engine on its own
▸Map global vs per-key correctness needs: where does aggregation happen
Per-key watermarks deserve a closer look because they are the most appealing-but-dangerous strategy in the catalog. The appeal is obvious: each user, device, or partition has its own progress, so an idle key does not stall an active one. The danger is that any global aggregation must take the minimum watermark across all keys, which means the slowest key controls the whole output. In a system with millions of keys and a long tail of stragglers, the global watermark stalls almost always. The strategy works in practice only when aggregations are scoped per key (no global view) or when the engine supports per-key emission with key-scoped watermarks downstream. Reaching for per-key watermarks without understanding this constraint is a common cause of streaming pipelines that mysteriously stop emitting output entirely.
Idle detection has its own subtle failure mode. A genuine slow producer that emits one event per hour looks identical to an idle producer that has stopped entirely. If the idle timeout is set to thirty minutes, the slow producer's event arrives after the engine has already advanced the watermark and closed the relevant window. The slow event is now late by definition, even though it was on time from the producer's perspective. The fix is to tune the idle timeout against the slowest legitimate producer's interval, not against the median. This calibration is often the difference between a streaming pipeline that handles real-world traffic and one that quietly drops slow producers' contributions.
The Apache Beam programming model abstracts watermark generation behind the WatermarkEstimator interface. Custom estimators are common in production: a financial firm may write an estimator that combines bounded out-of-orderness with punctuated markers from the trading session calendar, advancing aggressively when a session is known to be closed and conservatively otherwise. The cost of the custom estimator is justified when the source has a non-default lateness profile.
TIP
Watermark strategy is a pipeline-design decision, not a deploy-time tweak. Profile the source for at least a week of production data before locking in numbers. The lateness distribution rarely matches the team's intuition.
Allowed Lateness Versus Accuracy
Daily Life
Interviews
Quantify the tradeoff between allowed-lateness budget, engine state cost, and streaming-output coverage, and choose a percentile target accordingly.
Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary.
The Tradeoff Curve
Allowed Lateness
Coverage
State Cost
Output Stability
0 seconds
Drops everything past watermark
Minimum: state released at window close
Maximum: window result is final on first emit
5 minutes
Catches network-level lateness
Low: state held briefly
High: late re-fires are rare
1 hour
Catches most mobile-tail lateness
Moderate: state size grows linearly with window count
Moderate: late re-fires happen but are bounded
24 hours
Catches multi-hour producer outages
High: 1440 minutes of windows held simultaneously
Low: results may keep changing all day
Unbounded
Catches every late event ever
Catastrophic: state grows without bound
None: no result is ever final
The right point depends on what the consumer does with the output. A real-time pricing display can tolerate occasional updates within an hour but not after the trading session closes. An audit-grade ledger needs eventual completeness, but the streaming pipeline does not have to deliver it; a separate reconciliation job can. A monitoring metric needs none of this; processing time is fine.
State Cost Math
1
Approximatestatecostforaccumulating-modewindows:state_size~(allowed_lateness/window_size)*per_window_stateExample:window_size=5minutesallowed_lateness=60minutesper_window_state=8KBperactiveKEY(aggregationtotals+metadata)activekeys=50,000windowsheld=60/5=12simultaneouslystateperKEY=12*8KB=96KBtotalstate=50,000*96KB=~4.8GBDoublingallowedlatenessto120minutesdoublesthestateto9.6GB.TriplingKEYCARDINALITYto150,000bringsstateto~14.4GBAT60minutesOR~28.8GBAT120minutes.StatelivesINtheengine's RocksDB-style store. Spillover to disk is
possible but slow; sustained checkpointing of large state is its own
bottleneck.'
Coverage Math
Lateness is a long-tail distribution in nearly every real source. The 95th percentile may be a few seconds; the 99th may be a few minutes; the 99.9th may be hours; the worst case may be days. Allowed lateness draws a line on this distribution. Events on the left side of the line are caught by the streaming pipeline. Events on the right side need a different mechanism. The decision is not 'how late can events be' but 'where should the streaming pipeline stop, given that the rest will be handled elsewhere.'
1
/* Compute coverage at a sample lateness budget */
2
WITHsample_lateness_secondsAS(
3
SELECT
4
0ASevt
5
6
UNIONALL
7
8
SELECT
9
2
10
11
UNIONALL
12
13
SELECT
14
5
15
16
UNIONALL
17
18
SELECT
19
12
20
21
UNIONALL
22
23
SELECT
24
30
25
26
UNIONALL
27
28
SELECT
29
60
30
31
UNIONALL
32
33
SELECT
34
240
35
36
UNIONALL
37
38
SELECT
39
900
40
41
UNIONALL
42
43
SELECT
44
3600
45
46
UNIONALL
47
48
SELECT
49
14400
50
51
UNIONALL
52
53
SELECT
54
86400
55
),
56
budgetsAS(
57
SELECT
58
60ASbudget_seconds
59
60
UNIONALL
61
62
SELECT
63
300
64
65
UNIONALL
66
67
SELECT
68
3600
69
70
UNIONALL
71
72
SELECT
73
86400
74
)
75
76
SELECT
77
budget_seconds,
78
COUNT(*)FILTER(
79
WHEREevt<=budget_seconds
80
)AScaught,
81
COUNT(*)AStotal,
82
ROUND(
83
100*COUNT(*)FILTER(
84
WHEREevt<=budget_seconds
85
)/COUNT(*),
86
1
87
)ASpct_covered
88
FROMsample_lateness_seconds
89
CROSSJOINbudgets
90
GROUPBYbudget_seconds
91
ORDERBYbudget_seconds
On this synthetic distribution, even a 1-hour budget catches only 82 percent of events. A 24-hour budget catches all of them, at proportionally higher state cost. The numbers in production look similar: the long tail is heavy enough that no reasonable streaming budget gets to 100 percent. The streaming pipeline gets close; the reconciliation pass gets the last percent.
How to choose the lateness budget:
▸Profile production lateness for at least one full week of representative traffic
▸Pick a percentile target (95th for most consumer dashboards, 99th for important reports)
▸Compute the state cost at that percentile and verify it fits the engine budget
▸Plan the reconciliation job for everything past the budget; this is not optional
Tight BudgetModerate BudgetWide Budget
Tight Budget
Zero to five minutes
Minimum engine state, near-final results on first emit. Drops every event beyond the budget; relies on a separate reconciliation pass for completeness.
Moderate Budget
Thirty to ninety minutes
Catches the bulk of mobile retry tail and brief producer outages. Engine state grows linearly with budget; downstream sinks must support upserts on the window key.
Wide Budget
Hours to a day
Approaches eventual completeness for most workloads. State and checkpoint costs become significant; recovery from outages slows. Rarely the right answer when reconciliation is available.
•Stretch Lateness Budget
More events caught by streaming
Engine state grows linearly with budget
Output keeps updating; consumers need idempotent sinks
Window completeness still bounded by budget
✓Reconcile in Batch
Streaming budget stays small; engine cost stable
Late events handled by a separate batch job
Streaming output is approximate; batch output is canonical
Eventual completeness is unbounded; reconciliation can scan any history
Allowed lateness interacts with engine checkpointing in ways that often surprise teams new to streaming at scale. Every active window holds state that must be checkpointed periodically for fault tolerance. Larger lateness budgets mean more concurrent active windows, which mean larger checkpoints, which mean longer checkpoint durations, which mean longer recovery times when the engine restarts. A pipeline configured with a 24-hour lateness budget can take tens of minutes to restart from a checkpoint at high cardinality, during which the streaming output is stale. The recovery time is itself a kind of lateness; widening the lateness budget to chase correctness can make the engine itself slower to recover from outages, which produces a different kind of late data.
Stretching the lateness budget to chase the last percent of correctness is the most common antipattern in streaming design. The cost grows linearly while coverage approaches an asymptote. The right answer is to accept the asymptote and pair the streaming pipeline with a reconciliation job.
✓Do
Pick a percentile target tied to consumer requirements, not to engine capacity
Treat the streaming output as approximate when the budget is below the 99.9th percentile
Capacity-plan engine state at the chosen budget, including 2x headroom for traffic spikes
✗Don't
Stretch lateness past the point where state cost dominates engine compute
Assume the streaming pipeline alone produces audit-grade output
Tune budgets without measuring the source's actual distribution
TIP
When a downstream consumer demands eventual completeness, deliver it with batch reconciliation, not by widening the streaming window. Reconciliation is bounded by storage, not by engine memory. Streaming budget is bounded by engine memory. The two have different scaling limits, and the right design uses each for what it is good at.
Reconciliation Passes
Daily Life
Interviews
Design a reconciliation pass that complements a streaming pipeline with audit-grade correctness and produces an explicit delta metric.
A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail late events are settled before it begins.
What a Reconciliation Pass Does
Step
Action
Wait
Hold off until the lateness window for the period is fully past (e.g., next-day at 2am)
Read source of truth
Pull from the system that owns the data (operational DB, settlement system, raw lake)
Re-aggregate
Compute the same aggregation the streaming pipeline computes, over the closed window
Compare
Diff against the streaming output for the period; record the delta as a metric
Overwrite
Replace the streaming output for that period atomically (partition overwrite or MERGE)
Audit
Persist the reconciliation log: what changed, by how much, when, and why
The pass is straightforward in concept. The complexity is in the contract with the streaming pipeline. The streaming pipeline must be designed to be overwritten without breaking real-time consumers. The reconciliation must be designed to write atomically so consumers do not see partial state. The two pipelines must agree on the partition keys and time windows so the overwrite operation is well-defined.
The streaming pipeline drops events past allowed lateness. Without reconciliation, those events are silently missing from the answer.
Schema policy in CDC may quarantine events. Reconciliation against the source-of-truth captures them once the schema decision is made.
Reconciliation produces an explicit delta metric. The delta is the streaming pipeline's accuracy, measured rather than assumed.
Reconciliation closes the audit loop: 'the streaming output for any closed period agrees with the source of truth' is a property the team can prove, not hope for.
Cadence and Window
Cadence
Window
Use Case
Hourly
Last hour, run with 1-hour delay
Operational metrics where low-latency reconciliation matters
Daily
Yesterday, run at 2am next day
Most reporting workloads; aligns with downstream batch SLAs
Weekly
Prior week, run Monday morning
Audit-grade reporting where weekly close is the contract
Monthly
Prior month, run on day 1 of next month
Regulatory filings; revenue recognition; period-close systems
Cadence aligns with the consumer's reporting horizon. A daily revenue dashboard reconciles overnight. A regulatory filing reconciles monthly. The window must be long enough that the source of truth has settled, which usually means the reconciliation runs at least one full lateness budget after the streaming output for that period closes.
Design constraints for the reconciliation pass:
▸Idempotent: re-running must produce the same answer; partition overwrite is the simplest form
▸Atomic: consumers never observe partial state during the overwrite
▸Observable: every run produces a delta metric, persisted in an audit log
▸Bounded compute: the pass scans only the window it reconciles, never full history
▸Coordinated with streaming: the streaming pipeline must tolerate being overwritten
1
/* Compute the reconciliation delta for a sample period */
Audit-grade consumers read from the same target table
The reconciliation log is one of the most important operational artifacts in any streaming system. Its trend over time tells the team whether the streaming pipeline is degrading, the lateness profile is shifting, or upstream behavior has changed.
✓Do
Run reconciliation at the cadence the audit consumer needs, not at the cadence the streaming team prefers
Persist the delta metric with rich dimensions (time, key, source) so investigations can root-cause
Test that consumers can read while the overwrite is in flight; transactional table formats matter here
✗Don't
Skip the audit log; without it, the team cannot prove what was reconciled when
Reconcile inside the streaming engine; batch tools are the right home for full-window scans
Let the streaming pipeline write outside its allowed-lateness window; the contract relies on that boundary
TIP
When designing the streaming pipeline and the reconciliation pass together, decide which one owns each minute of history. The streaming pipeline owns the open period; the reconciliation owns closed periods. The handoff is the moment the lateness window expires.
Late Trade Settlements
Daily Life
Interviews
Compose schema policy, watermark strategy, allowed lateness, and reconciliation into a system that satisfies real-time and audit-grade consumers at the same time.
A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both consumer needs.
The streaming pipeline serves the algorithmic desks during market hours. The reconciliation pass overwrites yesterday's bars overnight, after the settlement system has finalized. The minute_bars table has two writers; the contract is that the daily reconciliation is canonical for any closed day, and the streaming pipeline owns the current day.
What Each Layer Handles
Concern
Streaming Pipeline
Reconciliation Pass
Real-time consumer SLA
Sub-minute freshness
Not its job
Audit-grade correctness
Within 1 to 2 percent
Within audit tolerance (basis points)
Schema drift
Allow-additive policy applied via Iceberg evolve_schema
FROMTuesdayremainsAT1.2millionINstreaming.Reconciliation:runsWednesdayAT02:00.Readscanonicalsettlementsystem,findsthelatefill.Re-aggregatesTuesday14:23minutebarto1.247millionnotional.INSERTOVERWRITEreplacesthestreamingvalueforthatminute_bucket.Delta=47,000notional,loggedINreconciliationauditTABLE.Algorithmicdesk:seesthecorrectedvalueWHENqueryingTuesday's history Wednesday morning.'
The two cases capture the structural property of the design. Cases inside the streaming budget are handled by the streaming pipeline alone. Cases outside the budget are handled by the reconciliation pass. No event is silently lost. Every late event is either caught by the streaming window or surfaced in the next reconciliation run, and the delta between the two outputs is logged as a metric.
The Schema Dimension
The trade fill event payload changes too. The settlement system added a tax_jurisdiction field two months ago and renamed clearing_venue to clearing_facility a month before that. The CDC pipeline applied an allow-additive policy: the tax_jurisdiction addition flowed through automatically and downstream Iceberg tables grew the new column. The clearing_facility rename was quarantined; the on-call engineer ran a coordinated rename via expand-contract over four weeks, with both columns dual-written in the settlement system before the streaming pipeline cut over. The history of those changes lives in the schema-history topic and the reconciliation log.
How the prior lessons compose into this system:
▸Lesson 2 streaming/batch: the streaming pipeline serves the live desk, the batch reconciliation serves the audit consumer; both freshness tiers coexist
▸Lesson 3 storage layer: Iceberg's atomic partition overwrite is what makes the daily reconciliation safe to deploy
▸Lesson 5 idempotency: partition overwrite is the idempotent write pattern that lets reconciliation re-run without harm
▸Lesson 7 contracts: the data contract between settlement system and platform names the schema, freshness, and reconciliation cadence
▸This lesson: schema policy applied to CDC, watermark plus allowed lateness applied to streaming, reconciliation applied to close the gap
1
/* Compute end-of-day reconciliation delta for the example */
2
WITHstreamingAS(
3
SELECT
4
TIMESTAMP'2026-04-21 14:23:00'ASbucket,
5
'GS'ASsymbol,
6
1200000ASnotional
7
8
UNIONALL
9
10
SELECT
11
TIMESTAMP'2026-04-21 14:24:00',
12
'GS',
13
1180000
14
15
UNIONALL
16
17
SELECT
18
TIMESTAMP'2026-04-21 14:25:00',
19
'GS',
20
1210000
21
),
22
settledAS(
23
SELECT
24
TIMESTAMP'2026-04-21 14:23:00'ASbucket,
25
'GS'ASsymbol,
26
1247000ASnotional
27
28
UNIONALL
29
30
SELECT
31
TIMESTAMP'2026-04-21 14:24:00',
32
'GS',
33
1180000
34
35
UNIONALL
36
37
SELECT
38
TIMESTAMP'2026-04-21 14:25:00',
39
'GS',
40
1210000
41
)
42
43
SELECT
44
s.bucket,
45
s.symbol,
46
s.notionalASstreaming,
47
c.notionalAScanonical,
48
c.notional-s.notionalASdelta,
49
ROUND(
50
100*(
51
c.notional-s.notional
52
)/NULLIF(c.notional,0),
53
4
54
)ASdelta_bps
55
FROMstreamingASs
56
INNERJOINsettledAScUSING(bucket,symbol)
57
ORDERBYs.bucket
The delta of 3.77 basis points on the 14:23 bucket is the audit signal. Most days it is zero. Days where the settlement system books a meaningfully late trade produce a non-zero delta. The reconciliation log accumulates these deltas, and a separate operational dashboard tracks the distribution. A persistent shift in the delta distribution is the leading indicator that something has changed: a producer has slowed down, a clearing partner has begun batching, or the streaming pipeline's lateness budget needs revisiting.
The composite design is the senior-engineer answer to the late-data question at scale. Streaming and batch are not alternatives; they are layers, each chosen for what it is good at, with an explicit handoff at the lateness boundary.
✓Do
Make the streaming pipeline and reconciliation pass write to the same target with a clear ownership rule per period
Track the reconciliation delta as a first-class operational metric with alerting thresholds
Document the lateness boundary as part of the data contract: 'streaming for current day, reconciliation for closed days'
✗Don't
Try to make streaming alone audit-grade; the cost asymptote is unfavorable
Run reconciliation without an audit log; the delta is the entire signal
Allow the two pipelines to disagree on partition keys; the overwrite must be a clean replacement
TIP
Bridge move: senior engineers describe this design as 'streaming for the current open period, reconciliation for closed periods, with the lateness budget as the boundary.' That single sentence captures the architecture and is the most concise way to state the principle when the conversation turns from theory to a real system.
❯❯❯PUTTING IT ALL TOGETHER
> A staff data engineer joins a payments fintech that runs a real-time dashboard for merchant settlement amounts plus a regulatory reporting system that files daily totals. The streaming pipeline ingests payment events from a Kafka topic fed by Debezium CDC against the operational Postgres. Three issues are reported the first week. First, the operational team renamed a column upstream and the streaming aggregation produced NULL revenue for two hours. Second, the daily regulatory file consistently disagrees with the streaming dashboard by 0.4 to 1.1 percent. Third, the engineering manager asks for a single design document that prevents recurrence. The design must compose schema policy, streaming guarantees, and a reconciliation pass.
The two-hour NULL revenue incident was a CDC schema policy failure. The rename should have been quarantined under the policies named in this lesson. The fix is to flip the schema-change handling in the Debezium connector to allow-additive plus quarantine-breaking, and to plan upstream renames as expand-contract migrations from the intermediate tier.
The 0.4 to 1.1 percent gap is the streaming pipeline's allowed-lateness tail combined with the reconciliation gap. The streaming pipeline at the firm runs with a 30-minute allowed lateness; events past that are dropped. A daily reconciliation pass that re-aggregates from the canonical settlement source closes the gap and overwrites the regulatory-grade output for closed days.
The reconciliation pass is built on the storage-layer foundations from lesson 3 (Iceberg atomic partition overwrite), the idempotency property from lesson 5 (overwrite is idempotent), and the data-contract framing from lesson 7 (the streaming and batch pipelines are co-owners of the regulatory output, with named ownership per partition).
The composite design from this lesson, plus the streaming concepts from lesson 2 and the storage evolution concepts from lesson 3, is the answer. Streaming serves real-time consumers within an explicit accuracy budget. Reconciliation serves audit-grade consumers from canonical sources. The schema registry plus CDC enforcement policy keeps both pipelines from being broken by upstream change. The reconciliation log is the metric that confirms the system is operating within its accuracy budget continuously.
KEY TAKEAWAYS
CDC schema enforcement is a policy decision, not a default: allow-additive flows safe changes through, block-breaking halts on destructive changes, quarantine routes incompatible rows for human review.
Watermark strategy follows the source's profile: bounded out-of-orderness, ascending, punctuated, and idle-aware variants each fit specific lateness and idleness shapes.
Allowed lateness trades engine state for streaming coverage: the curve has an asymptote near 100 percent that cannot be crossed by widening the streaming window alone.
Reconciliation passes close the gap streaming structurally cannot close: a periodic batch job that overwrites the streaming output with canonical values for closed periods, with the delta logged as an operational metric.
The composite design is the production answer: streaming for the current open period, reconciliation for closed periods, the lateness budget as the boundary, and the delta metric as the proof of accuracy.
At scale, schema policy and lateness budget become first-class design choices, not afterthoughts
Category
Pipeline Architecture
Difficulty
advanced
Duration
40 minutes
Challenges
0 hands-on challenges
Topics covered: Schema Enforcement in CDC Pipelines, Watermarks for Irregular Streams, Allowed Lateness Versus Accuracy, Reconciliation Passes, Late Trade Settlements
Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker. What Mak
The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default. The Strategy Catal
Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary. The Tradeoff Curve The righ
A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail
A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both