Schema Evolution and Late Data: Advanced

A capital-markets data platform at a top-five investment bank ran a streaming pipeline that aggregated trade fills into per-symbol minute bars used by the firm's algorithmic desks. The streaming engine ran with a 30-second watermark and 5-minute allowed lateness. Trades that settled across the international date line, however, sometimes booked back to T-1 a full business day later. Those trades were dropped by the streaming engine and silently missing from the bars consumed by the algos. The discrepancy was caught only when a regulator's audit reconciled trade blotters to position reports and found a 0.03 percent gap. The fix was a nightly reconciliation pass: a batch job that re-aggregated the prior day's trades from the canonical settlement system and overwrote the corresponding minute bars in the streaming sink. The streaming pipeline serves real-time consumers; the batch pass serves the audit-grade truth. Both run, both write to the same table, and the contract is that the latest writer wins per minute bar. This lesson covers the production tradeoffs that drive these designs: how schemas evolve in CDC pipelines where DDL upstream becomes data downstream, how watermarks behave on irregular event streams, how lateness budgets trade memory for correctness, and how reconciliation jobs close the gap that streaming engines structurally cannot close.

Schema Enforcement in CDC Pipelines

Daily Life
Interviews

Apply additive, block-breaking, or quarantine policies to a CDC pipeline and reason about the operational tradeoffs of each.

Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker.

What Makes CDC Schema-Fragile

PropertyImplication for Schema
DDL is dataEach schema change becomes an event in the change stream that downstream must process
Producer cannot pre-validateThe DBA does not consult a registry before running ALTER TABLE; the change is already in the WAL
Mixed compatibility shapesOld rows in the stream still carry the old shape; new rows carry the new shape
Cascading downstreamLakehouse tables, search indexes, ML feature pipelines all read from the same change stream
Operational urgencyProduction DBs change fast; CDC pipelines must absorb or block within minutes
Debezium, AWS DMS, Maxwell, and similar tools all surface schema-change events as part of the stream. Each tool encodes the change differently, but the pattern is the same: a record arrives in the topic indicating that the upstream table now has a new column, a renamed column, or a dropped column. Downstream consumers see the schema-change event before they see the next data event under the new shape. The window of time between the two events is the policy decision space.

The Three Enforcement Policies

Allow AdditiveBlock BreakingQuarantine Incompatible
Allow Additive
The pragmatic default
New columns flow through. Drops and renames are blocked or quarantined. The CDC pipeline keeps running for the additive case, which is more than 80 percent of real-world changes.
Block Breaking
The strict policy
Any non-additive change halts the pipeline. An on-call engineer is paged to evaluate before any data flows under the new schema. Used for audit-grade systems where silent drift is unacceptable.
Quarantine Incompatible
The middle ground
Incompatible rows are routed to a quarantine topic. The main pipeline continues with the additive subset. A human reviews quarantine before merging. Common in lakehouse ingestion.
Policy choice depends on consumer tolerance. A search index ingestion pipeline can run under allow-additive: a missing column degrades search relevance gracefully. A regulatory reporting pipeline cannot: a missing column corrupts a filing. A lakehouse ingestion pipeline often runs under quarantine: incompatible rows are not lost, but they do not flow into the canonical curated layer until the schema decision is made.

Implementation in Debezium and Iceberg

1// Debezium connector configuration with schema-change topic enabled
2{
3 "name": "payments-cdc",
4 "config": {
5 "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
6 "database.hostname": "prod-db.example.com",
7 "database.dbname": "payments",
8 "plugin.name": "pgoutput",
9 "slot.name": "debezium_slot",
10 "publication.name": "debezium_pub",
11 "include.schema.changes": true,
12 "schema.history.internal.kafka.topic": "_schema_history.payments",
13 "key.converter.schemas.enable": true,
14 "value.converter": "io.confluent.connect.avro.AvroConverter",
15 "value.converter.schema.registry.url": "http://registry:8081",
16 "value.converter.schema.compatibility.level": "BACKWARD"
17 }
18}
19
20// The schema registry's BACKWARD setting blocks producer-side incompatible changes.
21// Schema-change events flow through the schema-change topic for downstream consumers.
1# Iceberg target with allow-additive policy on the sink side
2from pyiceberg.catalog import load_catalog
3
4catalog = load_catalog('prod')
5table = catalog.load_table('payments.transactions')
6
7def apply_cdc_event(event):
8 if event['op'] == 'SCHEMA_CHANGE':
9 change = event['payload']
10 if change['kind'] == 'ADD_COLUMN':
11 with table.update_schema() as update:
12 update.add_column(change['name'], change['type'])
13 elif change['kind'] in ('DROP_COLUMN', 'RENAME_COLUMN', 'TYPE_CHANGE'):
14 # Block: route the entire change-data event to quarantine
15 quarantine.put(event)
16 page_oncall(
17 f"Incompatible CDC change for transactions: {change['kind']}"
18 )
19 raise CdcPolicyViolation('non-additive change blocked')
20 else:
21 quarantine.put(event)
22 else:
23 table.append(event['payload'])
What enforcement policies must specify:
  • Which DDL classes are allowed without a human in the loop
  • Where blocked events go (DLQ, schema-quarantine topic, halt-and-page)
  • Who owns the on-call response and within what SLA
  • How the policy interacts with the schema registry's compatibility setting
No Schema Policy
  • Every DDL change is a potential outage
  • Drift is silent until consumer dashboards break
  • On-call response is improvised under pressure
  • Recovery requires reconstructing the change history from logs
Explicit Policy
  • Additive changes flow through automatically; the team handles only the rare destructive case
  • Drift is impossible: the policy either accepts, blocks, or quarantines explicitly
  • On-call response is rehearsed: the policy names the action
  • Schema-change events are an audit trail by construction
1/* Classify a sample stream of CDC schema-change events */
2WITH events AS (
3 SELECT
4 1 AS evt_id,
5 'transactions' AS tbl,
6 'ADD_COLUMN' AS kind,
7 'merchant_country' AS field
8
9 UNION ALL
10
11 SELECT
12 2,
13 'transactions',
14 'ADD_COLUMN',
15 'risk_score'
16
17 UNION ALL
18
19 SELECT
20 3,
21 'transactions',
22 'RENAME_COLUMN',
23 'amount_cents -> amount_minor'
24
25 UNION ALL
26
27 SELECT
28 4,
29 'transactions',
30 'DROP_COLUMN',
31 'legacy_flag'
32
33 UNION ALL
34
35 SELECT
36 5,
37 'transactions',
38 'TYPE_CHANGE',
39 'amount: int -> bigint'
40)
41
42SELECT
43 evt_id,
44 tbl,
45 kind,
46 field,
47 CASE
48 WHEN kind = 'ADD_COLUMN' THEN 'apply'
49 WHEN kind = 'TYPE_CHANGE'
50 AND field LIKE '%int -> bigint%' THEN 'apply_widening'
51 WHEN kind IN (
52 'RENAME_COLUMN',
53 'DROP_COLUMN'
54 ) THEN 'quarantine_and_page'
55 ELSE 'quarantine'
56 END AS policy_action
57FROM events
58ORDER BY evt_id

The most expensive CDC outages happen when the upstream DBA assumes the data team will absorb a destructive change overnight. With a quarantine policy, the change does not absorb silently. It surfaces as a halted topic and a page within minutes.

The cost-benefit calculation for each policy depends on how often destructive changes happen and what the consumer side can tolerate. A platform that ingests from twenty operational databases owned by ten different teams will see a destructive change every few weeks. With allow-additive plus quarantine, the operations cost is bounded: the on-call engineer reviews the quarantine queue, validates that the change was intentional, and either applies it manually or works with the upstream team to revert. With block-breaking, the same change halts ingestion entirely, which protects audit consumers but disrupts every consumer downstream. With no policy at all, the change flows through and silently corrupts downstream state. The right choice depends on which kind of pain the team can absorb most cheaply, and that answer is rarely the same across all subjects in the platform.
The schema-history topic deserves its own attention. Debezium emits a parallel topic that records every DDL statement observed on the source database, in order. The reason is that to replay the change stream from any historical offset, the engine must know what schema each record was written under. The history topic is what makes replay deterministic. A team that disables schema history saves a small amount of broker storage and gives up the ability to reprocess history correctly; the trade is rarely worth it. The history topic is the same idea as a schema registry, applied at the connector level rather than the topic level.
Do
  • Separate schema-change events from data events in the topic layout
  • Pair the registry's compatibility mode with a downstream sink policy that matches
  • Put a runbook entry next to every quarantine action: who responds, in what SLA, with what tools
Don't
  • Treat ALTER TABLE upstream as a data team problem; it is a contract violation
  • Allow rename and drop to flow through silently; the cost shows up in audit
  • Skip the schema-history topic; without it, replays cannot reconstruct the right shape per offset
TIP
When designing CDC enforcement, write the runbook for the destructive case before writing the connector configuration. The connector is the easy part; the operational response to a blocked change is what the policy actually buys.

Watermarks for Irregular Streams

Daily Life
Interviews

Choose a watermark strategy appropriate to a source's lateness and idleness profile, and reason about the failure modes of each.

The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default.

The Strategy Catalog

StrategyWatermark ComputationFailure Mode If Misapplied
Ascending timestampswatermark = max event_time observedAny out-of-order event is dropped; brittle in real systems
Bounded out-of-ordernesswatermark = max event_time - fixed_lagLong-tail late events past the lag are dropped silently
Punctuatedwatermark advances on explicit marker events from producerStalls forever if producer fails to emit markers
Periodic with idle detectionBounded out-of-orderness plus advance-on-idle when no events arriveIdle-advance can leak past genuine slow producers
Per-key with global aggregationEach key tracks its own watermark; global = min across keysOne slow key can stall every output globally
Each strategy makes a different bet about what the stream looks like. Picking the wrong strategy means either dropping legitimate events or stalling the pipeline. The choice should follow from a measured profile of the source: lateness percentiles, idle-period distribution, presence or absence of explicit markers.

When Each Strategy Wins

check
Ascending timestamps: a single ordered Kafka topic where the producer is the database WAL and offsets imply order.
check
Bounded out-of-orderness: web-tier event streams with a stable, narrow lateness distribution (clickstream, server logs).
check
Punctuated: financial market data with end-of-session markers; advertising auctions with end-of-window beacons.
check
Periodic with idle detection: IoT telemetry where some sensors go quiet for hours but the pipeline must keep producing output.
alert
Per-key watermarks: a tempting trap; correctness wins per key, but global aggregations stall on the slowest key.

The Idle Source Problem

A stream from a single Kafka partition that goes quiet does not advance the watermark. If a downstream aggregation depends on this partition, it stalls until the partition produces an event. In a multi-source pipeline, the global watermark is the minimum across sources, so one idle source freezes the entire output. The fix is idle-detection: after a configurable timeout with no events, the engine treats the partition as caught up and advances its watermark on a wall-clock heartbeat.
1# Flink watermark strategy with idle detection on a multi-source stream
2from pyflink.common import WatermarkStrategy, Duration
3
4strategy = (
5 WatermarkStrategy
6 .for_bounded_out_of_orderness(Duration.of_seconds(30))
7 .with_idleness(Duration.of_minutes(2)) # idle source advances after 2 min
8 .with_timestamp_assigner(EventTimeAssigner())
9)
10
11# Without with_idleness, an idle partition would freeze the global watermark.
12# With it, after 2 minutes of silence the partition is treated as caught up.
13# Risk: a genuinely slow producer is mistaken for an idle one; events arriving
14# after the idle-advance are now late, even if the partition was not really idle.

Punctuated Watermarks in Detail

1Punctuated watermarks : the producer emits explicit watermark events.Stream of trade events plus END - of - session marker : { type = 'trade', symbol = 'AAPL', event_time = 09 : 30 : 01, price =...} { type = 'trade', symbol = 'AAPL', event_time = 09 : 30 : 03, price =...} { type = 'trade', symbol = 'AAPL', event_time = 09 : 30 : 08, price =...} { type = 'session_close', session_end = 16 : 00 : 00 } < - watermark advance The consumer treats the session_close event AS a watermark advance to 16 : 00 : 00. ALL trades WITH event_time <= 16 : 00 : 00 are now accounted for.Windows whose END IS <= 16 : 00 : 00 close immediately
2
3
4 ON this marker.Trade - off : relies
5
6 ON the producer to emit markers reliably.If the producer misses one, the watermark stalls until the next marker arrives.
Bounded Out-of-Orderness
  • Watermark = max_event_time - fixed_lag
  • Engine computes watermark from observed events
  • Late tail past fixed_lag is dropped
  • Tunable on the consumer side
Punctuated
  • Watermark = explicit marker events from producer
  • Engine trusts producer to declare segment boundaries
  • Late tail caught precisely when producer says session is done
  • Tunable only by changing producer logic
Watermark profiling checklist before choosing a strategy:
  • Measure lateness percentiles: 50th, 95th, 99th, 99.9th from production data
  • Identify idle-period distribution: how often do partitions go quiet, for how long
  • Check producer capability: can it emit explicit markers, or is the engine on its own
  • Map global vs per-key correctness needs: where does aggregation happen
Per-key watermarks deserve a closer look because they are the most appealing-but-dangerous strategy in the catalog. The appeal is obvious: each user, device, or partition has its own progress, so an idle key does not stall an active one. The danger is that any global aggregation must take the minimum watermark across all keys, which means the slowest key controls the whole output. In a system with millions of keys and a long tail of stragglers, the global watermark stalls almost always. The strategy works in practice only when aggregations are scoped per key (no global view) or when the engine supports per-key emission with key-scoped watermarks downstream. Reaching for per-key watermarks without understanding this constraint is a common cause of streaming pipelines that mysteriously stop emitting output entirely.
Idle detection has its own subtle failure mode. A genuine slow producer that emits one event per hour looks identical to an idle producer that has stopped entirely. If the idle timeout is set to thirty minutes, the slow producer's event arrives after the engine has already advanced the watermark and closed the relevant window. The slow event is now late by definition, even though it was on time from the producer's perspective. The fix is to tune the idle timeout against the slowest legitimate producer's interval, not against the median. This calibration is often the difference between a streaming pipeline that handles real-world traffic and one that quietly drops slow producers' contributions.

The Apache Beam programming model abstracts watermark generation behind the WatermarkEstimator interface. Custom estimators are common in production: a financial firm may write an estimator that combines bounded out-of-orderness with punctuated markers from the trading session calendar, advancing aggressively when a session is known to be closed and conservatively otherwise. The cost of the custom estimator is justified when the source has a non-default lateness profile.

TIP
Watermark strategy is a pipeline-design decision, not a deploy-time tweak. Profile the source for at least a week of production data before locking in numbers. The lateness distribution rarely matches the team's intuition.

Allowed Lateness Versus Accuracy

Daily Life
Interviews

Quantify the tradeoff between allowed-lateness budget, engine state cost, and streaming-output coverage, and choose a percentile target accordingly.

Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary.

The Tradeoff Curve

Allowed LatenessCoverageState CostOutput Stability
0 secondsDrops everything past watermarkMinimum: state released at window closeMaximum: window result is final on first emit
5 minutesCatches network-level latenessLow: state held brieflyHigh: late re-fires are rare
1 hourCatches most mobile-tail latenessModerate: state size grows linearly with window countModerate: late re-fires happen but are bounded
24 hoursCatches multi-hour producer outagesHigh: 1440 minutes of windows held simultaneouslyLow: results may keep changing all day
UnboundedCatches every late event everCatastrophic: state grows without boundNone: no result is ever final
The right point depends on what the consumer does with the output. A real-time pricing display can tolerate occasional updates within an hour but not after the trading session closes. An audit-grade ledger needs eventual completeness, but the streaming pipeline does not have to deliver it; a separate reconciliation job can. A monitoring metric needs none of this; processing time is fine.

State Cost Math

1Approximate state cost for accumulating - mode windows : state_size ~(allowed_lateness / window_size) * per_window_state Example : window_size = 5 minutes allowed_lateness = 60 minutes per_window_state = 8 KB per active KEY(aggregation totals + metadata) active keys = 50, 000 windows held = 60 / 5 = 12 simultaneously state per KEY = 12 * 8 KB = 96 KB total state = 50, 000 * 96 KB = ~ 4.8 GB Doubling allowed lateness to 120 minutes doubles the state to 9.6 GB.Tripling KEY CARDINALITY to 150, 000 brings state to ~ 14.4 GB AT 60 minutes OR ~ 28.8 GB AT 120 minutes.State lives IN the engine 's RocksDB-style store. Spillover to disk is possible but slow; sustained checkpointing of large state is its own bottleneck.'

Coverage Math

Lateness is a long-tail distribution in nearly every real source. The 95th percentile may be a few seconds; the 99th may be a few minutes; the 99.9th may be hours; the worst case may be days. Allowed lateness draws a line on this distribution. Events on the left side of the line are caught by the streaming pipeline. Events on the right side need a different mechanism. The decision is not 'how late can events be' but 'where should the streaming pipeline stop, given that the rest will be handled elsewhere.'
1/* Compute coverage at a sample lateness budget */
2WITH sample_lateness_seconds AS (
3 SELECT
4 0 AS evt
5
6 UNION ALL
7
8 SELECT
9 2
10
11 UNION ALL
12
13 SELECT
14 5
15
16 UNION ALL
17
18 SELECT
19 12
20
21 UNION ALL
22
23 SELECT
24 30
25
26 UNION ALL
27
28 SELECT
29 60
30
31 UNION ALL
32
33 SELECT
34 240
35
36 UNION ALL
37
38 SELECT
39 900
40
41 UNION ALL
42
43 SELECT
44 3600
45
46 UNION ALL
47
48 SELECT
49 14400
50
51 UNION ALL
52
53 SELECT
54 86400
55),
56budgets AS (
57 SELECT
58 60 AS budget_seconds
59
60 UNION ALL
61
62 SELECT
63 300
64
65 UNION ALL
66
67 SELECT
68 3600
69
70 UNION ALL
71
72 SELECT
73 86400
74)
75
76SELECT
77 budget_seconds,
78 COUNT(*) FILTER (
79 WHERE evt <= budget_seconds
80 ) AS caught,
81 COUNT(*) AS total,
82 ROUND(
83 100 * COUNT(*) FILTER (
84 WHERE evt <= budget_seconds
85 ) / COUNT(*),
86 1
87 ) AS pct_covered
88FROM sample_lateness_seconds
89CROSS JOIN budgets
90GROUP BY budget_seconds
91ORDER BY budget_seconds
On this synthetic distribution, even a 1-hour budget catches only 82 percent of events. A 24-hour budget catches all of them, at proportionally higher state cost. The numbers in production look similar: the long tail is heavy enough that no reasonable streaming budget gets to 100 percent. The streaming pipeline gets close; the reconciliation pass gets the last percent.
How to choose the lateness budget:
  • Profile production lateness for at least one full week of representative traffic
  • Pick a percentile target (95th for most consumer dashboards, 99th for important reports)
  • Compute the state cost at that percentile and verify it fits the engine budget
  • Plan the reconciliation job for everything past the budget; this is not optional
Tight BudgetModerate BudgetWide Budget
Tight Budget
Zero to five minutes
Minimum engine state, near-final results on first emit. Drops every event beyond the budget; relies on a separate reconciliation pass for completeness.
Moderate Budget
Thirty to ninety minutes
Catches the bulk of mobile retry tail and brief producer outages. Engine state grows linearly with budget; downstream sinks must support upserts on the window key.
Wide Budget
Hours to a day
Approaches eventual completeness for most workloads. State and checkpoint costs become significant; recovery from outages slows. Rarely the right answer when reconciliation is available.
Stretch Lateness Budget
  • More events caught by streaming
  • Engine state grows linearly with budget
  • Output keeps updating; consumers need idempotent sinks
  • Window completeness still bounded by budget
Reconcile in Batch
  • Streaming budget stays small; engine cost stable
  • Late events handled by a separate batch job
  • Streaming output is approximate; batch output is canonical
  • Eventual completeness is unbounded; reconciliation can scan any history
Allowed lateness interacts with engine checkpointing in ways that often surprise teams new to streaming at scale. Every active window holds state that must be checkpointed periodically for fault tolerance. Larger lateness budgets mean more concurrent active windows, which mean larger checkpoints, which mean longer checkpoint durations, which mean longer recovery times when the engine restarts. A pipeline configured with a 24-hour lateness budget can take tens of minutes to restart from a checkpoint at high cardinality, during which the streaming output is stale. The recovery time is itself a kind of lateness; widening the lateness budget to chase correctness can make the engine itself slower to recover from outages, which produces a different kind of late data.

Stretching the lateness budget to chase the last percent of correctness is the most common antipattern in streaming design. The cost grows linearly while coverage approaches an asymptote. The right answer is to accept the asymptote and pair the streaming pipeline with a reconciliation job.

Do
  • Pick a percentile target tied to consumer requirements, not to engine capacity
  • Treat the streaming output as approximate when the budget is below the 99.9th percentile
  • Capacity-plan engine state at the chosen budget, including 2x headroom for traffic spikes
Don't
  • Stretch lateness past the point where state cost dominates engine compute
  • Assume the streaming pipeline alone produces audit-grade output
  • Tune budgets without measuring the source's actual distribution
TIP
When a downstream consumer demands eventual completeness, deliver it with batch reconciliation, not by widening the streaming window. Reconciliation is bounded by storage, not by engine memory. Streaming budget is bounded by engine memory. The two have different scaling limits, and the right design uses each for what it is good at.

Reconciliation Passes

Daily Life
Interviews

Design a reconciliation pass that complements a streaming pipeline with audit-grade correctness and produces an explicit delta metric.

A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail late events are settled before it begins.

What a Reconciliation Pass Does

StepAction
WaitHold off until the lateness window for the period is fully past (e.g., next-day at 2am)
Read source of truthPull from the system that owns the data (operational DB, settlement system, raw lake)
Re-aggregateCompute the same aggregation the streaming pipeline computes, over the closed window
CompareDiff against the streaming output for the period; record the delta as a metric
OverwriteReplace the streaming output for that period atomically (partition overwrite or MERGE)
AuditPersist the reconciliation log: what changed, by how much, when, and why
The pass is straightforward in concept. The complexity is in the contract with the streaming pipeline. The streaming pipeline must be designed to be overwritten without breaking real-time consumers. The reconciliation must be designed to write atomically so consumers do not see partial state. The two pipelines must agree on the partition keys and time windows so the overwrite operation is well-defined.

The Atomic Overwrite Pattern

1
2
3
4CREATE OR REPLACE TEMPORARY TABLE recon_yesterday AS
5SELECT
6 DATE_TRUNC('minute', settlement_time) AS minute_bucket,
7 symbol,
8 SUM(quantity * price) AS notional,
9 COUNT(*) AS trade_count
10FROM canonical.settlement_system
11WHERE settlement_time >= CURRENT_DATE - INTERVAL '1 day' AND settlement_time < CURRENT_DATE
12GROUP BY 1, 2 ; INSERT OVERWRITE INTO streaming.minute_bars PARTITION(minute_bucket)
13
14
15SELECT
16 minute_bucket,
17 symbol,
18 notional,
19 trade_count
20FROM recon_yesterday ; INSERT INTO ops.reconciliation_log
21
22
23SELECT
24 CURRENT_TIMESTAMP AS run_at,
25 'minute_bars' AS target,
26 COUNT(*) AS rows_overwritten,
27 SUM(ABS(s.notional - r.notional)) AS delta_notional
28FROM streaming.minute_bars s
29JOIN recon_yesterday r
30 USING(minute_bucket, symbol) ;

Why It Cannot Be Skipped

alert
The streaming pipeline drops events past allowed lateness. Without reconciliation, those events are silently missing from the answer.
alert
Schema policy in CDC may quarantine events. Reconciliation against the source-of-truth captures them once the schema decision is made.
check
Reconciliation produces an explicit delta metric. The delta is the streaming pipeline's accuracy, measured rather than assumed.
check
Reconciliation closes the audit loop: 'the streaming output for any closed period agrees with the source of truth' is a property the team can prove, not hope for.

Cadence and Window

CadenceWindowUse Case
HourlyLast hour, run with 1-hour delayOperational metrics where low-latency reconciliation matters
DailyYesterday, run at 2am next dayMost reporting workloads; aligns with downstream batch SLAs
WeeklyPrior week, run Monday morningAudit-grade reporting where weekly close is the contract
MonthlyPrior month, run on day 1 of next monthRegulatory filings; revenue recognition; period-close systems
Cadence aligns with the consumer's reporting horizon. A daily revenue dashboard reconciles overnight. A regulatory filing reconciles monthly. The window must be long enough that the source of truth has settled, which usually means the reconciliation runs at least one full lateness budget after the streaming output for that period closes.
Design constraints for the reconciliation pass:
  • Idempotent: re-running must produce the same answer; partition overwrite is the simplest form
  • Atomic: consumers never observe partial state during the overwrite
  • Observable: every run produces a delta metric, persisted in an audit log
  • Bounded compute: the pass scans only the window it reconciles, never full history
  • Coordinated with streaming: the streaming pipeline must tolerate being overwritten
1/* Compute the reconciliation delta for a sample period */
2WITH streaming AS (
3 SELECT
4 TIMESTAMP '2026-04-24 09:30:00' AS minute_bucket,
5 'AAPL' AS symbol,
6 100000 AS notional
7
8 UNION ALL
9
10 SELECT
11 TIMESTAMP '2026-04-24 09:30:00',
12 'GOOG',
13 85000
14
15 UNION ALL
16
17 SELECT
18 TIMESTAMP '2026-04-24 09:31:00',
19 'AAPL',
20 72000
21
22 UNION ALL
23
24 SELECT
25 TIMESTAMP '2026-04-24 09:31:00',
26 'GOOG',
27 91000
28),
29recon AS (
30 SELECT
31 TIMESTAMP '2026-04-24 09:30:00' AS minute_bucket,
32 'AAPL' AS symbol,
33 100200 AS notional /* late trade caught */
34
35 UNION ALL
36
37 SELECT
38 TIMESTAMP '2026-04-24 09:30:00',
39 'GOOG',
40 85000
41
42 UNION ALL
43
44 SELECT
45 TIMESTAMP '2026-04-24 09:31:00',
46 'AAPL',
47 72000
48
49 UNION ALL
50
51 SELECT
52 TIMESTAMP '2026-04-24 09:31:00',
53 'GOOG',
54 91450 /* late trade caught */
55)
56
57SELECT
58 s.minute_bucket,
59 s.symbol,
60 s.notional AS streaming_notional,
61 r.notional AS canonical_notional,
62 r.notional - s.notional AS delta
63FROM streaming AS s
64INNER JOIN recon AS r USING (minute_bucket, symbol)
65ORDER BY s.minute_bucket, s.symbol
Streaming Only
  • Real-time but approximate output
  • Late tail silently dropped past the budget
  • No measurable accuracy guarantee
  • Audit-grade consumers must build their own truth
Streaming + Reconciliation
  • Real-time output for live consumers
  • Audit-grade output for closed periods
  • Delta metric quantifies streaming accuracy continuously
  • Audit-grade consumers read from the same target table

The reconciliation log is one of the most important operational artifacts in any streaming system. Its trend over time tells the team whether the streaming pipeline is degrading, the lateness profile is shifting, or upstream behavior has changed.

Do
  • Run reconciliation at the cadence the audit consumer needs, not at the cadence the streaming team prefers
  • Persist the delta metric with rich dimensions (time, key, source) so investigations can root-cause
  • Test that consumers can read while the overwrite is in flight; transactional table formats matter here
Don't
  • Skip the audit log; without it, the team cannot prove what was reconciled when
  • Reconcile inside the streaming engine; batch tools are the right home for full-window scans
  • Let the streaming pipeline write outside its allowed-lateness window; the contract relies on that boundary
TIP
When designing the streaming pipeline and the reconciliation pass together, decide which one owns each minute of history. The streaming pipeline owns the open period; the reconciliation owns closed periods. The handoff is the moment the lateness window expires.

Late Trade Settlements

Daily Life
Interviews

Compose schema policy, watermark strategy, allowed lateness, and reconciliation into a system that satisfies real-time and audit-grade consumers at the same time.

A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both consumer needs.

The Architecture

1Trade fill events Kafka | | CDC SCHEMA policy : allow - additive, quarantine breaking v Schema-validated trade stream | | bounded out - of - orderness, watermark = MAX - 30 s | 5 - minute event - TIME tumbling windows | allowed lateness = 60 minutes, accumulating mode v [ Streaming aggregate : minute_bars TABLE, partitioned BY minute_bucket ] ^ | INSERT OVERWRITE for yesterday 's partitions, atomically | [Daily reconciliation pass: read settlement system, re-aggregate] ^ | [Settlement system: source of truth, produced T+1] '
The streaming pipeline serves the algorithmic desks during market hours. The reconciliation pass overwrites yesterday's bars overnight, after the settlement system has finalized. The minute_bars table has two writers; the contract is that the daily reconciliation is canonical for any closed day, and the streaming pipeline owns the current day.

What Each Layer Handles

ConcernStreaming PipelineReconciliation Pass
Real-time consumer SLASub-minute freshnessNot its job
Audit-grade correctnessWithin 1 to 2 percentWithin audit tolerance (basis points)
Schema driftAllow-additive policy applied via Iceberg evolve_schemaReads canonical settlement system schema directly
Late trade fillsCaught up to 60 minutes late, dropped past thatCaught at any lateness up to T+1
Output stabilityMay re-fire within the lateness windowWrites once per day, atomic partition overwrite
Compute costContinuous, bounded engine stateBounded scan of yesterday's settlement window

The Two Failure Cases the Design Handles

1CASE A : A trade fill settles 14 minutes late.Event TIME : 14 : 23 : 08 Arrival TIME : 14 : 37 : 12 Lateness : 14 minutes 4 seconds Streaming pipeline : WITHIN 60 - minute allowed lateness.The minute_bars ROW for 14 : 23 was already emitted WITH 1.2 million notional.Late event admitted, WINDOW re - fires WITH updated notional 1.247 million.Algorithmic desk reads the updated value WITHIN seconds.Reconciliation : confirms the streaming value the next morning.Delta = 0. CASE B : A
2
3
4
5
6CROSS - border trade fill settles next - day AT T + 1. Event TIME : 14 : 23 : 08
7
8 ON Tuesday Arrival TIME : 09 : 14 : 00
9 ON Wednesday(settled overnight after FX confirmation) Lateness : 18 hours 50 minutes Streaming pipeline : past 60 - minute allowed lateness.Event dropped
10
11FROM streaming output.Routed to a late - event DLQ for visibility.The 14 : 23 minute_bar
12FROM Tuesday remains AT 1.2 million IN streaming.Reconciliation : runs Wednesday AT 02 : 00. Reads canonical settlement system, finds the late fill.Re - aggregates Tuesday 14 : 23 minute bar to 1.247 million notional.INSERT OVERWRITE replaces the streaming value for that minute_bucket.Delta = 47, 000 notional, logged IN reconciliation audit TABLE.Algorithmic desk : sees the corrected value WHEN querying Tuesday 's history Wednesday morning.'
The two cases capture the structural property of the design. Cases inside the streaming budget are handled by the streaming pipeline alone. Cases outside the budget are handled by the reconciliation pass. No event is silently lost. Every late event is either caught by the streaming window or surfaced in the next reconciliation run, and the delta between the two outputs is logged as a metric.

The Schema Dimension

The trade fill event payload changes too. The settlement system added a tax_jurisdiction field two months ago and renamed clearing_venue to clearing_facility a month before that. The CDC pipeline applied an allow-additive policy: the tax_jurisdiction addition flowed through automatically and downstream Iceberg tables grew the new column. The clearing_facility rename was quarantined; the on-call engineer ran a coordinated rename via expand-contract over four weeks, with both columns dual-written in the settlement system before the streaming pipeline cut over. The history of those changes lives in the schema-history topic and the reconciliation log.
How the prior lessons compose into this system:
  • Lesson 2 streaming/batch: the streaming pipeline serves the live desk, the batch reconciliation serves the audit consumer; both freshness tiers coexist
  • Lesson 3 storage layer: Iceberg's atomic partition overwrite is what makes the daily reconciliation safe to deploy
  • Lesson 5 idempotency: partition overwrite is the idempotent write pattern that lets reconciliation re-run without harm
  • Lesson 7 contracts: the data contract between settlement system and platform names the schema, freshness, and reconciliation cadence
  • This lesson: schema policy applied to CDC, watermark plus allowed lateness applied to streaming, reconciliation applied to close the gap
1/* Compute end-of-day reconciliation delta for the example */
2WITH streaming AS (
3 SELECT
4 TIMESTAMP '2026-04-21 14:23:00' AS bucket,
5 'GS' AS symbol,
6 1200000 AS notional
7
8 UNION ALL
9
10 SELECT
11 TIMESTAMP '2026-04-21 14:24:00',
12 'GS',
13 1180000
14
15 UNION ALL
16
17 SELECT
18 TIMESTAMP '2026-04-21 14:25:00',
19 'GS',
20 1210000
21),
22settled AS (
23 SELECT
24 TIMESTAMP '2026-04-21 14:23:00' AS bucket,
25 'GS' AS symbol,
26 1247000 AS notional
27
28 UNION ALL
29
30 SELECT
31 TIMESTAMP '2026-04-21 14:24:00',
32 'GS',
33 1180000
34
35 UNION ALL
36
37 SELECT
38 TIMESTAMP '2026-04-21 14:25:00',
39 'GS',
40 1210000
41)
42
43SELECT
44 s.bucket,
45 s.symbol,
46 s.notional AS streaming,
47 c.notional AS canonical,
48 c.notional - s.notional AS delta,
49 ROUND(
50 100 * (
51 c.notional - s.notional
52 ) / NULLIF(c.notional, 0),
53 4
54 ) AS delta_bps
55FROM streaming AS s
56INNER JOIN settled AS c USING (bucket, symbol)
57ORDER BY s.bucket

The delta of 3.77 basis points on the 14:23 bucket is the audit signal. Most days it is zero. Days where the settlement system books a meaningfully late trade produce a non-zero delta. The reconciliation log accumulates these deltas, and a separate operational dashboard tracks the distribution. A persistent shift in the delta distribution is the leading indicator that something has changed: a producer has slowed down, a clearing partner has begun batching, or the streaming pipeline's lateness budget needs revisiting.

The composite design is the senior-engineer answer to the late-data question at scale. Streaming and batch are not alternatives; they are layers, each chosen for what it is good at, with an explicit handoff at the lateness boundary.

Do
  • Make the streaming pipeline and reconciliation pass write to the same target with a clear ownership rule per period
  • Track the reconciliation delta as a first-class operational metric with alerting thresholds
  • Document the lateness boundary as part of the data contract: 'streaming for current day, reconciliation for closed days'
Don't
  • Try to make streaming alone audit-grade; the cost asymptote is unfavorable
  • Run reconciliation without an audit log; the delta is the entire signal
  • Allow the two pipelines to disagree on partition keys; the overwrite must be a clean replacement
TIP
Bridge move: senior engineers describe this design as 'streaming for the current open period, reconciliation for closed periods, with the lateness budget as the boundary.' That single sentence captures the architecture and is the most concise way to state the principle when the conversation turns from theory to a real system.
PUTTING IT ALL TOGETHER

> A staff data engineer joins a payments fintech that runs a real-time dashboard for merchant settlement amounts plus a regulatory reporting system that files daily totals. The streaming pipeline ingests payment events from a Kafka topic fed by Debezium CDC against the operational Postgres. Three issues are reported the first week. First, the operational team renamed a column upstream and the streaming aggregation produced NULL revenue for two hours. Second, the daily regulatory file consistently disagrees with the streaming dashboard by 0.4 to 1.1 percent. Third, the engineering manager asks for a single design document that prevents recurrence. The design must compose schema policy, streaming guarantees, and a reconciliation pass.

The two-hour NULL revenue incident was a CDC schema policy failure. The rename should have been quarantined under the policies named in this lesson. The fix is to flip the schema-change handling in the Debezium connector to allow-additive plus quarantine-breaking, and to plan upstream renames as expand-contract migrations from the intermediate tier.
The 0.4 to 1.1 percent gap is the streaming pipeline's allowed-lateness tail combined with the reconciliation gap. The streaming pipeline at the firm runs with a 30-minute allowed lateness; events past that are dropped. A daily reconciliation pass that re-aggregates from the canonical settlement source closes the gap and overwrites the regulatory-grade output for closed days.
The reconciliation pass is built on the storage-layer foundations from lesson 3 (Iceberg atomic partition overwrite), the idempotency property from lesson 5 (overwrite is idempotent), and the data-contract framing from lesson 7 (the streaming and batch pipelines are co-owners of the regulatory output, with named ownership per partition).
The composite design from this lesson, plus the streaming concepts from lesson 2 and the storage evolution concepts from lesson 3, is the answer. Streaming serves real-time consumers within an explicit accuracy budget. Reconciliation serves audit-grade consumers from canonical sources. The schema registry plus CDC enforcement policy keeps both pipelines from being broken by upstream change. The reconciliation log is the metric that confirms the system is operating within its accuracy budget continuously.
KEY TAKEAWAYS
CDC schema enforcement is a policy decision, not a default: allow-additive flows safe changes through, block-breaking halts on destructive changes, quarantine routes incompatible rows for human review.
Watermark strategy follows the source's profile: bounded out-of-orderness, ascending, punctuated, and idle-aware variants each fit specific lateness and idleness shapes.
Allowed lateness trades engine state for streaming coverage: the curve has an asymptote near 100 percent that cannot be crossed by widening the streaming window alone.
Reconciliation passes close the gap streaming structurally cannot close: a periodic batch job that overwrites the streaming output with canonical values for closed periods, with the delta logged as an operational metric.
The composite design is the production answer: streaming for the current open period, reconciliation for closed periods, the lateness budget as the boundary, and the delta metric as the proof of accuracy.

At scale, schema policy and lateness budget become first-class design choices, not afterthoughts

Category
Pipeline Architecture
Difficulty
advanced
Duration
40 minutes
Challenges
0 hands-on challenges

Topics covered: Schema Enforcement in CDC Pipelines, Watermarks for Irregular Streams, Allowed Lateness Versus Accuracy, Reconciliation Passes, Late Trade Settlements

Lesson Sections

  1. Schema Enforcement in CDC Pipelines (concepts: paCdcSchemaPolicy, paSchemaQuarantine)

    Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker. What Mak

  2. Watermarks for Irregular Streams (concepts: paWatermarkStrategies)

    The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default. The Strategy Catal

  3. Allowed Lateness Versus Accuracy (concepts: paAllowedLatenessTradeoff)

    Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary. The Tradeoff Curve The righ

  4. Reconciliation Passes (concepts: paReconciliation)

    A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail

  5. Late Trade Settlements (concepts: paReconciliation, paStreamingPlusBatch, paLateData)

    A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both