Schema Evolution and Late Data: Advanced

A capital-markets data platform at a top-five investment bank ran a streaming pipeline that aggregated trade fills into per-symbol minute bars used by the firm's algorithmic desks. The streaming engine ran with a 30-second watermark and 5-minute allowed lateness. Trades that settled across the international date line, however, sometimes booked back to T-1 a full business day later. Those trades were dropped by the streaming engine and silently missing from the bars consumed by the algos. The discrepancy was caught only when a regulator's audit reconciled trade blotters to position reports and found a 0.03 percent gap. The fix was a nightly reconciliation pass: a batch job that re-aggregated the prior day's trades from the canonical settlement system and overwrote the corresponding minute bars in the streaming sink. The streaming pipeline serves real-time consumers; the batch pass serves the audit-grade truth. Both run, both write to the same table, and the contract is that the latest writer wins per minute bar. This lesson covers the production tradeoffs that drive these designs: how schemas evolve in CDC pipelines where DDL upstream becomes data downstream, how watermarks behave on irregular event streams, how lateness budgets trade memory for correctness, and how reconciliation jobs close the gap that streaming engines structurally cannot close.

What you will be able to do

Apply schema enforcement policies to CDC pipelines and route incompatible changes safely

Choose a watermark strategy appropriate to the irregularity profile of an event stream

Design a reconciliation pass that complements streaming output with audit-grade correctness

Schema Enforcement in CDC Pipelines

Daily Life

Interviews

Apply additive, block-breaking, or quarantine policies to a CDC pipeline and reason about the operational tradeoffs of each.

Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker.

What Makes CDC Schema-Fragile

Property	Implication for Schema
DDL is data	Each schema change becomes an event in the change stream that downstream must process
Producer cannot pre-validate	The DBA does not consult a registry before running ALTER TABLE; the change is already in the WAL
Mixed compatibility shapes	Old rows in the stream still carry the old shape; new rows carry the new shape
Cascading downstream	Lakehouse tables, search indexes, ML feature pipelines all read from the same change stream
Operational urgency	Production DBs change fast; CDC pipelines must absorb or block within minutes

Debezium, AWS DMS, Maxwell, and similar tools all surface schema-change events as part of the stream. Each tool encodes the change differently, but the pattern is the same: a record arrives in the topic indicating that the upstream table now has a new column, a renamed column, or a dropped column. Downstream consumers see the schema-change event before they see the next data event under the new shape. The window of time between the two events is the policy decision space.

The Three Enforcement Policies

Allow AdditiveBlock BreakingQuarantine Incompatible

Allow Additive

The pragmatic default

New columns flow through. Drops and renames are blocked or quarantined. The CDC pipeline keeps running for the additive case, which is more than 80 percent of real-world changes.

Block Breaking

The strict policy

Any non-additive change halts the pipeline. An on-call engineer is paged to evaluate before any data flows under the new schema. Used for audit-grade systems where silent drift is unacceptable.

Quarantine Incompatible

The middle ground

Incompatible rows are routed to a quarantine topic. The main pipeline continues with the additive subset. A human reviews quarantine before merging. Common in lakehouse ingestion.

Policy choice depends on consumer tolerance. A search index ingestion pipeline can run under allow-additive: a missing column degrades search relevance gracefully. A regulatory reporting pipeline cannot: a missing column corrupts a filing. A lakehouse ingestion pipeline often runs under quarantine: incompatible rows are not lost, but they do not flow into the canonical curated layer until the schema decision is made.

Implementation in Debezium and Iceberg

	// Debezium connector configuration with schema-change topic enabled
	{
	"name": "payments-cdc",
	"config": {
	"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
	"database.hostname": "prod-db.example.com",
	"database.dbname": "payments",
	"plugin.name": "pgoutput",
	"slot.name": "debezium_slot",
	"publication.name": "debezium_pub",
	"include.schema.changes": true,
	"schema.history.internal.kafka.topic": "_schema_history.payments",
	"key.converter.schemas.enable": true,
	"value.converter": "io.confluent.connect.avro.AvroConverter",
	"value.converter.schema.registry.url": "http://registry:8081",
	"value.converter.schema.compatibility.level": "BACKWARD"
	}
	}

	// The schema registry's BACKWARD setting blocks producer-side incompatible changes.
	// Schema-change events flow through the schema-change topic for downstream consumers.

	# Iceberg target with allow-additive policy on the sink side
	from pyiceberg.catalog import load_catalog

	catalog = load_catalog('prod')
	table = catalog.load_table('payments.transactions')

	def apply_cdc_event(event):
	if event['op'] == 'SCHEMA_CHANGE':
	change = event['payload']
	if change['kind'] == 'ADD_COLUMN':
	with table.update_schema() as update:
	update.add_column(change['name'], change['type'])
	elif change['kind'] in ('DROP_COLUMN', 'RENAME_COLUMN', 'TYPE_CHANGE'):
	# Block: route the entire change-data event to quarantine
	quarantine.put(event)
	page_oncall(
	f"Incompatible CDC change for transactions: {change['kind']}"
	)
	raise CdcPolicyViolation('non-additive change blocked')
	else:
	quarantine.put(event)
	else:
	table.append(event['payload'])

What enforcement policies must specify:

▸Which DDL classes are allowed without a human in the loop
▸Where blocked events go (DLQ, schema-quarantine topic, halt-and-page)
▸Who owns the on-call response and within what SLA
▸How the policy interacts with the schema registry's compatibility setting

•No Schema Policy

Every DDL change is a potential outage
Drift is silent until consumer dashboards break
On-call response is improvised under pressure
Recovery requires reconstructing the change history from logs

✓Explicit Policy

Additive changes flow through automatically; the team handles only the rare destructive case
Drift is impossible: the policy either accepts, blocks, or quarantines explicitly
On-call response is rehearsed: the policy names the action
Schema-change events are an audit trail by construction

	/* Classify a sample stream of CDC schema-change events */
	WITH events AS (
	SELECT
	1 AS evt_id,
	'transactions' AS tbl,
	'ADD_COLUMN' AS kind,
	'merchant_country' AS field

	UNION ALL

	SELECT
	2,
	'transactions',
	'ADD_COLUMN',
	'risk_score'

	UNION ALL

	SELECT
	3,
	'transactions',
	'RENAME_COLUMN',
	'amount_cents -> amount_minor'

	UNION ALL

	SELECT
	4,
	'transactions',
	'DROP_COLUMN',
	'legacy_flag'

	UNION ALL

	SELECT
	5,
	'transactions',
	'TYPE_CHANGE',
	'amount: int -> bigint'
	)

	SELECT
	evt_id,
	tbl,
	kind,
	field,
	CASE
	WHEN kind = 'ADD_COLUMN' THEN 'apply'
	WHEN kind = 'TYPE_CHANGE'
	AND field LIKE '%int -> bigint%' THEN 'apply_widening'
	WHEN kind IN (
	'RENAME_COLUMN',
	'DROP_COLUMN'
	) THEN 'quarantine_and_page'
	ELSE 'quarantine'
	END AS policy_action
	FROM events
	ORDER BY evt_id

The most expensive CDC outages happen when the upstream DBA assumes the data team will absorb a destructive change overnight. With a quarantine policy, the change does not absorb silently. It surfaces as a halted topic and a page within minutes.

The cost-benefit calculation for each policy depends on how often destructive changes happen and what the consumer side can tolerate. A platform that ingests from twenty operational databases owned by ten different teams will see a destructive change every few weeks. With allow-additive plus quarantine, the operations cost is bounded: the on-call engineer reviews the quarantine queue, validates that the change was intentional, and either applies it manually or works with the upstream team to revert. With block-breaking, the same change halts ingestion entirely, which protects audit consumers but disrupts every consumer downstream. With no policy at all, the change flows through and silently corrupts downstream state. The right choice depends on which kind of pain the team can absorb most cheaply, and that answer is rarely the same across all subjects in the platform.

The schema-history topic deserves its own attention. Debezium emits a parallel topic that records every DDL statement observed on the source database, in order. The reason is that to replay the change stream from any historical offset, the engine must know what schema each record was written under. The history topic is what makes replay deterministic. A team that disables schema history saves a small amount of broker storage and gives up the ability to reprocess history correctly; the trade is rarely worth it. The history topic is the same idea as a schema registry, applied at the connector level rather than the topic level.

✓Do

Separate schema-change events from data events in the topic layout
Pair the registry's compatibility mode with a downstream sink policy that matches
Put a runbook entry next to every quarantine action: who responds, in what SLA, with what tools

✗Don't

Treat ALTER TABLE upstream as a data team problem; it is a contract violation
Allow rename and drop to flow through silently; the cost shows up in audit
Skip the schema-history topic; without it, replays cannot reconstruct the right shape per offset

TIP

When designing CDC enforcement, write the runbook for the destructive case before writing the connector configuration. The connector is the easy part; the operational response to a blocked change is what the policy actually buys.

Watermarks for Irregular Streams

Daily Life

Interviews

Choose a watermark strategy appropriate to a source's lateness and idleness profile, and reason about the failure modes of each.

The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default.

The Strategy Catalog

Strategy	Watermark Computation	Failure Mode If Misapplied
Ascending timestamps	watermark = max event_time observed	Any out-of-order event is dropped; brittle in real systems
Bounded out-of-orderness	watermark = max event_time - fixed_lag	Long-tail late events past the lag are dropped silently
Punctuated	watermark advances on explicit marker events from producer	Stalls forever if producer fails to emit markers
Periodic with idle detection	Bounded out-of-orderness plus advance-on-idle when no events arrive	Idle-advance can leak past genuine slow producers
Per-key with global aggregation	Each key tracks its own watermark; global = min across keys	One slow key can stall every output globally

Each strategy makes a different bet about what the stream looks like. Picking the wrong strategy means either dropping legitimate events or stalling the pipeline. The choice should follow from a measured profile of the source: lateness percentiles, idle-period distribution, presence or absence of explicit markers.

When Each Strategy Wins

Ascending timestamps: a single ordered Kafka topic where the producer is the database WAL and offsets imply order.

Bounded out-of-orderness: web-tier event streams with a stable, narrow lateness distribution (clickstream, server logs).

Punctuated: financial market data with end-of-session markers; advertising auctions with end-of-window beacons.

Periodic with idle detection: IoT telemetry where some sensors go quiet for hours but the pipeline must keep producing output.

Per-key watermarks: a tempting trap; correctness wins per key, but global aggregations stall on the slowest key.

The Idle Source Problem

A stream from a single Kafka partition that goes quiet does not advance the watermark. If a downstream aggregation depends on this partition, it stalls until the partition produces an event. In a multi-source pipeline, the global watermark is the minimum across sources, so one idle source freezes the entire output. The fix is idle-detection: after a configurable timeout with no events, the engine treats the partition as caught up and advances its watermark on a wall-clock heartbeat.

	# Flink watermark strategy with idle detection on a multi-source stream
	from pyflink.common import WatermarkStrategy, Duration

	strategy = (
	WatermarkStrategy
	.for_bounded_out_of_orderness(Duration.of_seconds(30))
	.with_idleness(Duration.of_minutes(2)) # idle source advances after 2 min
	.with_timestamp_assigner(EventTimeAssigner())
	)

	# Without with_idleness, an idle partition would freeze the global watermark.
	# With it, after 2 minutes of silence the partition is treated as caught up.
	# Risk: a genuinely slow producer is mistaken for an idle one; events arriving
	# after the idle-advance are now late, even if the partition was not really idle.

Punctuated Watermarks in Detail

	Punctuated watermarks : the producer emits explicit watermark events.Stream of trade events plus END - of - session marker : { type = 'trade', symbol = 'AAPL', event_time = 09 : 30 : 01, price =...} { type = 'trade', symbol = 'AAPL', event_time = 09 : 30 : 03, price =...} { type = 'trade', symbol = 'AAPL', event_time = 09 : 30 : 08, price =...} { type = 'session_close', session_end = 16 : 00 : 00 } < - watermark advance The consumer treats the session_close event AS a watermark advance to 16 : 00 : 00. ALL trades WITH event_time <= 16 : 00 : 00 are now accounted for.Windows whose END IS <= 16 : 00 : 00 close immediately


	ON this marker.Trade - off : relies

	ON the producer to emit markers reliably.If the producer misses one, the watermark stalls until the next marker arrives.

•Bounded Out-of-Orderness

Watermark = max_event_time - fixed_lag
Engine computes watermark from observed events
Late tail past fixed_lag is dropped
Tunable on the consumer side

✓Punctuated

Watermark = explicit marker events from producer
Engine trusts producer to declare segment boundaries
Late tail caught precisely when producer says session is done
Tunable only by changing producer logic

Watermark profiling checklist before choosing a strategy:

▸Measure lateness percentiles: 50th, 95th, 99th, 99.9th from production data
▸Identify idle-period distribution: how often do partitions go quiet, for how long
▸Check producer capability: can it emit explicit markers, or is the engine on its own
▸Map global vs per-key correctness needs: where does aggregation happen

Per-key watermarks deserve a closer look because they are the most appealing-but-dangerous strategy in the catalog. The appeal is obvious: each user, device, or partition has its own progress, so an idle key does not stall an active one. The danger is that any global aggregation must take the minimum watermark across all keys, which means the slowest key controls the whole output. In a system with millions of keys and a long tail of stragglers, the global watermark stalls almost always. The strategy works in practice only when aggregations are scoped per key (no global view) or when the engine supports per-key emission with key-scoped watermarks downstream. Reaching for per-key watermarks without understanding this constraint is a common cause of streaming pipelines that mysteriously stop emitting output entirely.

Idle detection has its own subtle failure mode. A genuine slow producer that emits one event per hour looks identical to an idle producer that has stopped entirely. If the idle timeout is set to thirty minutes, the slow producer's event arrives after the engine has already advanced the watermark and closed the relevant window. The slow event is now late by definition, even though it was on time from the producer's perspective. The fix is to tune the idle timeout against the slowest legitimate producer's interval, not against the median. This calibration is often the difference between a streaming pipeline that handles real-world traffic and one that quietly drops slow producers' contributions.

The Apache Beam programming model abstracts watermark generation behind the WatermarkEstimator interface. Custom estimators are common in production: a financial firm may write an estimator that combines bounded out-of-orderness with punctuated markers from the trading session calendar, advancing aggressively when a session is known to be closed and conservatively otherwise. The cost of the custom estimator is justified when the source has a non-default lateness profile.

TIP

Watermark strategy is a pipeline-design decision, not a deploy-time tweak. Profile the source for at least a week of production data before locking in numbers. The lateness distribution rarely matches the team's intuition.

Source

producer

compat check

registry

Kafka

topic

Transform

consumer

Storage

warehouse

Schema evolution stays safe when a compatibility check sits between producer and topic: adding a column is backward-compatible and passes; a rename or type change is breaking and is caught before it reaches consumers.

Allowed Lateness Versus Accuracy

Daily Life

Interviews

Quantify the tradeoff between allowed-lateness budget, engine state cost, and streaming-output coverage, and choose a percentile target accordingly.

Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary.

The Tradeoff Curve

Allowed Lateness	Coverage	State Cost	Output Stability
0 seconds	Drops everything past watermark	Minimum: state released at window close	Maximum: window result is final on first emit
5 minutes	Catches network-level lateness	Low: state held briefly	High: late re-fires are rare
1 hour	Catches most mobile-tail lateness	Moderate: state size grows linearly with window count	Moderate: late re-fires happen but are bounded
24 hours	Catches multi-hour producer outages	High: 1440 minutes of windows held simultaneously	Low: results may keep changing all day
Unbounded	Catches every late event ever	Catastrophic: state grows without bound	None: no result is ever final

The right point depends on what the consumer does with the output. A real-time pricing display can tolerate occasional updates within an hour but not after the trading session closes. An audit-grade ledger needs eventual completeness, but the streaming pipeline does not have to deliver it; a separate reconciliation job can. A monitoring metric needs none of this; processing time is fine.

State Cost Math

Approximate state cost for accumulating - mode windows : state_size ~(allowed_lateness / window_size) * per_window_state Example : window_size = 5 minutes allowed_lateness = 60 minutes per_window_state = 8 KB per active KEY(aggregation totals + metadata) active keys = 50, 000 windows held = 60 / 5 = 12 simultaneously state per KEY = 12 * 8 KB = 96 KB total state = 50, 000 * 96 KB = ~ 4.8 GB Doubling allowed lateness to 120 minutes doubles the state to 9.6 GB.Tripling KEY CARDINALITY to 150, 000 brings state to ~ 14.4 GB AT 60 minutes OR ~ 28.8 GB AT 120 minutes.State lives IN the engine 's RocksDB-style store. Spillover to disk is
possible but slow; sustained checkpointing of large state is its own
bottleneck.'

Coverage Math

Lateness is a long-tail distribution in nearly every real source. The 95th percentile may be a few seconds; the 99th may be a few minutes; the 99.9th may be hours; the worst case may be days. Allowed lateness draws a line on this distribution. Events on the left side of the line are caught by the streaming pipeline. Events on the right side need a different mechanism. The decision is not 'how late can events be' but 'where should the streaming pipeline stop, given that the rest will be handled elsewhere.'

	/* Compute coverage at a sample lateness budget */
	WITH sample_lateness_seconds AS (
	SELECT
	0 AS evt

	UNION ALL

	SELECT
	2

	UNION ALL

	SELECT
	5

	UNION ALL

	SELECT
	12

	UNION ALL

	SELECT
	30

	UNION ALL

	SELECT
	60

	UNION ALL

	SELECT
	240

	UNION ALL

	SELECT
	900

	UNION ALL

	SELECT
	3600

	UNION ALL

	SELECT
	14400

	UNION ALL

	SELECT
	86400
	),
	budgets AS (
	SELECT
	60 AS budget_seconds

	UNION ALL

	SELECT
	300

	UNION ALL

	SELECT
	3600

	UNION ALL

	SELECT
	86400
	)

	SELECT
	budget_seconds,
	COUNT(*) FILTER (
	WHERE evt <= budget_seconds
	) AS caught,
	COUNT(*) AS total,
	ROUND(
	100 * COUNT(*) FILTER (
	WHERE evt <= budget_seconds
	) / COUNT(*),
	1
	) AS pct_covered
	FROM sample_lateness_seconds
	CROSS JOIN budgets
	GROUP BY budget_seconds
	ORDER BY budget_seconds

On this synthetic distribution, even a 1-hour budget catches only 82 percent of events. A 24-hour budget catches all of them, at proportionally higher state cost. The numbers in production look similar: the long tail is heavy enough that no reasonable streaming budget gets to 100 percent. The streaming pipeline gets close; the reconciliation pass gets the last percent.

How to choose the lateness budget:

▸Profile production lateness for at least one full week of representative traffic
▸Pick a percentile target (95th for most consumer dashboards, 99th for important reports)
▸Compute the state cost at that percentile and verify it fits the engine budget
▸Plan the reconciliation job for everything past the budget; this is not optional

Tight BudgetModerate BudgetWide Budget

Tight Budget

Zero to five minutes

Minimum engine state, near-final results on first emit. Drops every event beyond the budget; relies on a separate reconciliation pass for completeness.

Moderate Budget

Thirty to ninety minutes

Catches the bulk of mobile retry tail and brief producer outages. Engine state grows linearly with budget; downstream sinks must support upserts on the window key.

Wide Budget

Hours to a day

Approaches eventual completeness for most workloads. State and checkpoint costs become significant; recovery from outages slows. Rarely the right answer when reconciliation is available.

•Stretch Lateness Budget

More events caught by streaming
Engine state grows linearly with budget
Output keeps updating; consumers need idempotent sinks
Window completeness still bounded by budget

✓Reconcile in Batch

Streaming budget stays small; engine cost stable
Late events handled by a separate batch job
Streaming output is approximate; batch output is canonical
Eventual completeness is unbounded; reconciliation can scan any history

Allowed lateness interacts with engine checkpointing in ways that often surprise teams new to streaming at scale. Every active window holds state that must be checkpointed periodically for fault tolerance. Larger lateness budgets mean more concurrent active windows, which mean larger checkpoints, which mean longer checkpoint durations, which mean longer recovery times when the engine restarts. A pipeline configured with a 24-hour lateness budget can take tens of minutes to restart from a checkpoint at high cardinality, during which the streaming output is stale. The recovery time is itself a kind of lateness; widening the lateness budget to chase correctness can make the engine itself slower to recover from outages, which produces a different kind of late data.

Stretching the lateness budget to chase the last percent of correctness is the most common antipattern in streaming design. The cost grows linearly while coverage approaches an asymptote. The right answer is to accept the asymptote and pair the streaming pipeline with a reconciliation job.

✓Do

Pick a percentile target tied to consumer requirements, not to engine capacity
Treat the streaming output as approximate when the budget is below the 99.9th percentile
Capacity-plan engine state at the chosen budget, including 2x headroom for traffic spikes

✗Don't

Stretch lateness past the point where state cost dominates engine compute
Assume the streaming pipeline alone produces audit-grade output
Tune budgets without measuring the source's actual distribution

TIP

When a downstream consumer demands eventual completeness, deliver it with batch reconciliation, not by widening the streaming window. Reconciliation is bounded by storage, not by engine memory. Streaming budget is bounded by engine memory. The two have different scaling limits, and the right design uses each for what it is good at.

Reconciliation Passes

Daily Life

Interviews

Design a reconciliation pass that complements a streaming pipeline with audit-grade correctness and produces an explicit delta metric.

A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail late events are settled before it begins.

What a Reconciliation Pass Does

Step	Action
Wait	Hold off until the lateness window for the period is fully past (e.g., next-day at 2am)
Read source of truth	Pull from the system that owns the data (operational DB, settlement system, raw lake)
Re-aggregate	Compute the same aggregation the streaming pipeline computes, over the closed window
Compare	Diff against the streaming output for the period; record the delta as a metric
Overwrite	Replace the streaming output for that period atomically (partition overwrite or MERGE)
Audit	Persist the reconciliation log: what changed, by how much, when, and why

The pass is straightforward in concept. The complexity is in the contract with the streaming pipeline. The streaming pipeline must be designed to be overwritten without breaking real-time consumers. The reconciliation must be designed to write atomically so consumers do not see partial state. The two pipelines must agree on the partition keys and time windows so the overwrite operation is well-defined.

The Atomic Overwrite Pattern




	CREATE OR REPLACE TEMPORARY TABLE recon_yesterday AS
	SELECT
	DATE_TRUNC('minute', settlement_time) AS minute_bucket,
	symbol,
	SUM(quantity * price) AS notional,
	COUNT(*) AS trade_count
	FROM canonical.settlement_system
	WHERE settlement_time >= CURRENT_DATE - INTERVAL '1 day' AND settlement_time < CURRENT_DATE
	GROUP BY 1, 2 ; INSERT OVERWRITE INTO streaming.minute_bars PARTITION(minute_bucket)


	SELECT
	minute_bucket,
	symbol,
	notional,
	trade_count
	FROM recon_yesterday ; INSERT INTO ops.reconciliation_log


	SELECT
	CURRENT_TIMESTAMP AS run_at,
	'minute_bars' AS target,
	COUNT(*) AS rows_overwritten,
	SUM(ABS(s.notional - r.notional)) AS delta_notional
	FROM streaming.minute_bars s
	JOIN recon_yesterday r
	USING(minute_bucket, symbol) ;

Why It Cannot Be Skipped

The streaming pipeline drops events past allowed lateness. Without reconciliation, those events are silently missing from the answer.

Schema policy in CDC may quarantine events. Reconciliation against the source-of-truth captures them once the schema decision is made.

Reconciliation produces an explicit delta metric. The delta is the streaming pipeline's accuracy, measured rather than assumed.

Reconciliation closes the audit loop: 'the streaming output for any closed period agrees with the source of truth' is a property the team can prove, not hope for.

Cadence and Window

Cadence	Window	Use Case
Hourly	Last hour, run with 1-hour delay	Operational metrics where low-latency reconciliation matters
Daily	Yesterday, run at 2am next day	Most reporting workloads; aligns with downstream batch SLAs
Weekly	Prior week, run Monday morning	Audit-grade reporting where weekly close is the contract
Monthly	Prior month, run on day 1 of next month	Regulatory filings; revenue recognition; period-close systems

Cadence aligns with the consumer's reporting horizon. A daily revenue dashboard reconciles overnight. A regulatory filing reconciles monthly. The window must be long enough that the source of truth has settled, which usually means the reconciliation runs at least one full lateness budget after the streaming output for that period closes.

Design constraints for the reconciliation pass:

▸Idempotent: re-running must produce the same answer; partition overwrite is the simplest form
▸Atomic: consumers never observe partial state during the overwrite
▸Observable: every run produces a delta metric, persisted in an audit log
▸Bounded compute: the pass scans only the window it reconciles, never full history
▸Coordinated with streaming: the streaming pipeline must tolerate being overwritten

	/* Compute the reconciliation delta for a sample period */
	WITH streaming AS (
	SELECT
	TIMESTAMP '2026-04-24 09:30:00' AS minute_bucket,
	'AAPL' AS symbol,
	100000 AS notional

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-24 09:30:00',
	'GOOG',
	85000

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-24 09:31:00',
	'AAPL',
	72000

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-24 09:31:00',
	'GOOG',
	91000
	),
	recon AS (
	SELECT
	TIMESTAMP '2026-04-24 09:30:00' AS minute_bucket,
	'AAPL' AS symbol,
	100200 AS notional /* late trade caught */

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-24 09:30:00',
	'GOOG',
	85000

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-24 09:31:00',
	'AAPL',
	72000

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-24 09:31:00',
	'GOOG',
	91450 /* late trade caught */
	)

	SELECT
	s.minute_bucket,
	s.symbol,
	s.notional AS streaming_notional,
	r.notional AS canonical_notional,
	r.notional - s.notional AS delta
	FROM streaming AS s
	INNER JOIN recon AS r USING (minute_bucket, symbol)
	ORDER BY s.minute_bucket, s.symbol

•Streaming Only

Real-time but approximate output
Late tail silently dropped past the budget
No measurable accuracy guarantee
Audit-grade consumers must build their own truth

✓Streaming + Reconciliation

Real-time output for live consumers
Audit-grade output for closed periods
Delta metric quantifies streaming accuracy continuously
Audit-grade consumers read from the same target table

The reconciliation log is one of the most important operational artifacts in any streaming system. Its trend over time tells the team whether the streaming pipeline is degrading, the lateness profile is shifting, or upstream behavior has changed.

✓Do

Run reconciliation at the cadence the audit consumer needs, not at the cadence the streaming team prefers
Persist the delta metric with rich dimensions (time, key, source) so investigations can root-cause
Test that consumers can read while the overwrite is in flight; transactional table formats matter here

✗Don't

Skip the audit log; without it, the team cannot prove what was reconciled when
Reconcile inside the streaming engine; batch tools are the right home for full-window scans
Let the streaming pipeline write outside its allowed-lateness window; the contract relies on that boundary

TIP

When designing the streaming pipeline and the reconciliation pass together, decide which one owns each minute of history. The streaming pipeline owns the open period; the reconciliation owns closed periods. The handoff is the moment the lateness window expires.

Late Trade Settlements

Daily Life

Interviews

Compose schema policy, watermark strategy, allowed lateness, and reconciliation into a system that satisfies real-time and audit-grade consumers at the same time.

A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both consumer needs.

The Architecture

Trade fill events Kafka | | CDC SCHEMA policy : allow - additive, quarantine breaking v Schema-validated trade stream | | bounded out - of - orderness, watermark = MAX - 30 s | 5 - minute event - TIME tumbling windows | allowed lateness = 60 minutes, accumulating mode v [ Streaming aggregate : minute_bars TABLE, partitioned BY minute_bucket ] ^ | INSERT OVERWRITE for yesterday 's partitions, atomically       
          |                                                                
[Daily reconciliation pass: read settlement system, re-aggregate]         
          ^                                                                
          |                                                                
[Settlement system: source of truth, produced T+1]                        '

The streaming pipeline serves the algorithmic desks during market hours. The reconciliation pass overwrites yesterday's bars overnight, after the settlement system has finalized. The minute_bars table has two writers; the contract is that the daily reconciliation is canonical for any closed day, and the streaming pipeline owns the current day.

What Each Layer Handles

Concern	Streaming Pipeline	Reconciliation Pass
Real-time consumer SLA	Sub-minute freshness	Not its job
Audit-grade correctness	Within 1 to 2 percent	Within audit tolerance (basis points)
Schema drift	Allow-additive policy applied via Iceberg evolve_schema	Reads canonical settlement system schema directly
Late trade fills	Caught up to 60 minutes late, dropped past that	Caught at any lateness up to T+1
Output stability	May re-fire within the lateness window	Writes once per day, atomic partition overwrite
Compute cost	Continuous, bounded engine state	Bounded scan of yesterday's settlement window

The Two Failure Cases the Design Handles

	CASE A : A trade fill settles 14 minutes late.Event TIME : 14 : 23 : 08 Arrival TIME : 14 : 37 : 12 Lateness : 14 minutes 4 seconds Streaming pipeline : WITHIN 60 - minute allowed lateness.The minute_bars ROW for 14 : 23 was already emitted WITH 1.2 million notional.Late event admitted, WINDOW re - fires WITH updated notional 1.247 million.Algorithmic desk reads the updated value WITHIN seconds.Reconciliation : confirms the streaming value the next morning.Delta = 0. CASE B : A




	CROSS - border trade fill settles next - day AT T + 1. Event TIME : 14 : 23 : 08

	ON Tuesday Arrival TIME : 09 : 14 : 00
	ON Wednesday(settled overnight after FX confirmation) Lateness : 18 hours 50 minutes Streaming pipeline : past 60 - minute allowed lateness.Event dropped

	FROM streaming output.Routed to a late - event DLQ for visibility.The 14 : 23 minute_bar
	FROM Tuesday remains AT 1.2 million IN streaming.Reconciliation : runs Wednesday AT 02 : 00. Reads canonical settlement system, finds the late fill.Re - aggregates Tuesday 14 : 23 minute bar to 1.247 million notional.INSERT OVERWRITE replaces the streaming value for that minute_bucket.Delta = 47, 000 notional, logged IN reconciliation audit TABLE.Algorithmic desk : sees the corrected value WHEN querying Tuesday 's history Wednesday morning.'

The two cases capture the structural property of the design. Cases inside the streaming budget are handled by the streaming pipeline alone. Cases outside the budget are handled by the reconciliation pass. No event is silently lost. Every late event is either caught by the streaming window or surfaced in the next reconciliation run, and the delta between the two outputs is logged as a metric.

The Schema Dimension

The trade fill event payload changes too. The settlement system added a tax_jurisdiction field two months ago and renamed clearing_venue to clearing_facility a month before that. The CDC pipeline applied an allow-additive policy: the tax_jurisdiction addition flowed through automatically and downstream Iceberg tables grew the new column. The clearing_facility rename was quarantined; the on-call engineer ran a coordinated rename via expand-contract over four weeks, with both columns dual-written in the settlement system before the streaming pipeline cut over. The history of those changes lives in the schema-history topic and the reconciliation log.

How the prior lessons compose into this system:

▸Lesson 2 streaming/batch: the streaming pipeline serves the live desk, the batch reconciliation serves the audit consumer; both freshness tiers coexist
▸Lesson 3 storage layer: Iceberg's atomic partition overwrite is what makes the daily reconciliation safe to deploy
▸Lesson 5 idempotency: partition overwrite is the idempotent write pattern that lets reconciliation re-run without harm
▸Lesson 7 contracts: the data contract between settlement system and platform names the schema, freshness, and reconciliation cadence
▸This lesson: schema policy applied to CDC, watermark plus allowed lateness applied to streaming, reconciliation applied to close the gap

	/* Compute end-of-day reconciliation delta for the example */
	WITH streaming AS (
	SELECT
	TIMESTAMP '2026-04-21 14:23:00' AS bucket,
	'GS' AS symbol,
	1200000 AS notional

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-21 14:24:00',
	'GS',
	1180000

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-21 14:25:00',
	'GS',
	1210000
	),
	settled AS (
	SELECT
	TIMESTAMP '2026-04-21 14:23:00' AS bucket,
	'GS' AS symbol,
	1247000 AS notional

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-21 14:24:00',
	'GS',
	1180000

	UNION ALL

	SELECT
	TIMESTAMP '2026-04-21 14:25:00',
	'GS',
	1210000
	)

	SELECT
	s.bucket,
	s.symbol,
	s.notional AS streaming,
	c.notional AS canonical,
	c.notional - s.notional AS delta,
	ROUND(
	100 * (
	c.notional - s.notional
	) / NULLIF(c.notional, 0),
	4
	) AS delta_bps
	FROM streaming AS s
	INNER JOIN settled AS c USING (bucket, symbol)
	ORDER BY s.bucket

The delta of 3.77 basis points on the 14:23 bucket is the audit signal. Most days it is zero. Days where the settlement system books a meaningfully late trade produce a non-zero delta. The reconciliation log accumulates these deltas, and a separate operational dashboard tracks the distribution. A persistent shift in the delta distribution is the leading indicator that something has changed: a producer has slowed down, a clearing partner has begun batching, or the streaming pipeline's lateness budget needs revisiting.

The composite design is the senior-engineer answer to the late-data question at scale. Streaming and batch are not alternatives; they are layers, each chosen for what it is good at, with an explicit handoff at the lateness boundary.

✓Do

Make the streaming pipeline and reconciliation pass write to the same target with a clear ownership rule per period
Track the reconciliation delta as a first-class operational metric with alerting thresholds
Document the lateness boundary as part of the data contract: 'streaming for current day, reconciliation for closed days'

✗Don't

Try to make streaming alone audit-grade; the cost asymptote is unfavorable
Run reconciliation without an audit log; the delta is the entire signal
Allow the two pipelines to disagree on partition keys; the overwrite must be a clean replacement

TIP

Bridge move: senior engineers describe this design as 'streaming for the current open period, reconciliation for closed periods, with the lateness budget as the boundary.' That single sentence captures the architecture and is the most concise way to state the principle when the conversation turns from theory to a real system.

❯❯❯PUTTING IT ALL TOGETHER

> A staff data engineer joins a payments fintech that runs a real-time dashboard for merchant settlement amounts plus a regulatory reporting system that files daily totals. The streaming pipeline ingests payment events from a Kafka topic fed by Debezium CDC against the operational Postgres. Three issues are reported the first week. First, the operational team renamed a column upstream and the streaming aggregation produced NULL revenue for two hours. Second, the daily regulatory file consistently disagrees with the streaming dashboard by 0.4 to 1.1 percent. Third, the engineering manager asks for a single design document that prevents recurrence. The design must compose schema policy, streaming guarantees, and a reconciliation pass.

The two-hour NULL revenue incident was a CDC schema policy failure. The rename should have been quarantined under the policies named in this lesson. The fix is to flip the schema-change handling in the Debezium connector to allow-additive plus quarantine-breaking, and to plan upstream renames as expand-contract migrations from the intermediate tier.

The 0.4 to 1.1 percent gap is the streaming pipeline's allowed-lateness tail combined with the reconciliation gap. The streaming pipeline at the firm runs with a 30-minute allowed lateness; events past that are dropped. A daily reconciliation pass that re-aggregates from the canonical settlement source closes the gap and overwrites the regulatory-grade output for closed days.

The reconciliation pass is built on the storage-layer foundations from lesson 3 (Iceberg atomic partition overwrite), the idempotency property from lesson 5 (overwrite is idempotent), and the data-contract framing from lesson 7 (the streaming and batch pipelines are co-owners of the regulatory output, with named ownership per partition).

The composite design from this lesson, plus the streaming concepts from lesson 2 and the storage evolution concepts from lesson 3, is the answer. Streaming serves real-time consumers within an explicit accuracy budget. Reconciliation serves audit-grade consumers from canonical sources. The schema registry plus CDC enforcement policy keeps both pipelines from being broken by upstream change. The reconciliation log is the metric that confirms the system is operating within its accuracy budget continuously.

KEY TAKEAWAYS

CDC schema enforcement is a policy decision, not a default: allow-additive flows safe changes through, block-breaking halts on destructive changes, quarantine routes incompatible rows for human review.

Watermark strategy follows the source's profile: bounded out-of-orderness, ascending, punctuated, and idle-aware variants each fit specific lateness and idleness shapes.

Allowed lateness trades engine state for streaming coverage: the curve has an asymptote near 100 percent that cannot be crossed by widening the streaming window alone.

Reconciliation passes close the gap streaming structurally cannot close: a periodic batch job that overwrites the streaming output with canonical values for closed periods, with the delta logged as an operational metric.

The composite design is the production answer: streaming for the current open period, reconciliation for closed periods, the lateness budget as the boundary, and the delta metric as the proof of accuracy.

At scale, schema policy and lateness budget become first-class design choices, not afterthoughts

Category: Pipeline Architecture
Difficulty: advanced
Duration: 40 minutes
Challenges: 0 hands-on challenges

Topics covered: Schema Enforcement in CDC Pipelines, Watermarks for Irregular Streams, Allowed Lateness Versus Accuracy, Reconciliation Passes, Late Trade Settlements

Lesson Sections

Schema Enforcement in CDC Pipelines (concepts: paCdc)
Change Data Capture pipelines turn DDL changes upstream into data events downstream. When an operational database schema changes, the CDC pipeline does not get to vote. The change is observed, serialized, and shipped as part of the event stream. This makes CDC pipelines the most schema-fragile category of pipeline in production. A single ALTER TABLE in a payments database can cascade into hundreds of downstream consumers in the time it takes the change-data event to traverse the broker. What Mak
Watermarks for Irregular Streams (concepts: paStreamProcessing)
The intermediate tier introduced bounded out-of-orderness as the default watermark strategy. In production, very few real streams behave the way that strategy assumes. Streams from mobile devices have multi-modal lateness distributions. Streams from IoT sensors have idle periods longer than any reasonable timeout. Streams from financial systems carry markers that explicitly declare segment boundaries. Picking a watermark strategy is a per-source design decision, not a default. The Strategy Catal
Allowed Lateness Versus Accuracy (concepts: paLateData)
Allowed lateness is a budget. The engine holds window state for some configurable duration after the watermark closes the window, accepting and merging late events during that hold. The longer the budget, the more state the engine consumes. The shorter the budget, the more events get dropped past the boundary. There is no setting that gets both. The work is choosing the right point on the curve for the workload, and knowing what to do for the events past the boundary. The Tradeoff Curve The righ
Reconciliation Passes (concepts: paLambdaArch)
A reconciliation pass is a periodic batch job that reads the canonical source data over a closed historical window and overwrites the corresponding rows of the streaming output. The streaming pipeline serves real-time consumers with approximate-but-fast output. The reconciliation pass serves audit-grade consumers with eventually-correct output. Both write to the same target. The contract is that the latest writer wins per partition, and the reconciliation runs late enough that even the long-tail
Late Trade Settlements (concepts: paLateData)
A capital-markets data platform delivers per-symbol minute bars to algorithmic trading desks. The bars must be available within seconds of the minute closing. They must also be reconcilable to the firm's official settlement system within audit tolerance. Trade fills sometimes settle late: cross-border trades that book through a foreign settlement window can land back-dated by a full business day. The design composes a streaming pipeline, a schema policy, and a reconciliation pass to satisfy both