Idempotency and Backfill: Advanced

A payment processor at scale ran a streaming aggregator that summed daily revenue across all merchants. The system advertised exactly-once semantics. One Sunday afternoon a Kafka broker hard-crashed and the consumer group rebalanced. The aggregator emitted a small percentage of duplicate sums in the seconds around the rebalance. The duplicates were within the engine's exactly-once guarantee for sink writes, but they violated the merchant-facing API's contract that revenue figures were monotonic. Two days of regulatory reports had to be reissued. The post-incident finding was that exactly-once at the engine level was not exactly-once at the system level, and the gap had been papered over with vendor marketing language. This lesson is about the engineering reality behind idempotency in streaming systems: where the guarantees actually exist, where they shade into effectively-once, and what infrastructure is required to make replay safe at the boundary between exactly-once-mattering domains (financial close) and at-least-once-fine domains (page view counters).

What you will be able to do

Distinguish exactly-once from effectively-once and identify which is achievable in which contexts

Apply two-phase commits, transactional outbox, and idempotent consumers as the building blocks of streaming idempotency

Design replay infrastructure that supports reprocessing from a known offset or timestamp without violating downstream contracts

Idempotency in Streaming Is Harder

Daily Life

Interviews

Recognize why streaming has no clean partition boundary and identify which streaming patterns are naturally idempotent versus which require explicit engineering.

Batch idempotency rests on a clean boundary: the partition. The pipeline owns a unit of work, the unit corresponds to a slice of the destination, and the slice can be replaced atomically. Streaming has no equivalent. Events arrive continuously; the destination is being written to continuously; there is no obvious moment at which to draw a boundary and say 'the work for this window is now complete and can be replaced.' Streaming idempotency exists, but it is engineered, not inherent, and the engineering is significantly more complex than the batch case.

Why the Partition Trick Does Not Apply

Property	Batch	Streaming
Boundary of one run	A clean partition (date, hour, fifteen-minute window)	No clean boundary; processing is continuous
Atomic write per run	INSERT OVERWRITE PARTITION; CREATE OR REPLACE	Each event write is independent; partition-level atomicity does not exist
Retry semantics	Rerun the partition; previous output replaced	Reprocess from an offset; previous output may already be downstream
Backfill	Loop over date partitions; each replaces its slice	Replay from a known offset; downstream must absorb duplicates

The At-Least-Once Default

Most streaming infrastructure defaults to at-least-once delivery. Kafka commits offsets after the consumer signals it has processed a batch, which means a consumer that crashes between processing and committing will reprocess the same events on restart. The same is true of Kinesis, Pub/Sub, and most queue systems. At-least-once is the easy guarantee to provide because it requires only that the broker not lose messages; the broker is not responsible for whether the consumer applies them once or twice. Idempotency on top of at-least-once is the work that turns the broker's easy promise into a system-level guarantee.

	# A naive Kafka consumer that is not idempotent
	# Simulating a crash between processing and offset commit

	counter = {'page_views': 0}
	events = [{'page_id': 'home', 'event_id': 'evt_1'}, {'page_id': 'home', 'event_id': 'evt_2'}]
	offset_committed = -1

	# First attempt: process evt_1, crash before commit
	counter['page_views'] += 1 # processed
	# offset_committed never updated; simulated crash

	# Restart: consumer resumes from last committed offset (-1), so evt_1 reprocesses
	counter['page_views'] += 1 # re-processed evt_1
	offset_committed = 0
	counter['page_views'] += 1 # processed evt_2
	offset_committed = 1

	print(f'Counter after at-least-once redelivery: {counter}')
	print(f'Expected count for 2 unique events: 2')
	print(f'Actual count: {counter["page_views"]} (one duplicate)')

The two operations (state update and offset commit) are not atomic. A crash between them produces a duplicate update on the next run. The fix is to make either the update idempotent (so the duplicate is harmless) or the pair atomic (so the crash never produces a half-state). Both fixes are non-trivial. Both are the subject of this lesson.

Three Sources of Streaming Duplicates

Where the duplicates come from:

▸Producer retries: a producer that does not get an ack republishes the same event
▸Consumer crashes between processing and offset commit
▸Consumer group rebalances: partition ownership shifts mid-batch and the new owner starts from the last committed offset

The Boundary Problem

Even when the engine guarantees exactly-once writes to its own state store (Flink's checkpointed state, Kafka Streams' internal topics), the boundary where state crosses into a different system is the place duplicates leak. Writing to a Postgres table, calling an external API, sending a message to another Kafka cluster: each is a boundary where the engine's guarantee ends and the next system's guarantee begins. The end-to-end guarantee is the weakest link in the chain, not the strongest.

Boundary	Engine Guarantee	End-to-End Guarantee
Flink internal state -> Flink internal state	Exactly-once via checkpoints	Exactly-once (single engine)
Flink -> Kafka sink	Exactly-once via two-phase commit (when configured)	Exactly-once if downstream consumer reads the committed offsets
Flink -> Postgres	Idempotent UPSERT or two-phase commit	Exactly-once only if the sink itself is idempotent
Flink -> external HTTP API	Engine cannot offer exactly-once; the API is unaware of the transaction	At-least-once at best; the API must be idempotent on its side

Where Streaming Idempotency Is Easier

Streaming idempotency is not always hard. Aggregations whose output is naturally a function of the entire input (a sum, a count, a max) can use stateful windowing with periodic checkpoints; the engine restores state on recovery and reprocesses only the events after the checkpoint. UPSERT-friendly destinations (key-value stores like Redis or DynamoDB, transactional databases with primary keys) absorb at-least-once events naturally because the second write is a no-op. Stateless transforms whose output is deterministic in the input (filter, project, enrich with reference data) are trivially idempotent. The hard cases are stateful transforms that emit to non-idempotent sinks, which is where exactly-once advertising and engineering reality diverge.

✓Easy Streaming Idempotency

Stateless filter, project, enrich operations
Aggregations with checkpointed state and idempotent sink
UPSERT-friendly destination keyed by event ID
Append-only log destination with deduplication on read

•Hard Streaming Idempotency

Stateful aggregations writing to a non-idempotent sink
Multi-system transactions (database AND queue AND external API)
Side effects with no idempotency token (sending email, charging a card)
Joins across two streams with different progress rates

Streaming has no clean partition boundary; idempotency is engineered rather than inherent.

At-least-once is the default; idempotency on the consumer side turns it into effective once.

End-to-end guarantees are bounded by the weakest link, which is almost always the cross-system boundary.

TIP

When evaluating a streaming system's idempotency, draw the boundary lines explicitly. Inside a single engine, exactly-once is often achievable. Across boundaries, the conversation is about effectively-once and which side carries the deduplication burden.

Exactly-Once vs Effectively-Once

Daily Life

Interviews

Distinguish exactly-once from effectively-once, audit a vendor claim against the actual deployment, and choose the right pattern for the system in question.

Exactly-once is one of the most loaded phrases in streaming. Vendor marketing has used it for so long that the engineering meaning has eroded. The honest framing: exactly-once is achievable inside a closed system where the engine controls every read, write, and offset commit. End-to-end exactly-once across systems is generally not achievable; what gets advertised under that name is more precisely called effectively-once, which is at-least-once delivery combined with idempotent consumers. The distinction is not pedantry; it determines what failure modes are possible and where the deduplication work lives.

Three Delivery Guarantees

Guarantee	Meaning	Typical Where
At-most-once	Each event is delivered zero or one times	Fire-and-forget logging; metrics where some loss is tolerable
At-least-once	Each event is delivered one or more times	Default for Kafka, Kinesis, Pub/Sub; strong guarantee that no event is lost
Exactly-once	Each event is processed once and only once at the boundary in question	Achievable inside a single engine; harder across boundaries

What Exactly-Once Actually Guarantees

Inside a single Flink job with checkpointed state and a transactional sink, exactly-once means that the effect of each input event on the engine's output is applied once even across failures. The mechanism is two-phase commit: the engine writes to the sink in a transaction, the transaction is committed only when the corresponding checkpoint completes, and a failure rolls back any uncommitted output. Kafka Streams provides a similar guarantee when both source and sink are Kafka topics participating in the same transaction. The guarantee is real, but it is bounded to the systems participating in the transaction.

What Effectively-Once Means

Effectively-once is the system-level outcome where each event affects the final state once, achieved by combining at-least-once delivery with idempotent processing. The broker may deliver an event multiple times. The consumer may apply it multiple times. The destination is shaped so that the second application has no effect (UPSERT on a primary key, dedup table keyed on event ID, monotonic counter that ignores duplicates). The end-state property is the same as exactly-once. The implementation is fundamentally different. The implementation difference matters because effectively-once requires the destination to participate; exactly-once requires the engine to participate. The honest characterization for most production deployments is 'effectively-once,' not the stricter engine-level claim.

Exactly-OnceEffectively-OnceAt-Least-Once

Exactly-Once

Engine-controlled

Two-phase commit between the engine and the sink. Bounded to systems that participate in the transaction. The engine bears the complexity.

Effectively-Once

Consumer-controlled

At-least-once plus idempotent processing on the consumer side. Works across system boundaries because the destination owns deduplication.

At-Least-Once

Broker-controlled

Default for Kafka, Kinesis, Pub/Sub. Guarantees no event loss. Duplicates are possible and must be handled downstream.

Reading Vendor Claims

When a vendor advertises exactly-once, the right question is 'across which boundary.' Kafka exactly-once is exactly-once across Kafka topics within a Kafka transaction; it does not cover writes to a non-Kafka sink. Flink exactly-once is exactly-once for the engine's state; it covers external sinks only when those sinks are configured to participate in two-phase commit, which is a non-trivial configuration that many teams skip. Spark Structured Streaming exactly-once depends on the sink supporting idempotent writes, typically via a checkpoint-aware connector. The marketing word is the same; the engineering reality is per-deployment.

Questions to ask of an exactly-once claim:

▸Across which boundaries does the guarantee hold?
▸What sinks are supported, and is the production sink one of them?
▸What configuration is required to enable the guarantee, and is it on?
▸What is the latency cost (two-phase commit adds 100ms or more per checkpoint)?
▸What is the failure mode if the guarantee is violated; is it visible or silent?

When Effectively-Once Is the Right Choice

Effectively-once is often the better engineering choice even when exactly-once is available. The complexity is lower because the destination owns deduplication. The latency is lower because there is no two-phase commit handshake. The system is more portable because it does not depend on engine-specific transaction semantics. The cost is the engineering work to make destinations idempotent, but that work is local to the destination and reusable across producers. A system with idempotent destinations and at-least-once delivery is operationally simpler than a system with exactly-once delivery to mutable destinations.

	MERGE INTO daily_revenue AS target
	USING (
	SELECT event_id, revenue_date, merchant_id, amount
	FROM stream_buffer
	) AS source
	ON target.event_id = source.event_id
	WHEN NOT MATCHED THEN INSERT (event_id, revenue_date, merchant_id, amount
	) VALUES (source.event_id, source.revenue_date, source.merchant_id, source.amount
	)
	/* An idempotent sink: UPSERT on the event_id */
	/* Combined with at-least-once delivery, produces effectively-once */

The MERGE inserts each event_id once. A duplicate delivery of the same event_id matches and does nothing. The destination is idempotent, the broker is at-least-once, and the system as a whole is effectively-once. No two-phase commit is required. The engineering investment is in maintaining the dedup index on event_id, which is small relative to the operational simplification.

•Exactly-Once via 2PC

Engine-level guarantee tied to specific connectors
100ms or more added latency per checkpoint
Configuration is fragile; downgrades are possible without alerts
Bounded to systems that support the engine's transaction protocol

✓Effectively-Once via Idempotent Sink

Destination owns deduplication via UPSERT or dedup keys
No transaction overhead; latency unchanged
Broker delivers at-least-once; configuration is the default
Works across any combination of producer and consumer systems

Exactly-once is a property of a closed system. Effectively-once is a property of an open system that has been engineered to absorb duplicates. The honest characterization of most production deployments labeled exactly-once is closer to 'effectively-once,' and naming it precisely makes the failure modes legible.

✓Do

Default to at-least-once delivery plus idempotent sinks for cross-system pipelines
Reserve engine-level exactly-once for closed-system patterns where the engine owns both source and sink
Audit exactly-once claims against the actual sink configuration in production

✗Don't

Believe the marketing word without naming the boundary it applies across
Configure two-phase commit on a sink that does not support it; the guarantee silently downgrades
Skip dedup logic on the destination because the broker promises exactly-once; downstream changes invalidate the assumption

2PC, Outbox, Idempotent Consumers

Daily Life

Interviews

Apply two-phase commit, transactional outbox, and idempotent consumers as the building blocks for streaming idempotency, and combine them across the boundaries of a real architecture.

Three patterns recur in streaming idempotency engineering: two-phase commit, transactional outbox, and idempotent consumers. Each addresses a specific source of duplicates. Each has a cost that constrains where it applies. A senior engineer reaches for the right one without confusing them, because the pattern that solves consumer-side duplicates does not solve producer-side ones, and vice versa. Naming each precisely is the prerequisite for combining them correctly.

Two-Phase Commit Across Systems

Two-phase commit is the protocol for atomically committing a transaction across two or more systems. Phase one: the coordinator asks each participant to prepare and durably stage the change. Phase two: if all participants prepared successfully, the coordinator tells each to commit; otherwise it tells each to abort. The protocol guarantees that all participants reach the same outcome, all committing or all aborting. Flink and Kafka Streams use two-phase commit between the engine's checkpoint and the sink, which is how engine-level exactly-once is implemented when it works.

Phase 1(PREPARE) : Coordinator -> Sink : 'Stage these writes; respond OK or ABORT' Sink -> Coordinator : 'OK'(writes are durable but invisible) Phase 2(COMMIT OR ABORT) : IF ALL OK : Coordinator -> Sink : 'Commit; make the writes visible' Otherwise : Coordinator -> Sink : 'Abort; discard the staged writes' The coordinator 's decision is durable;
a crash mid-phase-2 is recoverable because the decision was logged.'

Where Two-Phase Commit Breaks Down

The protocol assumes every participant supports the prepare-then-commit handshake and that the coordinator can durably log its decision. Modern systems often satisfy both: Kafka and Flink both implement participants for the protocol, and the coordinator's log lives in a fault-tolerant state store. Older systems often do not. An external HTTP API with no transaction concept cannot participate. A NoSQL store without transaction support cannot participate. When any participant cannot prepare-then-commit, two-phase commit is unavailable and the architecture falls back to either effectively-once (idempotent consumer plus at-least-once) or weaker guarantees.

The Transactional Outbox Pattern

The transactional outbox addresses the dual-write problem: an application that needs to write to a database and publish an event to a queue cannot do both atomically because there is no transaction that spans the two. The pattern is to write only to the database. The transaction includes a row in an outbox table alongside the business write. A separate process tails the outbox table and publishes its rows to the queue. The application gets atomicity (the database's transaction); the queue gets delivery (the outbox tailer). Crashes between the two stages are recoverable because the outbox row is durable, and the tailer can retry until the publish succeeds.

	/* Transactional outbox: business write and outbox write in one transaction */
	/* A separate publisher tails the outbox and emits to Kafka at-least-once */
	SELECT
	event_id,
	aggregate_id,
	event_type,
	payload,
	published_at IS NULL AS pending
	FROM outbox
	WHERE published_at IS NULL
	ORDER BY created_at
	LIMIT 10

The pattern guarantees that an event is published if and only if the database write committed. It does not guarantee that the event is published exactly once; the tailer is at-least-once, so consumers must dedup on event_id. The pattern combines with idempotent consumers to produce effectively-once across the application boundary, which is the strongest end-to-end property typically achievable in event-driven systems.

The Idempotent Consumer

An idempotent consumer is one that produces the same end state regardless of how many times it sees the same event. The implementation is usually a deduplication step keyed on a producer-stamped event_id. Before processing, the consumer checks whether the event_id has been seen; if so, it skips. After processing, the consumer records the event_id as processed. The dedup state lives in a fast key-value store (Redis, DynamoDB) or in the destination database itself. The consumer is now safe under at-least-once delivery: duplicates are recognized and dropped.

	# An idempotent consumer
	def handle(event):
	if dedup_store.has_seen(event.event_id):
	return # Already processed; safe to skip

	with destination.transaction():
	apply_event(event)
	dedup_store.mark_seen(event.event_id, ttl_days=7)

	# The transaction ensures that apply_event and mark_seen happen together.
	# A crash between them retries the event; the second attempt finds
	# nothing in dedup_store and processes again, which is safe because
	# apply_event itself is idempotent on the destination key.

How the Three Patterns Combine

Pattern	Solves	Cost
Two-phase commit	Engine-to-sink atomicity in a closed system	Latency (100ms+ per checkpoint); requires participant support
Transactional outbox	Atomicity between an application's database write and an event publish	Outbox table, tailer process, dedup-on-event-id downstream
Idempotent consumer	Duplicates from at-least-once delivery on the consumer side	Dedup state store; lookup cost per event

A System That Uses All Three

A real high-stakes streaming system often uses all three patterns. The application uses transactional outbox to publish events atomically with its database writes. The streaming engine uses two-phase commit to write to its internal state and a Kafka sink atomically. The downstream consumer uses idempotent consumption keyed on event_id to absorb any duplicates that escape either pattern. The combination is effectively-once at every boundary, which is the strongest guarantee a multi-system pipeline can practically offer. No single pattern is sufficient; the combination is what produces the system-level property.

•Naive Dual-Write

App writes to database AND publishes to Kafka in same handler
No atomicity; crash between writes leaks one or the other
Duplicates and lost events both possible
The single most common production bug in event-driven systems

✓Outbox + Idempotent Consumer

App writes to database; outbox row is part of the same transaction
Tailer publishes outbox rows to Kafka with at-least-once
Downstream consumer dedups on event_id
Effectively-once across the application boundary

Pattern selection rule:

▸Atomic engine-to-sink in a closed system: two-phase commit
▸Atomic application-database-to-queue in any system: transactional outbox
▸Tolerating duplicates on the consumer side: idempotent consumer
▸All three may apply in different parts of the same architecture

✓Do

Use the transactional outbox for any service that writes to its database and publishes events
Implement idempotent consumers on every cross-system event handler; the broker is at-least-once by default
Reserve two-phase commit for engines that natively support it; do not roll a custom 2PC

✗Don't

Use the dual-write anti-pattern; it breaks under any failure that crosses the two writes
Assume Kafka exactly-once covers writes to a non-Kafka sink; it does not
Skip the dedup state TTL; an unbounded dedup store eventually becomes the bottleneck

TIP

When a system claims exactly-once, identify which of the three patterns is providing it. If the answer is none of them, the claim is a misnomer and the system is silently at-least-once.

Source

source

Spark job

job

MERGE / overwrite by key

target

Idempotency = replace, do not append. Write by MERGE or partition-overwrite keyed on a business key, so re-running the job (a retry or a backfill) produces the same rows instead of doubling them.

Replay Infrastructure

Daily Life

Interviews

Design replay infrastructure that combines retained sources, addressable positions, and idempotent downstreams, and reason about the cost of each.

Replay is the streaming-world equivalent of backfill. It is the act of reprocessing events from a known offset or timestamp to correct downstream state. Replay is harder than batch backfill because there is no clean partition to overwrite, and easier because the source is often retained in a log that supports random access. Designing for replay requires three pieces of infrastructure: a retained source, addressable positions, and idempotent downstream consumers. Without all three, replay is a manual recovery operation. With them, replay is a feature.

What Replay Is For

Replay Trigger	Frequency	Typical Range
Bug in the consumer logic; reprocess to correct downstream state	Several times per year per active stream	Hours to weeks
New consumer being onboarded; needs history	Per consumer launch	Whatever retention allows; typically days to months
Source schema correction; downstream needs to reprocess	Occasional	Days to weeks
Disaster recovery; a downstream system needs to be rebuilt	Rare but inevitable	Full retained history

The Source Retention Requirement

Replay assumes the source still has the events to replay. Kafka retains messages by time (default seven days, often configured to thirty or more) or by total size. Kinesis retains messages for up to three hundred sixty five days. Pub/Sub has a configurable retention window. The retention window is the limit of how far back replay can reach. A pipeline that needs to support six-month replays must run on infrastructure with six-month retention. Retention is not free; it costs storage proportional to throughput times retention. The cost is real but usually justified by the operational value of replay being a feature rather than an emergency.

Addressable Positions

Replay requires the ability to start a new consumer (or rewind an existing one) to a specific position in the source. Kafka exposes consumer offsets per partition; replaying from offset N on partition P is a documented operation. Iceberg, Delta Lake, and Hudi expose snapshot IDs and timestamps as addressable points in their tables; reading from snapshot S provides a stable point-in-time view. The position can be a numeric offset, a timestamp, a transaction LSN, or a snapshot ID, but it has to exist as a first-class concept the source supports. Sources that do not expose addressable positions cannot be replayed in any meaningful sense.

	# Replay from a known Kafka offset
	from kafka import KafkaConsumer, TopicPartition

	consumer = KafkaConsumer(
	bootstrap_servers='kafka:9092',
	group_id='replay-2026-04-25', # New group ID prevents committing over production offsets
	enable_auto_commit=False,
	)

	partition = TopicPartition('orders', 0)
	consumer.assign([partition])
	consumer.seek(partition, 1450000) # Resume from the offset where the bug started

	for event in consumer:
	handle_idempotently(event)
	if event.offset >= 1500000:
	break # Replay window complete

The replay reads from offset 1,450,000 to 1,500,000, applies idempotent processing, and stops. The consumer group ID is new, so the production consumer's offsets are unaffected. The downstream destination dedups on event_id, so the replay does not produce double-applied effects. The whole operation is bounded, isolated, and reversible if it goes wrong, which is what makes it operationally safe.

Table Format Time Travel

Modern table formats (Iceberg, Delta Lake, Hudi) provide time travel as a first-class feature. Each commit to the table produces a snapshot identified by an ID and a timestamp. A query can address any retained snapshot directly. Replay against a table-format source becomes 'read the table as of snapshot S' or 'read the changes between snapshots S1 and S2'; the reads are stable, addressable, and free of the eventual-consistency artifacts that plague raw object storage. Time travel turns replay from a streaming-only concept into one that applies to any data lake table that uses a modern format.

	SELECT
	order_id,
	amount,
	status
	FROM orders FOR TIMESTAMP AS OF '2026-04-25 12:00:00'
	WHERE order_date = '2026-04-25'

The query reads the orders table as it existed at noon on April 25, regardless of how the table has changed since. Snapshot retention determines how far back the query can reach. The mechanism is the same as Kafka offsets in spirit: an addressable position the source exposes as a first-class operation. Time travel makes batch replay against table-format sources as ergonomic as streaming replay against Kafka.

Idempotent Downstreams: The Third Requirement

A retained source and addressable positions are useless if the downstream is not idempotent. Replay re-emits events the downstream has already processed, and a non-idempotent downstream double-counts every replayed event. The third requirement is that every consumer in the pipeline can absorb replays without producing duplicate effects. UPSERT-keyed destinations satisfy this naturally. Aggregations stored as full overwrites by partition satisfy it. Side effects (sending an email, charging a card, calling an external API) generally do not, and replay through those steps requires either skipping them or guarding them with their own dedup logic.

The replay infrastructure checklist:

▸Source retains history long enough to cover the longest expected replay (Kafka retention, table format snapshot retention)
▸Source exposes addressable positions (Kafka offsets, snapshot IDs, timestamps)
▸Every downstream consumer is idempotent on the event ID or the business key
▸Replay can run on isolated compute that does not interfere with production processing
▸Side effects (emails, charges, external API calls) are guarded against re-emission during replay

•No Replay Infrastructure

Source retention is whatever the broker defaults to
No way to start a consumer at a specific past position
Downstreams accumulate state non-idempotently
A bug-fix replay is a multi-week reconstruction project

✓Replay-Ready Architecture

Source retention is sized for the longest expected replay window
Consumers can be started at any retained offset or snapshot
Downstreams are idempotent; replay produces correct end state
A bug-fix replay is a one-day operation with bounded impact

Retained SourceAddressable PositionIdempotent Downstream

Retained Source

History to read from

Kafka topic retention; Iceberg/Delta snapshot retention. Sized to the longest expected bug-detection-to-replay window.

Addressable Position

Where to resume

Kafka offsets, Iceberg snapshot IDs, Delta version numbers. The source exposes the position as a first-class concept.

Idempotent Downstream

Safe to replay into

UPSERT keyed on event_id, partition overwrite by date, dedup table. The destination absorbs replays without producing duplicate effects.

✓Do

Size source retention to cover the longest realistic bug-detection-to-replay window, not the median
Use new consumer group IDs for replay so production offsets are not disturbed
Run replay on isolated compute capacity to protect production throughput

✗Don't

Replay through a pipeline whose downstream side effects are not idempotent; the email storm is not theoretical
Replay over a window large enough to cause source-side rate limits or back-pressure on the broker
Treat replay as a rare operation; design for it as a feature, not as an emergency

TIP

When a streaming system has no replay story, the on-call rotation is one bug away from a multi-week recovery. Investing in retention and idempotent downstreams early is cheaper than the first incident that exposes their absence.

Two Streaming Aggregators

Daily Life

Interviews

Design two streaming aggregators with opposite guarantee requirements, justify each architectural choice against the use case, and articulate when exactly-once is over-engineering.

The patterns become concrete on real workloads. Two streaming aggregators sit at opposite ends of the idempotency-cost spectrum. The first is a financial close aggregator that produces daily revenue numbers used in regulatory reporting; exactly-once is a correctness requirement, and the cost of getting it wrong is real money and real regulatory exposure. The second is a page view counter that powers a real-time engagement dashboard; at-least-once is sufficient, the dashboard tolerates noise, and the cost of full exactly-once would be wasted. The same architectural team would build the two systems differently. The exercise below walks through both.

Aggregator A: Financial Close (Exactly-Once Matters)

The system aggregates payment events from a Kafka topic into a daily revenue table that finance uses for the monthly close. Outputs go into a regulatory report that is filed with the SEC. Duplicates are not noise; they are misstated revenue, which carries legal liability. The system must produce identical output across replays and recoveries. The architecture is engineered top to bottom for exactly-once at the system level.

Component	Choice	Why
Source	Kafka with 90-day retention	Replay window must cover full quarter detection-to-correction cycle
Engine	Flink with checkpointed state and two-phase commit sink	Native exactly-once across engine state and the destination
Destination	Iceberg table with primary key on event_id	Idempotent UPSERT absorbs any duplicates that escape 2PC
Downstream	Daily snapshot report keyed on revenue_date	Reports are reproducible from the table at any past snapshot
Replay	Bounded window with new consumer group; idempotent destination dedupes	Bug-fix replays produce identical output to original runs

/ / Flink two - phase commit sink for the financial close aggregator / /(sketch ; production code uses Iceberg 's exactly-once sink)
DataStream<RevenueEvent> revenue = env
    .addSource(new FlinkKafkaConsumer<>(
        "payments",
        new RevenueDeserializer(),
        kafkaProps))
    .keyBy(RevenueEvent::getMerchantId)
    .window(TumblingEventTimeWindows.of(Time.days(1)))
    .aggregate(new RevenueAggregator());

revenue.addSink(IcebergSink.forRowData(rowData)
    .table(loader)
    .upsert(true)              // Idempotent UPSERT on primary key
    .tableSchema(schema)
    .toTwoPhaseCommitSink());  // Two-phase commit between checkpoint and sink'

The two-phase commit binds the engine's checkpoint to the sink's commit. A failure mid-checkpoint rolls back any uncommitted writes. The Iceberg destination is also idempotent on event_id, which provides defense in depth: even if the two-phase commit fails in some unforeseen way, duplicates are absorbed by the UPSERT. The retention window is generous because financial close bugs are sometimes detected weeks or months after the fact, and the replay window must cover detection-to-correction.

Aggregator B: Page View Counter (At-Least-Once Is Fine)

The system aggregates page view events into a per-page counter shown on a real-time engagement dashboard. The dashboard refreshes every five seconds. A few duplicate counts during a Kafka rebalance produce a few extra page views in the count for a few seconds. The dashboard's audience is product managers and growth analysts who care about trends, not absolute precision. The architecture is engineered for throughput and operational simplicity, not for engine-level exactly-once.

Component	Choice	Why
Source	Kafka with 7-day retention	Replay over a few days is the realistic upper bound; longer is overkill
Engine	Kafka Streams with at-least-once and idempotent counter	Operational simplicity; counters are inherently noisy on the millisecond scale
Destination	Redis counter keyed by (page_id, hour_bucket)	INCR is fast; small over-counts during failures are acceptable
Downstream	Dashboard reading the Redis counter every 5 seconds	Display layer; not used for accounting or billing
Replay	Not engineered; if the counter drifts, recompute from raw events as a batch job	Cost of full replay infrastructure exceeds the value of perfect counts

	# Page view counter consumer
	# At-least-once delivery; small over-counts are tolerable
	for event in consumer:
	bucket = event.timestamp.replace(minute=0, second=0, microsecond=0)
	redis.hincrby(f"pageviews:{bucket.isoformat()}", event.page_id, 1)
	consumer.commit(event.offset)

	# A consumer crash between hincrby and commit may double-count one event.
	# The dashboard refreshes every 5 seconds; the over-count is invisible.
	# A daily batch job recomputes from raw events to correct any drift.

The page view counter is intentionally simple. The INCR operation is non-idempotent in the strict sense: a duplicate event increments the counter twice. The system absorbs the imprecision because the use case tolerates it. A daily batch reconciliation pass replaces the streaming counter with the correct count from the immutable raw events; small streaming-side drifts are corrected in batch without the streaming pipeline needing exactly-once semantics. The pattern is sometimes called a Lambda architecture for this specific reason: streaming for low latency, batch for correctness, both writing to the same dashboard with batch winning at the daily boundary.

✓Aggregator A: Financial Close

Exactly-once at the system level is a hard requirement
Two-phase commit between Flink and Iceberg
90-day retention; bug-fix replays must reach back a quarter
UPSERT keyed on event_id provides defense in depth
Cost of architectural complexity is justified by regulatory exposure

✓Aggregator B: Page View Counter

At-least-once with bounded over-counting is acceptable
Kafka Streams with INCR to Redis counter
7-day retention; dashboard never looks back further
Daily batch reconciliation corrects any streaming drift
Cost of architectural simplicity is justified by use case tolerance

The Decision Framework

The choice between the two architectures is not a tools decision; it is a use-case decision. Three questions drive it. First, what is the cost of a duplicate effect? In financial close, the cost is regulatory exposure and reissued reports. In a page view counter, the cost is a few percent noise on a dashboard. Second, what is the latency budget? Two-phase commit adds 100ms or more per checkpoint, which a financial close pipeline can absorb but a real-time dashboard cannot. Third, what is the engineering cost ceiling? Exactly-once architectures take real engineering investment to design, build, and operate; the investment pays back only when the use case demands it.

When exactly-once at the system level is the right choice:

▸Outputs are used for accounting, billing, or regulatory reporting
▸Duplicates produce real money or real legal exposure, not just noise
▸Latency budget can absorb 2PC overhead (typically 100ms+ per checkpoint)
▸Team has the operational maturity to run a 2PC system in production

When at-least-once is the right choice:

▸Outputs power dashboards, analytics, or anomaly detection where small noise is fine
▸Bounded over-counting is correctable via a daily batch reconciliation pass
▸Latency budget is tight (sub-100ms end-to-end)
▸Team prefers operational simplicity over engine-level guarantees

The Closing Principle

Idempotency in streaming is not one technique; it is a portfolio of techniques chosen against the use case. The financial close aggregator and the page view counter share zero infrastructure choices and yet are both correct architectures for their respective problems. The senior engineering move is not to apply exactly-once everywhere reflexively; it is to ask which guarantee the use case actually requires and to invest the architectural complexity proportional to the answer. Over-engineering is its own bug, and exactly-once where at-least-once would suffice is one of its more expensive forms.

Match the guarantee to the cost of a duplicate effect; regulatory exposure earns 2PC, dashboard noise does not.

Two-phase commit adds latency and operational complexity; reserve it for use cases that need it.

Lambda-style batch reconciliation lets streaming approximations live alongside batch correctness on the same dashboard.

The concise statement of the principle: pick the weakest guarantee that the use case can tolerate, then engineer the system to deliver it reliably. Stronger guarantees are not free; they trade architectural complexity for properties the use case may not need.

✓Do

Match the idempotency guarantee to the use case; do not apply exactly-once where at-least-once suffices
Use Lambda-style batch reconciliation when streaming approximations are acceptable in the short window
Document the chosen guarantee in the pipeline contract; future maintainers need to know what was promised

✗Don't

Default to exactly-once everywhere; the operational cost is real
Believe at-least-once is fine when the downstream actually requires precision
Mix guarantees within a single pipeline without explicit boundary documentation; the chain is as strong as the weakest link

TIP

When designing a new streaming pipeline, write the guarantee requirement in the contract before choosing tools. The requirement determines the tool, not the other way around.

❯❯❯PUTTING IT ALL TOGETHER

> A payment processor at scale runs both kinds of streaming aggregators in production. The financial close aggregator was blamed last week for a duplicate-revenue incident during a Kafka rebalance; regulatory reports for two days had to be reissued. The page view counter has been running with at-least-once for two years without incident. The principal engineer is asked: 'How should the next generation of streaming pipelines at this company be designed so the financial close architecture is as solid as the page view counter is operable, without over-engineering either one?'

Step one: name the property each pipeline requires. The financial close needs effectively-once at the system level; the page view counter needs at-least-once with bounded drift. The naming determines every downstream choice. (Builds on the four-roles model from Lesson 1: source, transform, storage, consumer.)

Step two: design the financial close around three patterns: transactional outbox at the producer, two-phase commit from Flink to Iceberg as the engine-to-sink boundary, and idempotent UPSERT on event_id as defense in depth. The combination produces effectively-once across every boundary, which is the strongest practical end-to-end property.

Step three: design the page view counter for operational simplicity. Kafka Streams at-least-once into a Redis counter, with a daily batch reconciliation pass that recomputes from raw events. The architecture borrows from the layered shape (raw layer plus serving layer) introduced in Lesson 1 intermediate; streaming and batch coexist on the same source of truth.

Step four: invest in replay infrastructure regardless of the chosen guarantee. 90-day Kafka retention for the financial close, 7-day for the page view counter; addressable positions via offsets and Iceberg snapshots; idempotent downstreams keyed on event_id. Replay turns from an emergency operation into a feature. (Extends the orchestrator backfill pattern from Lesson 4 advanced into the streaming domain.)

Step five: store both pipelines in lakehouse storage with time travel (Lesson 3 advanced). The financial close uses Iceberg snapshots for reproducible quarterly reports; the page view counter uses Iceberg time travel as the canonical record against which the streaming approximation is reconciled. Storage choice is the foundation that makes replay and reconciliation work.

Step six: document each pipeline's guarantee in a contract that names producer, consumer, freshness SLA, idempotency strategy, and replay window. The contract converts the architectural decision into a binding artifact downstream consumers can rely on. (Extends the partition-overwrite and DELETE-then-INSERT idempotency contracts from Lesson 5 beginner and intermediate.)

KEY TAKEAWAYS

Streaming has no clean partition boundary: idempotency is engineered via two-phase commit, transactional outbox, and idempotent consumers, not inherited from a partition overwrite.

Exactly-once is closed-system; effectively-once is open-system: engine-level guarantees stop at the boundary the engine controls; cross-system guarantees come from idempotent destinations plus at-least-once delivery.

The transactional outbox solves the dual-write problem: applications write to a database and an outbox in one transaction; a tailer publishes outbox rows to the queue. Combined with idempotent consumers it produces effectively-once across the application boundary.

Replay infrastructure has three components: retained source (Kafka retention, table format snapshots), addressable positions (offsets, snapshot IDs), and idempotent downstreams. All three are required for replay to be a feature rather than an emergency.

Match the guarantee to the use case: exactly-once for financial close and regulatory reporting; at-least-once with batch reconciliation for dashboards and operational counters. Over-engineering is its own bug.

Streaming idempotency, exactly-once claims, and replay infrastructure separate marketing from engineering

Category: Pipeline Architecture
Difficulty: advanced
Duration: 38 minutes
Challenges: 0 hands-on challenges

Topics covered: Idempotency in Streaming Is Harder, Exactly-Once vs Effectively-Once, 2PC, Outbox, Idempotent Consumers, Replay Infrastructure, Two Streaming Aggregators

Lesson Sections

Idempotency in Streaming Is Harder (concepts: paIdempotency)
Batch idempotency rests on a clean boundary: the partition. The pipeline owns a unit of work, the unit corresponds to a slice of the destination, and the slice can be replaced atomically. Streaming has no equivalent. Events arrive continuously; the destination is being written to continuously; there is no obvious moment at which to draw a boundary and say 'the work for this window is now complete and can be replaced.' Streaming idempotency exists, but it is engineered, not inherent, and the engi
Exactly-Once vs Effectively-Once (concepts: paIdempotency)
Exactly-once is one of the most loaded phrases in streaming. Vendor marketing has used it for so long that the engineering meaning has eroded. The honest framing: exactly-once is achievable inside a closed system where the engine controls every read, write, and offset commit. End-to-end exactly-once across systems is generally not achievable; what gets advertised under that name is more precisely called effectively-once, which is at-least-once delivery combined with idempotent consumers. The dis
2PC, Outbox, Idempotent Consumers (concepts: paIdempotency)
Three patterns recur in streaming idempotency engineering: two-phase commit, transactional outbox, and idempotent consumers. Each addresses a specific source of duplicates. Each has a cost that constrains where it applies. A senior engineer reaches for the right one without confusing them, because the pattern that solves consumer-side duplicates does not solve producer-side ones, and vice versa. Naming each precisely is the prerequisite for combining them correctly. Two-Phase Commit Across Syste
Replay Infrastructure (concepts: paStreamProcessing)
Replay is the streaming-world equivalent of backfill. It is the act of reprocessing events from a known offset or timestamp to correct downstream state. Replay is harder than batch backfill because there is no clean partition to overwrite, and easier because the source is often retained in a log that supports random access. Designing for replay requires three pieces of infrastructure: a retained source, addressable positions, and idempotent downstream consumers. Without all three, replay is a ma
Two Streaming Aggregators (concepts: paIdempotency)
The patterns become concrete on real workloads. Two streaming aggregators sit at opposite ends of the idempotency-cost spectrum. The first is a financial close aggregator that produces daily revenue numbers used in regulatory reporting; exactly-once is a correctness requirement, and the cost of getting it wrong is real money and real regulatory exposure. The second is a page view counter that powers a real-time engagement dashboard; at-least-once is sufficient, the dashboard tolerates noise, and