Idempotency and Backfill: Advanced

Streaming idempotency, exactly-once claims, and replay infrastructure separate marketing from engineering

Category: Pipeline Architecture
Difficulty: advanced
Duration: 38 minutes
Challenges: 0 hands-on challenges

Topics covered: Idempotency in Streaming Is Harder, Exactly-Once vs Effectively-Once, 2PC, Outbox, Idempotent Consumers, Replay Infrastructure, Two Streaming Aggregators

Lesson Sections

  1. Idempotency in Streaming Is Harder (concepts: paStreamingIdempotency, paAtLeastOnce)

    Batch idempotency rests on a clean boundary: the partition. The pipeline owns a unit of work, the unit corresponds to a slice of the destination, and the slice can be replaced atomically. Streaming has no equivalent. Events arrive continuously; the destination is being written to continuously; there is no obvious moment at which to draw a boundary and say 'the work for this window is now complete and can be replaced.' Streaming idempotency exists, but it is engineered, not inherent, and the engi

  2. Exactly-Once vs Effectively-Once (concepts: paExactlyOnce, paEffectivelyOnce)

    Exactly-once is one of the most loaded phrases in streaming. Vendor marketing has used it for so long that the engineering meaning has eroded. The honest framing: exactly-once is achievable inside a closed system where the engine controls every read, write, and offset commit. End-to-end exactly-once across systems is generally not achievable; what gets advertised under that name is more precisely called effectively-once, which is at-least-once delivery combined with idempotent consumers. The dis
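    The definition of effectively-once given above (at-least-once delivery combined with idempotent consumers) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Event` shape, the producer-assigned `event_id`, and the simulated broker that redelivers every event once. No real streaming engine's API is shown.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical unique ID assigned by the producer
    amount: int


@dataclass
class IdempotentConsumer:
    """Applies each event's effect at most once, keyed on event_id."""
    seen: set = field(default_factory=set)
    total: int = 0

    def handle(self, event: Event) -> bool:
        # A redelivered event is recognised by its ID and skipped.
        if event.event_id in self.seen:
            return False
        self.seen.add(event.event_id)
        self.total += event.amount
        return True


def deliver_at_least_once(events, consumer):
    """Simulates a broker that delivers every event twice (one retry each)."""
    applied = 0
    for event in events:
        for _ in range(2):  # original delivery plus one redelivery
            applied += consumer.handle(event)
    return applied


events = [Event("e1", 10), Event("e2", 5), Event("e3", 7)]
consumer = IdempotentConsumer()
applied_count = deliver_at_least_once(events, consumer)
```

    Six deliveries reach the consumer, but only three change state. In a real deployment the seen-ID record and the state update would need to be committed atomically in the same store; this in-memory sketch sidesteps that.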

  3. 2PC, Outbox, Idempotent Consumers (concepts: paTwoPhaseCommit, paTransactionalOutbox, paIdempotentConsumer)

    Three patterns recur in streaming idempotency engineering: two-phase commit, transactional outbox, and idempotent consumers. Each addresses a specific source of duplicates. Each has a cost that constrains where it applies. A senior engineer reaches for the right one without confusing them, because the pattern that solves consumer-side duplicates does not solve producer-side ones, and vice versa. Naming each precisely is the prerequisite for combining them correctly. Two-Phase Commit Across Syste

  4. Replay Infrastructure (concepts: paReplay, paTimeTravel)

    Replay is the streaming-world equivalent of backfill. It is the act of reprocessing events from a known offset or timestamp to correct downstream state. Replay is harder than batch backfill because there is no clean partition to overwrite, and easier because the source is often retained in a log that supports random access. Designing for replay requires three pieces of infrastructure: a retained source, addressable positions, and idempotent downstream consumers. Without all three, replay is a ma
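    The three pieces listed above (a retained source, addressable positions, and idempotent downstream consumers) can be sketched together in a few lines. This is an in-memory illustration under stated assumptions: `RetainedLog`, `UpsertSink`, and the buggy-transform scenario are all hypothetical and stand in for a real log and a real keyed store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    offset: int  # addressable position in the log
    key: str
    value: int


class RetainedLog:
    """Retained source: records are kept and readable from any offset."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        record = Record(len(self._records), key, value)
        self._records.append(record)
        return record.offset

    def read_from(self, offset):
        return self._records[offset:]


class UpsertSink:
    """Idempotent sink: one row per source offset, overwritten on reprocessing."""

    def __init__(self):
        self.rows = {}

    def apply(self, record, transform):
        self.rows[record.offset] = (record.key, transform(record.value))

    def totals(self):
        out = {}
        for key, value in self.rows.values():
            out[key] = out.get(key, 0) + value
        return out


def replay(log, sink, from_offset, transform):
    """Reprocesses every record at or after from_offset; returns the count."""
    count = 0
    for record in log.read_from(from_offset):
        sink.apply(record, transform)
        count += 1
    return count


def to_cents(value):
    return value * 100


def buggy_to_cents(value):
    return value * 10  # hypothetical bug deployed at offset 2


log = RetainedLog()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("b", 4)]:
    log.append(key, value)

sink = UpsertSink()
for record in log.read_from(0):
    sink.apply(record, to_cents if record.offset < 2 else buggy_to_cents)
totals_before = sink.totals()

replayed = replay(log, sink, from_offset=2, transform=to_cents)
totals_after = sink.totals()
```

    Because the sink is keyed by source offset, running the same replay a second time leaves the totals unchanged. An append-only sink would instead double-count every replayed record.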

  5. Two Streaming Aggregators (concepts: paStreamingAggregatorDesign, paGuaranteeTradeoffs)

    The patterns become concrete on real workloads. Two streaming aggregators sit at opposite ends of the idempotency-cost spectrum. The first is a financial close aggregator that produces daily revenue numbers used in regulatory reporting; exactly-once is a correctness requirement, and the cost of getting it wrong is real money and real regulatory exposure. The second is a page view counter that powers a real-time engagement dashboard; at-least-once is sufficient, the dashboard tolerates noise, and