Idempotency and Backfill: Intermediate

Idempotent writes, business keys, and explicit time bounds turn pipelines into replayable systems

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 32 minutes
Challenges: 0 hands-on challenges

Topics covered: Three Idempotent Write Patterns, Choosing a Business Key, Reads Can Be Non-Idempotent Too, The Backfill Operation, Refactor: From ETL to Idempotent

Lesson Sections

  1. Three Idempotent Write Patterns (concepts: paIdempotentWrites, paPartitionOverwrite)

    Three write patterns cover the vast majority of idempotent batch pipelines. Partition overwrite replaces a slice of the destination table identified by a partition key. MERGE matches incoming rows to existing rows by a business key and updates or inserts as appropriate. DELETE-then-INSERT inside a transaction clears a logical slice and writes its replacement atomically. Each pattern has a niche, and a senior engineer reaches for the right one without thinking. Reaching for the wrong one produces
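
    Two of these patterns can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the lesson's own code: the table daily_payments, its columns, and the functions replace_day and upsert_payments are hypothetical. SQLite has no MERGE statement and no table partitions, so INSERT ... ON CONFLICT DO UPDATE (SQLite 3.24 or later) stands in for MERGE, and the per-day DELETE-then-INSERT is the closest analogue shown here to a partition overwrite.

```python
import sqlite3


def replace_day(conn, day, rows):
    """DELETE-then-INSERT: clear one day's slice and rewrite it in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_payments WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_payments (day, payment_id, amount_cents) "
            "VALUES (?, ?, ?)",
            [(day, payment_id, amount) for payment_id, amount in rows],
        )


def upsert_payments(conn, day, rows):
    """Upsert keyed on payment_id (a stand-in for MERGE, which SQLite lacks)."""
    with conn:
        conn.executemany(
            "INSERT INTO daily_payments (day, payment_id, amount_cents) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(payment_id) DO UPDATE SET "
            "day = excluded.day, amount_cents = excluded.amount_cents",
            [(day, payment_id, amount) for payment_id, amount in rows],
        )


def demo():
    """Run each pattern three times and report (row count, total) after each."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_payments ("
        "day TEXT NOT NULL, "
        "payment_id TEXT PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL)"
    )
    rows = [("pay_1", 1000), ("pay_2", 2500)]
    summary = "SELECT COUNT(*), SUM(amount_cents) FROM daily_payments"

    for _ in range(3):
        replace_day(conn, "2024-01-01", rows)
    after_replace = conn.execute(summary).fetchone()

    for _ in range(3):
        upsert_payments(conn, "2024-01-01", rows)
    after_upsert = conn.execute(summary).fetchone()

    conn.close()
    return after_replace, after_upsert
```

    Calling demo() runs each write three times; the row count and total stay the same after every repetition, which is the property that makes a rerun safe.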

  2. Choosing a Business Key (concepts: paBusinessKey, paSurrogateKey)

    MERGE and deduplication both depend on a key that uniquely identifies each row. The right key seems obvious until the pipeline ingests its first edge case: an order with a null user_id; two events with the same timestamp; a CDC stream that emits one event for the row before the change and another for the row after it, both sharing the same primary key; a vendor that recycles IDs after a long enough interval; or a soft delete that resurfaces the same logical row weeks later. Picking the wrong key turns an idempoten
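
    Two of the edge cases above, the null key and the CDC stream whose events share a primary key, can be handled before the write. The sketch below is one possible approach, not the lesson's method: the ChangeEvent type, its lsn field (assumed to be a monotonically increasing log position), and the function latest_per_key are all hypothetical names.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class ChangeEvent(NamedTuple):
    order_id: Optional[str]  # business key; shared by every event for one row
    lsn: int                 # hypothetical log position, larger means newer
    status: str


def latest_per_key(events: Iterable[ChangeEvent]) -> Dict[str, ChangeEvent]:
    """Keep only the newest event per business key, rejecting null keys."""
    latest: Dict[str, ChangeEvent] = {}
    for event in events:
        if event.order_id is None:
            # A null key cannot identify a row, so fail loudly instead of
            # letting every keyless row collapse into (or miss) one match.
            raise ValueError("event has no business key: %r" % (event,))
        current = latest.get(event.order_id)
        if current is None or event.lsn > current.lsn:
            latest[event.order_id] = event
    return latest
```

    Because the winner is chosen by lsn rather than by arrival order, replaying the same batch, or receiving it in a different order, yields the same set of rows to merge.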

  3. Reads Can Be Non-Idempotent Too (concepts: paExplicitTimeBounds, paEventVsProcessingTime)

    Most discussions of idempotency focus on writes. The hidden second half is reads. A pipeline that reads non-deterministic input cannot be idempotent in any useful sense, because the same logical run produces different output on different days. The most common offenders are SELECT NOW(), CURRENT_DATE, and any function that resolves to wall-clock time inside a transform. Other offenders include random number generators without seeds, environment variables that drift between runs, and any external
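
    The difference between a wall-clock read and an explicitly bounded read can be sketched as follows. This is an illustration under assumptions: the event dictionaries, the event_time field, and the function names are hypothetical, and UTC day boundaries are chosen only for the example.

```python
from datetime import date, datetime, timedelta, timezone


def day_bounds(run_date):
    """Return the half-open UTC interval [start, end) that covers run_date."""
    start = datetime(
        run_date.year, run_date.month, run_date.day, tzinfo=timezone.utc
    )
    return start, start + timedelta(days=1)


def select_events(events, run_date):
    """Idempotent read: the same run_date always selects the same events."""
    start, end = day_bounds(run_date)
    return [e for e in events if start <= e["event_time"] < end]


def select_events_wall_clock(events):
    """Non-idempotent read: the result depends on the day the code runs."""
    return select_events(events, datetime.now(timezone.utc).date())


# Hypothetical sample input used to exercise the functions above.
EVENTS = [
    {"id": "a", "event_time": datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)},
    {"id": "b", "event_time": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)},
    {"id": "c", "event_time": datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)},
]
```

    select_events(EVENTS, date(2024, 1, 1)) selects events a and b no matter when it is called, whereas select_events_wall_clock(EVENTS) answers a different question each day. Passing the bounds in as parameters also lets a scheduler or an operator choose the window.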

  4. The Backfill Operation (concepts: paBackfill, paFullVsIncremental)

    Backfill is the act of running a pipeline over historical date ranges, usually to fix data that was wrong or to populate a new pipeline with history it was not built to capture in real time. Backfill is the operational payoff of idempotency. A pipeline that is idempotent supports backfill almost for free: pass a different date range, run the pipeline, get the right answer. A pipeline that is not idempotent does not support backfill at all; running it on a historical date corrupts whatever data i
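
    A backfill over an idempotent daily run reduces to a loop over dates. The sketch below assumes a hypothetical in-memory source (a list of (day, amount) pairs) and a destination dictionary keyed by day; the names date_range, run_for_day, and backfill are illustrative, not taken from the lesson.

```python
from datetime import date, timedelta


def date_range(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def run_for_day(source, dest, day):
    """Idempotent daily run: overwrite the day's slice with a fresh total."""
    dest[day] = sum(amount for d, amount in source if d == day)


def backfill(source, dest, start, end):
    """Re-run the daily pipeline for every day in [start, end]."""
    for day in date_range(start, end):
        run_for_day(source, dest, day)
```

    Because run_for_day replaces a day's value instead of adding to it, running backfill twice over the same range leaves the destination unchanged, and running it again after the source has been corrected rewrites only the requested days.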

  5. Refactor: From ETL to Idempotent (concepts: paIdempotentRefactor, paBackfill)

    The patterns are clearer when applied to a real refactor. The pipeline below is a realistic daily ETL that ingests payments from a Stripe-like API, joins them to customer accounts, and writes a daily payments fact table. The original version was written in a hurry and has every common idempotency bug at once. The refactored version applies the three patterns above: partition keys, MERGE on a business key, and explicit time bounds. The diff is the worked example. Refactors of this shape are com