Idempotency and Backfill: Intermediate
Idempotent writes, business keys, and explicit time bounds turn pipelines into replayable systems
- Category
- Pipeline Architecture
- Difficulty
- intermediate
- Duration
- 32 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Three Idempotent Write Patterns, Choosing a Business Key, Reads Can Be Non-Idempotent Too, The Backfill Operation, Refactor: From ETL to Idempotent
Lesson Sections
- Three Idempotent Write Patterns (concepts: paIdempotentWrites, paPartitionOverwrite)
Three write patterns cover the vast majority of idempotent batch pipelines. Partition overwrite replaces a slice of the destination table identified by a partition key. MERGE matches incoming rows to existing rows by a business key and updates or inserts as appropriate. DELETE-then-INSERT inside a transaction clears a logical slice and writes its replacement atomically. Each pattern has a niche, and a senior engineer reaches for the right one without thinking. Reaching for the wrong one produces
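A minimal sketch of two of these patterns, using Python's built-in sqlite3 module so it runs anywhere. The table names (daily_sales, orders) and helper functions are hypothetical. SQLite has no MERGE statement, so the second helper substitutes INSERT ... ON CONFLICT DO UPDATE (an upsert) to get the same match-by-key behavior, and the DELETE-then-INSERT helper stands in for partition overwrite by treating sale_date as the slice.

```python
import sqlite3


def replace_slice(conn, day, rows):
    """DELETE-then-INSERT: clear one day's slice and write its replacement atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


def upsert_orders(conn, rows):
    """Upsert keyed on order_id (stand-in for MERGE, which SQLite lacks)."""
    with conn:
        conn.executemany(
            "INSERT INTO orders (order_id, status) VALUES (?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, order_id TEXT, amount REAL)")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")

# Run the same load twice, as a retry or replay would.
for _ in range(2):
    replace_slice(conn, "2024-01-01", [("o1", 10.0), ("o2", 5.0)])
    upsert_orders(conn, [("o1", "paid"), ("o2", "refunded")])

sales_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
orders_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Both counts stay at two after the second run. A plain INSERT in either helper would have doubled the rows instead.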
- Choosing a Business Key (concepts: paBusinessKey, paSurrogateKey)
MERGE and deduplication both depend on a key that uniquely identifies each row. The right key seems obvious until the pipeline ingests its first edge case: an order with a null user_id; two events with the same timestamp; a CDC stream that emits one event for the row before the change and another for the row after it, both sharing the same primary key; a vendor that recycles IDs after a long enough interval; a soft delete that resurfaces the same logical row weeks later. Picking the wrong key turns an idempoten
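One way to guard against two of these edge cases (null key components and several events sharing a primary key) can be sketched as follows. The field names source, order_id, and version are assumptions made for illustration, as is the rule that the highest version wins.

```python
def business_key(event):
    """Build a composite key and refuse to proceed if any part is missing."""
    key = (event["source"], event["order_id"])
    if any(part is None for part in key):
        raise ValueError(f"null business key component in {event!r}")
    return key


def latest_per_key(events):
    """Keep one event per business key: the one with the highest version."""
    latest = {}
    for event in events:
        key = business_key(event)
        if key not in latest or event["version"] > latest[key]["version"]:
            latest[key] = event
    return latest


events = [
    {"source": "shop", "order_id": 1, "version": 1, "status": "pending"},
    {"source": "shop", "order_id": 1, "version": 2, "status": "paid"},
    {"source": "shop", "order_id": 2, "version": 1, "status": "pending"},
]
deduped = latest_per_key(events)
```

Feeding the same events in twice yields the same result, which is the property MERGE relies on. Raising on a null component is a deliberate choice in this sketch: a null silently collapsing many rows into one key is harder to debug than a failed run.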
- Reads Can Be Non-Idempotent Too (concepts: paExplicitTimeBounds, paEventVsProcessingTime)
Most discussions of idempotency focus on writes. The hidden second half is reads. A pipeline that reads non-deterministic input cannot be idempotent in any useful sense, because the same logical run produces different output on different days. The most common offenders are SELECT NOW(), CURRENT_DATE, and any function that resolves to wall-clock time inside a transform. Other offenders include random number generators without seeds, environment variables that drift between runs, and any external
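The usual remedy for wall-clock reads is to pass the time window in as a parameter. A minimal sketch, assuming events carry a timezone-aware event_time field and that windows are half-open UTC days (both are assumptions of this example, as are the function names):

```python
from datetime import date, datetime, timedelta, timezone


def daily_window(run_date):
    """Return the half-open UTC window [start, end) for a logical run date."""
    start = datetime(run_date.year, run_date.month, run_date.day, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def select_window(events, window_start, window_end):
    """Filter by explicit bounds; nothing here consults the current time."""
    return [e for e in events if window_start <= e["event_time"] < window_end]


events = [
    {"id": "a", "event_time": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)},
    {"id": "b", "event_time": datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)},
    {"id": "c", "event_time": datetime(2024, 1, 3, 0, 0, tzinfo=timezone.utc)},
]
start, end = daily_window(date(2024, 1, 2))
selected = [e["id"] for e in select_window(events, start, end)]
```

Because the run date is an argument, rerunning the 2024-01-02 window next week selects the same event. The half-open interval also ensures that an event exactly on a day boundary lands in exactly one window.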
- The Backfill Operation (concepts: paBackfill, paFullVsIncremental)
Backfill is the act of running a pipeline over historical date ranges, usually to fix data that was wrong or to populate a new pipeline with history it was not built to capture in real time. Backfill is the operational payoff of idempotency. A pipeline that is idempotent supports backfill almost for free: pass a different date range, run the pipeline, get the right answer. A pipeline that is not idempotent does not support backfill at all; running it on a historical date corrupts whatever data i
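The "pass a different date range" idea can be sketched as a loop over an idempotent per-day run. Here the destination is a plain dict keyed by day, standing in for a partitioned table, and run_for_day is a hypothetical pipeline step that overwrites its partition.

```python
from datetime import date, timedelta


def date_range(start, end):
    """Yield each date in the half-open range [start, end)."""
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)


def run_for_day(source, destination, day):
    """Idempotent daily run: recompute the day's total and overwrite its partition."""
    destination[day] = sum(amount for d, amount in source if d == day)


def backfill(source, destination, start, end):
    """Re-run the daily pipeline for every day in [start, end)."""
    for day in date_range(start, end):
        run_for_day(source, destination, day)


source = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 1), 5.0),
    (date(2024, 1, 2), 7.0),
]
destination = {date(2024, 1, 1): 999.0}  # a previously wrong value to be repaired

backfill(source, destination, date(2024, 1, 1), date(2024, 1, 3))
first_pass = dict(destination)
backfill(source, destination, date(2024, 1, 1), date(2024, 1, 3))  # safe to repeat
```

The second pass leaves the destination unchanged. Had run_for_day appended to the existing value, the backfill would have compounded the error with every run.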
- Refactor: From ETL to Idempotent (concepts: paIdempotentRefactor, paBackfill)
The patterns are clearer when applied to a real refactor. The pipeline below is a realistic daily ETL job that ingests payments from a Stripe-like API, joins them to customer accounts, and writes a daily payments fact table. The original version was written in a hurry and has every common idempotency bug at once. The refactored version applies the three patterns above: partition keys, MERGE on a business key, and explicit time bounds. The diff is the worked example. Refactors of this shape are com