Idempotency and Backfill: Beginner

An e-commerce data team at a Series C company ran a nightly job that copied the day's orders from Postgres into Snowflake. One Tuesday at 2:14am the job failed halfway through after writing 180,000 of 250,000 rows. The on-call engineer hit retry. The script ran again from the start, wrote all 250,000 rows, and went green. Wednesday morning the finance team noticed revenue was up 41 percent. The 180,000 rows from the first attempt were still there, sitting next to the 250,000 from the retry. Two months of monthly close reports had to be reissued. The fix took three lines of code. The reputational damage took a quarter to repair. This lesson is about the property that would have prevented the incident: idempotency. A pipeline is idempotent when running it twice produces the same result as running it once. Without that property, every retry is a roll of the dice.

The Retry That Doubled the Rows

Daily Life
Interviews

Recognize the duplicate-row failure mode that occurs when an append-only pipeline is retried after a partial failure.

Pipelines fail. Networks blink, instances die, upstream APIs return 500s, the warehouse runs out of memory, a credential expires, a Spark executor gets evicted from a spot pool, an S3 bucket policy changes overnight, a DNS record propagates slowly. Failure is not the exception in production data pipelines; it is the background hum. The right response to a failed run is almost always to run it again. The wrong response is to run it again on a pipeline that does not handle being run twice. The wrong response produces duplicate rows, broken aggregates, and revenue charts that lie. Every senior data engineer has at least one story about a duplicate-row incident that was traced back to a routine retry on a pipeline that was never built to be retried, and most have several.

The Anatomy of the Failure

StageWhat HappenedState After
First attempt startsJob extracts 250,000 rows from PostgresLocal file holds the day's orders
First attempt halfwayJob writes 180,000 rows to Snowflake then crashes on row 180,001Snowflake holds 180,000 partial rows; the run is marked failed
Retry triggeredOn-call hits retry; job reads 250,000 rows from Postgres againLocal file now holds the same 250,000 orders
Retry completesJob appends all 250,000 rows to Snowflake; run is marked successSnowflake holds 430,000 rows for a day that had only 250,000 orders
The numbers are the symptom. The cause is the word append. The job appended new rows on every run rather than replacing the rows for the day it was processing. Append plus retry equals duplicates. The append-only pattern is the single most common idempotency bug in production data pipelines, and it is invisible until the day a run fails partway through. Every other day the pipeline runs to completion on the first attempt, the append behaves correctly, and there is no signal that anything is wrong. The bug is latent. It activates only when failure injects a partial-run state into the destination, which is rare per pipeline but inevitable across a fleet of hundreds of pipelines running thousands of times per week.
Three things that make this kind of bug expensive:
  • The duplicates pass all schema checks because the rows are individually valid
  • Aggregates downstream silently double, but the dashboards keep rendering
  • Detection is delayed by hours or days because nothing fails loudly

Why It Looks Right

The retry succeeded. The dashboards loaded. The schema was valid. There was no error to investigate because the system did exactly what it was told to do, twice. Append-and-retry pipelines fail silently because the failure mode is not a crash; it is correct execution of an incorrect program. Detection requires a count check, a uniqueness check, or a finance team noticing that a number looks high. None of those happen automatically without explicit instrumentation, and that instrumentation rarely exists in pipelines built quickly. The pipeline has no way to know that the rows it inserted on attempt one are still there. The orchestrator has no way to know either, because the orchestrator's only signal is the exit code of the most recent attempt. The destination has no way to know because each row is individually well-formed. Three layers of the system, three sources of plausible-deniability silence, and the only entity left to detect the bug is a human who decides to look at a number and notice it is wrong. Many such humans look later than the duplicate-row arrival, sometimes by months.
1# The pipeline that produced the bug. Looks reasonable; is not idempotent.
2rows = postgres.fetch_orders_for_date(run_date)
3for row in rows:
4 snowflake.insert(table="orders", row=row)
5# If this script crashes after some rows are inserted and is restarted,
6# the surviving rows from attempt 1 are not removed before attempt 2 runs.
Three lines. No error handling. No deduplication. No notion of a partition. Every retry adds another full copy of the day's orders on top of whatever was already there. The script is not malicious or careless; it is a perfectly ordinary first draft that nobody questioned because it worked the first hundred times it ran. The hundred-and-first time is when the lesson lands. The variant of this script that exists in every data engineering organization at some point in its history is the same shape with different table names; the bug is invariant, not specific to any one company or stack.
Append-Only Pipeline
  • Each run adds rows on top of whatever is already there
  • A retry after a partial failure produces duplicates
  • Backfill of an old date duplicates that day's data
  • The bug is invisible until something downstream notices
Idempotent Pipeline
  • Each run produces the same end state regardless of how many times it is invoked
  • Retries are safe; the destination ends up correct
  • Backfills replay history without corrupting it
  • Partial failures are recoverable by rerunning the job

Most data quality incidents are not caused by bad data arriving from upstream. They are caused by correct data being processed incorrectly downstream. Idempotency is the most common failure mode in that category.

alert
Append-only writes plus retries equal duplicates.
check
The bug is silent because each row is individually valid.
query
The fix is structural: change how the pipeline writes, not how it retries.
AppendReplace
Append
Adds rows on every run
INSERT INTO. Each invocation grows the destination. A retry after partial failure produces duplicates.
Replace
Sets state on every run
INSERT OVERWRITE or CREATE OR REPLACE. The destination matches the SELECT exactly. Retries produce identical state.
TIP
Whenever a new pipeline is being designed, the first question is not 'does it work end to end' but 'what happens if it runs twice on the same day.' Answering that question up front is cheaper than answering it after a finance audit.

Idempotency in One Sentence

Daily Life
Interviews

Define idempotency in one sentence and apply the two-run test to any piece of pipeline code.

Idempotency is one of those words that sounds harder than the idea it names. The mathematical definition is short: an operation is idempotent if applying it twice gives the same result as applying it once. The data engineering definition is even shorter, because the operation in question is always 'run the pipeline.' Running an idempotent pipeline twice produces the same end state as running it once. That is the entire concept. The word entered software engineering through HTTP, where GET, PUT, and DELETE are idempotent and POST is not, and migrated into data engineering as data systems borrowed the same vocabulary. The borrowing is fitting: the same property that makes HTTP retries safe makes pipeline retries safe.

The Definition, with Examples

OperationIdempotent?Why
Setting a light switch to offYesOff after one flip; off after two flips. State is the same.
Pressing a doorbellNoOne press makes the bell ring once; two presses make it ring twice.
Locking a car with the remoteYesLocked after one click; locked after two. The end state is locked.
Adding sugar to coffeeNoOne spoonful is sweet; two spoonfuls are sweeter. State accumulates.
Setting a thermostat to 68 degreesYesTarget is 68 after one set; 68 after two. Outcome is identical.
The pattern is clear. Operations that set a state are idempotent. Operations that increment, accumulate, or add are not. The same distinction applies to pipelines. A pipeline that sets a partition to a known value is idempotent. A pipeline that adds rows to a table is not. Whether the operation sets or adds is the property that matters, not the language, framework, or runtime. A Python script using Snowflake, a Spark job using Iceberg, and a dbt model using BigQuery can all be idempotent, and all be non-idempotent, depending entirely on the shape of the write each one performs. The choice of tool has nothing to do with the property; the choice of write does.
IdempotentNon-Idempotent
Idempotent
Setting state
Running once or running ten times produces the same final state. Examples: overwriting today's partition, an UPDATE that matches by a key (the intermediate tier covers MERGE), SET a configuration to a value.
Non-Idempotent
Accumulating change
Each invocation adds to what was there. Examples: INSERT a row, increment a counter, append a file, send an email.

Why the Word Matters

Naming the property is the first step toward designing for it. A team that does not have the word will treat each instance of the bug as a one-off. The order pipeline doubled. The events pipeline doubled. The payments pipeline doubled. Three separate fixes, three separate retrospectives. A team that has the word recognizes that all three are the same bug and reaches for the same fix. The vocabulary collapses N problems into one, which is what most engineering vocabulary does when it is doing its job. Idempotency is one of those rare data engineering terms whose meaning is the same in software engineering, distributed systems, and mathematics, which means borrowing the discipline from those fields is easy and the precision is preserved.

The Pipeline Version of the Definition

An idempotent pipeline satisfies all three:
  • Running the pipeline once and running it three times produce the same final state
  • A partial failure followed by a retry leaves no trace of the failed attempt
  • A backfill of an already-processed date does not duplicate or corrupt that date's data

The Test

The simplest way to ask whether a pipeline is idempotent is to run it twice in a row on the same input. If the destination ends up identical to running it once, the pipeline is idempotent. If the destination ends up different (more rows, different aggregates, a sent email that arrived twice), the pipeline is not idempotent. The test is mechanical and can be applied to any piece of pipeline code in a few seconds. The discipline is applying it before the code ships rather than after the first incident.
1# Mental model for the idempotency test
2# Pipeline 1: idempotent set-state operation
3# Pipeline 2: non-idempotent accumulate operation
4
5def pipeline_set_state(destination, value):
6 destination['daily_total'] = value
7 return destination
8
9def pipeline_accumulate(destination, value):
10 destination['daily_total'] = destination.get('daily_total', 0) + value
11 return destination
12
13# Run pipeline 1 three times with the same input
14d1 = {}
15for _ in range(3):
16 pipeline_set_state(d1, 100)
17print('Set-state, 3 runs:', d1)
18
19# Run pipeline 2 three times with the same input
20d2 = {}
21for _ in range(3):
22 pipeline_accumulate(d2, 100)
23print('Accumulate, 3 runs:', d2)

Idempotency is a property of the write, not a property added on top. The choice between INSERT, MERGE, and partition overwrite determines the property. Wrapping a non-idempotent write in retries does not make it idempotent.

TIP
Ask 'what happens if this runs twice' before writing any pipeline code. The answer is one of two things: 'the same result' or 'a bug nobody will catch for a month.' The asking is free; the catching is expensive.

Replace, Do Not Append

Daily Life
Interviews

Apply partition overwrite as the default idempotent write pattern for daily batch pipelines.

The simplest way to make a pipeline idempotent is to make it replace rather than append. Instead of writing 'add today's orders to the orders table,' the pipeline writes 'set the orders for today to exactly this set of rows.' Set is idempotent; add is not. The change is small and the implications are large, because nearly every batch pipeline can be expressed as a partition replace if the data is partitioned by run date. The mental shift is from thinking about the pipeline as something that contributes new rows to the destination to thinking about it as something that defines what the destination should look like for a given slice of time. Once the shift takes, every retry, every backfill, and every reprocess becomes safe by construction rather than by hope.

The Pattern: Partition Overwrite

A partition is a slice of a table identified by some key, almost always a date. The orders table partitioned by order date has one partition per day. The pipeline that processes April 25 reads the day's source data, computes the rows that belong in the April 25 partition, and writes them as a unit. If the partition already has data, that data is replaced. If the pipeline runs twice, the second run replaces the first run's output, and the table ends up with exactly one set of rows for April 25. If the pipeline fails halfway and is retried, the same logic applies. The partition is the unit of idempotency.
OperationFirst Run ResultSecond Run ResultIdempotent?
INSERT INTO orders SELECT ... WHERE order_date = '2026-04-25'Day's rows addedDay's rows added again on topNo
INSERT OVERWRITE orders PARTITION (order_date='2026-04-25') SELECT ...Partition contains day's rowsPartition replaced with same rows; identical stateYes
DELETE WHERE order_date='2026-04-25'; INSERT SELECT ... WHERE order_date='2026-04-25'Day's rows presentSame day's rows; old ones deleted, new ones insertedYes (when in a transaction)

The SQL Form

1
2CREATE OR REPLACE TABLE daily_orders AS
3SELECT
4 order_id,
5 order_date,
6 customer_id,
7 amount
8FROM staging_orders
9WHERE order_date = '2026-04-25'
The SELECT produces the same rows on every run because the source data for April 25 is fixed. Wrapping it in a CREATE OR REPLACE or an INSERT OVERWRITE makes the destination match the SELECT exactly. There is no path by which a retry leaves duplicates because the write is not additive; it is a replacement. The same pattern works in Snowflake (CREATE OR REPLACE TABLE), BigQuery (CREATE OR REPLACE TABLE), Spark (INSERT OVERWRITE), and Hive-style partitioned tables.

Why the Date Partition Is the Right Unit

Daily batch pipelines almost always process one day's data per run. The unit of work and the unit of storage match: each run touches exactly one date partition. That match is what makes partition overwrite the natural fit. A pipeline that runs hourly partitions by hour. A pipeline that runs every fifteen minutes partitions by fifteen-minute window. The unit of partition follows the unit of run. Picking the wrong unit is one of the few ways to make this pattern fail; if a pipeline runs hourly but partitions by day, an hourly retry overwrites the entire day's accumulated work. The mismatch is the single most common partition-overwrite bug, and the diagnosis is fast: count the number of partition writes per logical day and check whether the number matches the schedule's expected number of runs per day.
The partition-overwrite recipe:
  • Partition the destination table by the same time grain as the pipeline's run cadence
  • Each run computes the rows for its own partition only
  • The write is OVERWRITE or CREATE OR REPLACE, never INSERT
  • Other partitions are untouched, so concurrent runs on different dates do not interfere

The Failure Case Walked Through

1# A daily run that fails halfway, then retries
2for attempt in [1, 2]:
3 rows = compute_orders_for(run_date='2026-04-25')
4 snowflake.execute(""" CREATE OR REPLACE TABLE staging.daily_orders_2026_04_25 AS SELECT * FROM table(?) """, rows)
5 if attempt == 1:
6 raise CrashSimulated() # First attempt crashes
7 snowflake.execute(""" INSERT OVERWRITE INTO orders PARTITION(order_date='2026-04-25') SELECT * FROM staging.daily_orders_2026_04_25 """)
The first attempt crashes after creating the staging table but before the partition overwrite. The second attempt re-creates the staging table from scratch (CREATE OR REPLACE makes the staging step idempotent too) and then performs the partition overwrite. The final state has exactly one set of rows for April 25, regardless of how many times the script crashed and restarted. Every step in the chain is idempotent on its own, which makes the pipeline as a whole safe to retry. Chain safety is the invariant: a pipeline composed of idempotent steps is itself idempotent, while a pipeline with even one non-idempotent step in the chain is non-idempotent overall. The discipline is to apply the property at every step, not at one chosen step.
Naive Append
  • INSERT INTO orders SELECT ... WHERE order_date = today
  • Each retry adds rows on top
  • Crashes mid-write leave partial garbage in the table
  • Backfilling April 1 today appends April 1 again
Partition Overwrite
  • INSERT OVERWRITE orders PARTITION(order_date) SELECT ...
  • Each retry replaces the partition; final state is identical
  • Crashes mid-write are recovered by retry; the next run fully replaces
  • Backfilling April 1 today replaces April 1; no duplication
Do
  • Partition every batch destination table by the pipeline's run cadence (day, hour, fifteen minutes)
  • Use INSERT OVERWRITE or CREATE OR REPLACE rather than INSERT for the final write
  • Make every intermediate step idempotent on its own; the chain is only as safe as its weakest link
Don't
  • Use INSERT INTO followed by 'maybe DELETE the old rows' as the strategy; the maybe is the bug
  • Partition by a grain coarser than the run cadence (hourly job writing to daily partition)
  • Assume retries are safe because they have not caused trouble yet; trouble arrives on partial failure
Source
source
Spark job
job
MERGE / overwrite by key
target

Idempotency = replace, do not append. Write by MERGE or partition-overwrite keyed on a business key, so re-running the job (a retry or a backfill) produces the same rows instead of doubling them.

Why Retries Need Idempotency

Daily Life
Interviews

Explain why automatic retries are unsafe without idempotency and trace the duplicate accumulation that occurs when the property is missing.

Retries are how pipelines survive the noisy reality of distributed systems. A network blip, a brief warehouse contention spike, an upstream rate limit triggered by a sudden traffic surge, a transient AZ outage, a Spark task failing because its executor lost a heartbeat. None of these are bugs; they are weather. A retry the next minute almost always succeeds. Orchestrators ship with retry support built in because retries are that fundamental to operations; turning them off would mean paging a human on every blip, which no team can sustain. The catch is that retries are only safe on idempotent pipelines. On non-idempotent pipelines, every retry creates new bugs faster than the original failures it was supposed to recover from.

What Retries Are For

Failure ModeTypical FrequencyWhat a Retry Does
Network timeout to upstream APISeveral times per week per pipelineSecond attempt usually succeeds; the network blip has passed
Warehouse query timeout under loadA few times per month per heavy jobSecond attempt runs when load has cleared
Spot instance preempted mid-executionDaily on a large fleetJob restarts on a different instance and completes
Source database temporarily unreachableOccasional, depending on infra reliabilitySecond attempt connects after the source recovers
Every modern orchestrator (Airflow, Dagster, Prefect) provides automatic retries with configurable counts and backoff. The default retry count is typically two or three. The default backoff is exponential. The defaults exist because they work for the vast majority of failures, which are transient by nature. The defaults assume something the orchestrator cannot verify on its own: that the pipeline being retried can handle being retried.

What a Retry Looks Like Under the Hood

1# What an orchestrator does behind the scenes
2def run_with_retries(task, max_attempts=3, backoff_seconds=60):
3 for attempt in range(1, max_attempts + 1):
4 try:
5 return task.execute()
6 except TransientError as e:
7 if attempt == max_attempts:
8 raise
9 wait = backoff_seconds * (2 ** (attempt - 1))
10 time.sleep(wait)
11 # The orchestrator does not check whether the task is idempotent.
12 # The task author is responsible for that property.
The retry loop is mechanical. The orchestrator runs the task, catches a transient error, waits, and runs the task again. There is no inspection of what the task does, no rollback of partial work, no detection of duplicate writes. The orchestrator assumes the task author has built the task to handle being run repeatedly. When that assumption fails, every retry compounds the original problem.

Retries on a Non-Idempotent Pipeline

What goes wrong:
  • Attempt one writes 100 rows then fails. The 100 rows persist.
  • Attempt two writes the same 100 rows plus the next 200, then fails. 300 partial rows persist.
  • Attempt three writes all 300 rows. Total: 100 + 300 + 300 = 700 rows for what should be 300.
  • The orchestrator marks the run successful because the final attempt did not raise.
The orchestrator is doing exactly what it was told to do. The pipeline is doing exactly what it was told to do. The combination is wrong because the pipeline did not declare 'each retry must replace the previous attempt's output' and the orchestrator does not enforce that contract on its own. The bug is in the gap between the two assumptions.

The Cost of Disabling Retries

A team that has been burned by a duplicate-row incident sometimes responds by disabling retries. Now every transient failure becomes a 3am page. The on-call engineer wakes up, hits retry manually, and goes back to sleep. The fix shifts the work from the machine to a human, but the underlying issue is unchanged: the pipeline is not idempotent. Disabling retries treats the symptom; making the pipeline idempotent treats the cause. The right move is the second one. The first move trades engineering quality for engineer sleep quality, and both end up worse.
Disable Retries
  • Every transient failure becomes a manual page
  • On-call rotation degrades; retention drops
  • The bug is hidden, not fixed; the next non-transient failure still corrupts data
  • Backfills still cause duplicates because they are retries by another name
Make the Pipeline Idempotent
  • Transient failures are absorbed automatically without paging
  • On-call only wakes for genuine problems with named owners
  • The bug is structurally impossible; partial failures self-heal on retry
  • Backfills are safe because the pipeline replaces rather than appends

The Two-Way Contract

Idempotency is a contract between the pipeline and the orchestrator. The orchestrator promises to retry on failure. The pipeline promises that retries are safe. When both sides hold up the contract, transient failures are invisible to the consumer. When either side fails, the consumer notices. Building idempotency into the pipeline is the discipline that makes the contract real, and it is the prerequisite for every higher-order operation in this curriculum: backfills, reprocessing, schema migrations, even some quality remediation. None of those work on a non-idempotent foundation. The contract is also a load-bearing assumption in tooling: dbt's incremental models assume idempotency, Airflow's catchup assumes it, Dagster's partitioned assets assume it. The frameworks are designed around the contract, and pipelines that violate it surprise the framework, which is rarely a pleasant kind of surprise.

Idempotency is a precondition, not a feature. Every other operational property of a pipeline (safe retries, backfills, reprocessing, recovery) builds on top of it. A non-idempotent pipeline cannot be operated; it can only be hoped at.

alert
Retries are the default response to transient failure; orchestrators ship with them on.
check
Retries are safe only when the pipeline is idempotent; the orchestrator does not enforce that property.
query
Disabling retries treats the symptom and exhausts the on-call; making the pipeline idempotent treats the cause.
TIP
When configuring retries on a new task, ask 'is this task idempotent' before setting the retry count. If the answer is no, fix the task; do not configure around the missing property.

Side by Side: Idempotent vs Not

Daily Life
Interviews

Compare a non-idempotent pipeline against its idempotent fix line by line and explain which lines carry the property.

The clearest way to internalize the property is to read two short pipelines side by side. One is non-idempotent. The other does the same job idempotently. The diff is small. The behavioral difference under retry is enormous. The exercise below walks through both, line by line, and names what changes. Reading two side-by-side examples is faster than reading a hundred lines of explanation because the diff is the explanation. The asymmetry is the lesson: idempotency is a small change in code that produces a large change in operational behavior, and seeing the small change directly is the cheapest path to remembering it.

Pipeline A: The Bug

1# Pipeline A: aggregates daily orders into a summary table
2# This pipeline is not idempotent
3
4def run(run_date):
5 rows = postgres.fetch(f""" SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total FROM orders WHERE order_date = '{run_date}' GROUP BY customer_id """)
6 for row in rows:
7 snowflake.execute(
8 "INSERT INTO daily_customer_summary VALUES (?, ?, ?, ?)",
9 (run_date, row.customer_id, row.order_count, row.total),
10 )
Pipeline A reads the day's orders, groups by customer, and inserts one row per customer into the summary table. It is a perfectly reasonable first pass. It also fails the two-run test. Run it twice on the same date and the table contains two copies of every customer's summary for that date. Run it three times and there are three copies. The aggregation logic is correct; the write semantics are not. The aggregation could be ported to Spark, to dbt, to a serverless function, to an in-memory script, and the bug would survive every port unchanged because the bug lives in the write semantics, not in the language. Recognizing that the bug is portable is the senior insight; treating it as portable in the fix is the senior discipline.

Pipeline B: The Fix

1# Pipeline B: same logic, idempotent write
2# Running this twice produces the same final state
3
4def run(run_date):
5 rows = postgres.fetch(f""" SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total FROM orders WHERE order_date = '{run_date}' GROUP BY customer_id """)
6 snowflake.execute(
7 "DELETE FROM daily_customer_summary WHERE summary_date = ?",
8 (run_date,),
9 )
10 snowflake.execute_many(
11 "INSERT INTO daily_customer_summary VALUES (?, ?, ?, ?)",
12 [(run_date, r.customer_id, r.order_count, r.total) for r in rows],
13 )
14 snowflake.commit()
Pipeline B differs from Pipeline A in three ways: the writes happen inside a transaction, a DELETE clears the partition before the INSERT, and the INSERT happens once for all rows rather than row by row. The DELETE plus INSERT pattern is the second of the three idempotent write patterns the intermediate tier covers; the partition overwrite from the previous section is the first. Both produce the same final state, and both make retries safe. The choice between them depends on the destination's capabilities. Postgres supports DELETE-then-INSERT in a transaction natively; Snowflake supports both DELETE-then-INSERT and INSERT OVERWRITE; BigQuery and Iceberg favor INSERT OVERWRITE. The pattern follows the destination, and reading documentation for the specific destination at the start of a project is worth the few minutes it takes.
PropertyPipeline APipeline B
Aggregation logicCorrectCorrect (identical)
Write strategyINSERT onlyDELETE then INSERT inside a transaction
Result of running onceOne row per customer for the dateOne row per customer for the date
Result of running twiceTwo rows per customer for the dateOne row per customer for the date
Result of partial failure plus retrySome customers duplicated, others notIdentical to a clean run
Safe to use with orchestrator retriesNoYes

Why the Transaction Matters

The DELETE and the INSERT in Pipeline B run inside a transaction. If the INSERT fails halfway, the transaction rolls back, and the table returns to its state before the DELETE. The destination never observes a state where the day's old rows are gone but the new rows have not arrived. Without the transaction, a crash between DELETE and INSERT would leave the table missing data for the day, which is its own kind of bug. The transaction is what makes the two-step pattern safe end to end. The same logic applies in Snowflake (where transactions are explicit), in Postgres (where they are explicit), and in BigQuery (where multi-statement transactions are supported). The transaction wrapper is non-optional even when it adds visual noise to the code.
Three rules for the DELETE-then-INSERT pattern:
  • Both statements run inside the same transaction
  • The DELETE filters by the same partition key the INSERT writes to
  • The transaction commits only when both statements succeed

Even Simpler: CREATE OR REPLACE

1# A daily customer summary pipeline using CREATE OR REPLACE
2# This is the simplest idempotent shape when the destination is a single date partition
3
4run_date = '2026-04-25'
5
6source_rows = [
7 {'customer_id': 'cust_801', 'order_count': 3, 'total': 142.50},
8 {'customer_id': 'cust_802', 'order_count': 1, 'total': 67.00},
9 {'customer_id': 'cust_803', 'order_count': 2, 'total': 89.75},
10]
11
12# Replace operation: state at end is the source rows, regardless of prior state
13daily_summary = {(run_date, r['customer_id']): r for r in source_rows}
14
15for key, row in daily_summary.items():
16 print(key, row)
17
18print(f"\nTotal rows for {run_date}: {len(daily_summary)}")
19print("Run this script twice; the result is identical.")
The Python sketch shows the property in miniature. The dictionary mapping from key to row is the destination. Each run constructs the destination from the source rows. Running once or running ten times produces the same dictionary. The same shape, applied at warehouse scale, is the partition overwrite pattern. The conceptual move is small. The discipline of always using it is the difference between an operational pipeline and a hopeful one. The miniature is also worth running mentally on every new piece of pipeline code; if the code does not pass the same test the dictionary passes, the code is not idempotent yet.

What the Diff Costs

Cost DimensionPipeline A (Append)Pipeline B (Replace)
Lines of code1316
Compute per runOne INSERT per rowOne DELETE plus one batched INSERT
Operational reliabilityBrittle under retrySafe under retry and backfill
Cost of a partial failure incidentHours of investigation, possible data correctionHit retry; the next run cleans up
Three more lines of code. One delete operation per run instead of zero. The marginal cost is small. The marginal benefit is that the pipeline can be operated rather than hoped at. Treat the diff as the price of admission for any pipeline that writes to a shared destination, and adopt it everywhere by default. The cases where partition overwrite is wrong are rare and specific; the intermediate tier covers them in detail. Adopting the diff as a team-wide default is also more sustainable than adopting it case by case. Defaults are easier to maintain across many engineers than discretionary practices, because defaults survive context switches and code reviews while discretionary practices erode.
Do
  • Read every batch pipeline as either Pipeline A or Pipeline B; convert all the As to Bs
  • Wrap DELETE-then-INSERT in a transaction; never run the two statements separately
  • Default to CREATE OR REPLACE or INSERT OVERWRITE when the destination supports it
Don't
  • Add INSERT-only writes to a destination that already accepts other writes; coexistence multiplies bugs
  • Skip the transaction wrapper because 'the DELETE and INSERT are right next to each other'
  • Treat the three extra lines as boilerplate to remove; they are the property
TIP
When reviewing a teammate's pipeline PR, the first thing to check is the write statement. If it begins with INSERT INTO and there is no preceding DELETE or OVERWRITE, ask the question: what happens if this runs twice.
PUTTING IT ALL TOGETHER

> A junior engineer at a Series B startup writes the company's first nightly pipeline. It reads new orders from Postgres and inserts them into a Snowflake summary table. The pipeline runs successfully every night for two months. On the sixty-third night, the warehouse query times out at row 180,000 of 250,000. The orchestrator retries automatically. The next morning the finance team reports revenue is up 41 percent. The CTO asks the data team to prevent this failure mode from ever recurring.

Name the property that was missing: idempotency. Running the pipeline once and running it twice produced different results because the writes were INSERT-only. Naming the property is the first step toward fixing it everywhere.
Apply the simplest fix: replace, do not append. The destination table is partitioned by date. Each run uses INSERT OVERWRITE for its own partition. Retries replace the previous attempt; duplicates are structurally impossible.
Audit the rest of the pipelines for the same shape. Every INSERT-only batch pipeline has the same bug waiting. Convert each one to partition overwrite or DELETE-then-INSERT inside a transaction. The audit takes a week and prevents the next dozen incidents.
Keep retries enabled. Idempotency is the precondition that makes retries safe; disabling them shifts the work to humans and makes the on-call rotation worse. The right contract is 'orchestrator retries on failure; pipeline tolerates retries.' Both halves are required.
KEY TAKEAWAYS
Idempotency in one sentence: running the pipeline twice produces the same final state as running it once. The two-run test is mechanical and fast.
Append-only writes plus retries equal duplicates: the bug is silent because each row is individually valid. Detection requires explicit instrumentation that rarely exists.
Replace, do not append: INSERT OVERWRITE or CREATE OR REPLACE makes a partition match the SELECT exactly. Retries are safe; backfills are safe.
Retries are a two-way contract: the orchestrator promises to retry; the pipeline promises retries are safe. Both halves are required for the contract to work.
Idempotency is a property of the write, not a feature added later: the choice between INSERT, OVERWRITE, and DELETE-then-INSERT determines whether retries are safe. Every other operational property builds on top.

Running the same pipeline twice should produce the same result, not double the rows

Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: The Retry That Doubled the Rows, Idempotency in One Sentence, Replace, Do Not Append, Why Retries Need Idempotency, Side by Side: Idempotent vs Not

Lesson Sections

  1. The Retry That Doubled the Rows (concepts: paIdempotency)

    Pipelines fail. Networks blink, instances die, upstream APIs return 500s, the warehouse runs out of memory, a credential expires, a Spark executor gets evicted from a spot pool, an S3 bucket policy changes overnight, a DNS record propagates slowly. Failure is not the exception in production data pipelines; it is the background hum. The right response to a failed run is almost always to run it again. The wrong response is to run it again on a pipeline that does not handle being run twice. The wro

  2. Idempotency in One Sentence (concepts: paIdempotency)

    Idempotency is one of those words that sounds harder than the idea it names. The mathematical definition is short: an operation is idempotent if applying it twice gives the same result as applying it once. The data engineering definition is even shorter, because the operation in question is always 'run the pipeline.' Running an idempotent pipeline twice produces the same end state as running it once. That is the entire concept. The word entered software engineering through HTTP, where GET, PUT,

  3. Replace, Do Not Append (concepts: paIdempotency)

    The simplest way to make a pipeline idempotent is to make it replace rather than append. Instead of writing 'add today's orders to the orders table,' the pipeline writes 'set the orders for today to exactly this set of rows.' Set is idempotent; add is not. The change is small and the implications are large, because nearly every batch pipeline can be expressed as a partition replace if the data is partitioned by run date. The mental shift is from thinking about the pipeline as something that cont

  4. Why Retries Need Idempotency (concepts: paIdempotency)

    Retries are how pipelines survive the noisy reality of distributed systems. A network blip, a brief warehouse contention spike, an upstream rate limit triggered by a sudden traffic surge, a transient AZ outage, a Spark task failing because its executor lost a heartbeat. None of these are bugs; they are weather. A retry the next minute almost always succeeds. Orchestrators ship with retry support built in because retries are that fundamental to operations; turning them off would mean paging a hum

  5. Side by Side: Idempotent vs Not (concepts: paIdempotency)

    The clearest way to internalize the property is to read two short pipelines side by side. One is non-idempotent. The other does the same job idempotently. The diff is small. The behavioral difference under retry is enormous. The exercise below walks through both, line by line, and names what changes. Reading two side-by-side examples is faster than reading a hundred lines of explanation because the diff is the explanation. The asymmetry is the lesson: idempotency is a small change in code that p