Idempotency and Backfill: Intermediate

A logistics company at scale ran 400 nightly batch pipelines. The product team needed a corrected revenue figure for a single quarter after a billing system migration. The data engineering team estimated three weeks. The finance team had budgeted three days. The reason for the gap was that two thirds of the pipelines were not idempotent. Backfilling each one required custom logic, manual cleanup, and per-pipeline review. The next quarter, the data engineering team did the structural work: every pipeline got partition keys, every write became MERGE or INSERT OVERWRITE, every transform accepted explicit time bounds rather than reading the wall clock. The next backfill took six hours. The shift was not in the data; it was in the property the pipelines now had. This lesson walks through the three idempotent write patterns, the business-key choices that make them work, and the explicit time-bound discipline that distinguishes a replayable pipeline from a one-shot script.

What you will be able to do

Choose between partition overwrite, MERGE, and DELETE-then-INSERT based on the destination and the data shape

Pick a business key that makes MERGE deterministic and explain natural versus surrogate versus composite

Apply explicit time bounds to transforms so backfills produce the same answer as live runs

Three Idempotent Write Patterns

Daily Life

Interviews

Choose between partition overwrite, MERGE, and DELETE-then-INSERT based on the destination shape and the run's unit of work.

Three write patterns cover the vast majority of idempotent batch pipelines. Partition overwrite replaces a slice of the destination table identified by a partition key. MERGE matches incoming rows to existing rows by a business key and updates or inserts as appropriate. DELETE-then-INSERT inside a transaction clears a logical slice and writes its replacement atomically. Each pattern has a niche, and a senior engineer reaches for the right one without thinking. Reaching for the wrong one produces pipelines that are technically idempotent but operationally awkward. The names of the three patterns are not standardized across the industry; one team's UPSERT is another team's MERGE is another team's DELETE-then-INSERT is another team's INSERT OVERWRITE. The conceptual shapes below are stable across the naming variations, and recognizing the shape is the senior skill, not knowing the vendor's preferred verb.

Pattern 1: Partition Overwrite

Partition overwrite is the cleanest pattern when the data is naturally partitioned by time and each pipeline run is responsible for a single partition. The pipeline computes the rows for its partition, and the destination supports a write mode that replaces the partition atomically. Every modern table format supports this: Snowflake, BigQuery, Spark on Hive, Iceberg, Delta. The pattern is the default for daily batch pipelines that ingest into a curated layer.

	INSERT OVERWRITE TABLE daily_orders PARTITION(order_date = '2026-04-25')
	SELECT
	order_id,
	customer_id,
	amount,
	status
	FROM staging_orders
	WHERE order_date = '2026-04-25'

Pattern 2: MERGE on a Business Key

MERGE is the right choice when each incoming row has a stable identity across runs and the destination tracks current state rather than historical snapshots. The classic case is a slowly changing dimension: the customer table has one row per customer, and each pipeline run updates the rows that changed and inserts the rows that are new. The matching key is the business key, the column that uniquely identifies each entity in the source system. MERGE is supported in Snowflake, BigQuery, Postgres (as INSERT ON CONFLICT), Iceberg, and Delta. The semantic differences across these implementations are mostly cosmetic; the matching clause, the WHEN MATCHED action, and the WHEN NOT MATCHED action exist in every dialect. The portability of the pattern is one of its strengths; pipelines that use MERGE today on Snowflake can be ported to BigQuery or Iceberg with surface-level edits, while pipelines that depend on a vendor-specific UPSERT often cannot.

	MERGE INTO dim_customer AS target
	USING staging_customer_changes AS source
	ON target.customer_id = source.customer_id
	WHEN MATCHED THEN UPDATE SET target.email = source.email, target.country = source.country, target.tier = source.tier, target.updated_at = source.updated_at
	WHEN NOT MATCHED THEN INSERT (customer_id, email, country, tier, updated_at
	) VALUES (source.customer_id, source.email, source.country, source.tier, source.updated_at
	)
	/* Pattern 2: MERGE on customer_id */

MERGE is idempotent when run on the same input twice because the second run finds the same matches and applies the same updates. The result is identical. The risk is when the same staging row appears twice with different values; MERGE applies the last one it sees, which is non-deterministic if the source has duplicates. The standard guard is to deduplicate the source on the business key before the MERGE, picking the latest row by updated_at.

Pattern 3: DELETE-then-INSERT in a Transaction

DELETE-then-INSERT is the fallback for destinations that do not support partition overwrite or MERGE, or for cases where the partition is logical rather than physical. A pipeline that writes a daily summary to a Postgres operational table for an internal app might use this pattern because Postgres does not have INSERT OVERWRITE PARTITION, and the table is partitioned only logically by a date column. The DELETE clears the logical partition; the INSERT writes the new rows; the transaction makes the pair atomic.

	BEGIN ; DELETE

	FROM daily_customer_summary
	WHERE summary_date = '2026-04-25' ; INSERT INTO daily_customer_summary(summary_date, customer_id, order_count, total)

	SELECT
	'2026-04-25',
	customer_id,
	COUNT(*),
	SUM(amount)
	FROM orders
	WHERE order_date = '2026-04-25'
	GROUP BY customer_id ; COMMIT ;

Choosing Among the Three

Pattern	Best Fit	Failure Mode
Partition overwrite	Time-partitioned destination, one partition per run	Wrong partition grain (hourly job overwrites daily partition)
MERGE on business key	Current-state tables (one row per customer, updated as the customer changes); upserts from CDC streams	Duplicate keys in source; non-deterministic merge result
DELETE-then-INSERT in transaction	Operational databases without partition support; logical partitioning	Forgotten transaction wrapper; partial visibility between DELETE and INSERT

Partition OverwriteMERGEDELETE-then-INSERT

Partition Overwrite

Default for time-partitioned data

INSERT OVERWRITE PARTITION (date='X'). Atomic per partition. Snowflake, BigQuery, Spark, Iceberg, Delta all support it.

MERGE

Default for current-state tables

MERGE INTO target USING source ON business_key. Updates matched, inserts unmatched. Postgres uses INSERT ON CONFLICT.

DELETE-then-INSERT

Fallback when others do not fit

BEGIN; DELETE WHERE partition_col = X; INSERT ...; COMMIT. Atomic via transaction. Works on every relational database.

Combining Patterns in One Pipeline

Real pipelines often combine all three. The raw landing zone uses partition overwrite by ingestion date. The curated fact tables use partition overwrite by business date. The dimension tables use MERGE on the business key. The serving marts that feed dashboards may use DELETE-then-INSERT inside a transaction because the destination is the operational application database. Treating the choice as per-table rather than per-pipeline lets each destination get the pattern that fits its shape, and lets the pipeline as a whole stay idempotent end to end. The architecture diagram of a mature data platform shows all three patterns in use, often in the same DAG, and a senior engineer reading the diagram can name which destination uses which pattern from the shape of the table alone.

Quick decision rule:

▸Time-partitioned destination, one partition per run: partition overwrite
▸Stable entities with one row each, updated over time: MERGE on business key
▸Logical partition in a destination without OVERWRITE or MERGE: DELETE-then-INSERT in a transaction

TIP

When uncertain which pattern applies, ask 'what is the unit of work for one pipeline run' and 'what is the unit of identity in the destination.' If those two units match, pick the pattern that matches them; partition for time-based units, key for entity-based units.

Choosing a Business Key

Daily Life

Interviews

Pick a business key (natural, surrogate, or composite) for an idempotent MERGE and enforce uniqueness as the foundation.

MERGE and deduplication both depend on a key that uniquely identifies each row. The right key seems obvious until the pipeline ingests its first edge case: an order with a null user_id, two events with the same timestamp, a CDC stream that emits one event for the row before the change and another for after, sharing the same primary key, a vendor that recycles IDs after a long enough interval, a soft delete that resurfaces the same logical row weeks later. Picking the wrong key turns an idempotent pipeline into a non-idempotent one without changing any other code. Picking the right key requires a clear-eyed view of where the key comes from, how stable it is across upstream changes, and what happens at the seams where multiple sources contribute to the same destination. The key choice is one of the most consequential and least visible decisions in a pipeline; it rarely shows up in code review but determines correctness for years.

Three Kinds of Keys

Kind	What It Is	Example
Natural key	A column that already identifies the entity in the real world	email address, ISBN, social security number
Surrogate key	A meaningless identifier the system assigns	auto-incremented order_id, UUID, hash of all source columns
Composite key	Multiple columns whose combination is unique	(order_id, line_item_number); (customer_id, event_timestamp, event_type)

Natural Keys: Cheap to Read, Easy to Break

Natural keys are convenient because they already exist in the source. The customer table can use email as the key; everyone has an email, and a customer is identified by it. The trap is that natural keys belong to the real world, and the real world changes. A customer changes their email. A book gets reissued under a new ISBN. A company merges and the email domain changes. When the natural key changes, the row in the destination silently becomes a new entity, and historical lineage breaks. Natural keys are tempting in the short term and painful at the boundary where the real world updates.

Surrogate Keys: Stable but Require Generation Discipline

Surrogate keys are values the data system itself assigns. The most common forms are auto-incremented integers, UUIDs, and hashes of source columns. Surrogate keys are stable because they do not depend on any real-world property; they depend only on a deterministic generation rule. The discipline they require is that the generation rule produces the same key for the same logical row across runs. An auto-increment that assigns 1, 2, 3 on the first run and 4, 5, 6 on the second is a non-idempotent surrogate, which defeats the purpose. A hash of the source's stable identifying columns is the safest form because it is reproducible from the input alone. Composite keys are the natural choice when no single column identifies a row; order line items use (order_id, line_number), and multi-tenant event streams use (tenant_id, tenant_event_id). Composite keys are honest about the shape of the data. The honesty has practical value at debugging time: a composite key spells out which dimensions had to combine to produce uniqueness, which is the same information needed to write a correct deduplication query when the key constraint fails.

•Natural Key Risks

Real-world changes (email update, ISBN reissue) break identity
Source-system bugs propagate as duplicate or merged rows
Cross-system joins assume consistent natural keys; rarely true
PII in the key complicates redaction and access control

✓Surrogate Key Discipline

Generation rule produces the same key for the same logical row
Real-world changes are visible as updates to the entity, not new entities
Cross-system joins use the surrogate; natural keys become attributes
PII stays out of the key; the key is opaque and stable

The Foundation: Unique Key Constraint

Whichever kind of key is chosen, the destination must enforce uniqueness on it. A primary key constraint, a unique index, or an explicit assertion in the pipeline. Without enforcement, a bug that produces duplicate keys lands silently in the table, and downstream MERGEs become non-deterministic. Enforcement turns a duplicate-key event into a loud failure that can be diagnosed at write time, rather than a silent corruption that surfaces at quarter end. Most modern warehouses do not enforce primary keys at the engine level (Snowflake and Redshift treat them as informational); the responsibility falls to the pipeline itself, usually as a quality check immediately after the write. Iceberg and Delta Lake provide stronger guarantees if the table is configured correctly, but configuration is the operative word; the pipeline author has to opt in. The default-on-by-default expectation borrowed from Postgres does not transfer to most analytical systems, and assuming it does is a common source of duplicate-key bugs.

	/* A uniqueness assertion that runs after every write */
	/* Returns zero rows if the key is unique; non-zero rows are duplicates */
	SELECT
	customer_id,
	COUNT(*) AS dup_count
	FROM dim_customer
	GROUP BY customer_id
	HAVING COUNT(*) > 1

A passing assertion returns zero rows. A failing assertion returns the duplicates. The assertion runs as a quality gate after the MERGE; if it fires, the run is marked failed and the on-call investigates. The assertion is cheap to compute and catches the entire class of bugs that would otherwise corrupt downstream aggregates. It is the single most valuable line of SQL to add to any MERGE-based pipeline.

Hashing for Reproducible Surrogates

	# A reproducible surrogate key: hash of stable source columns
	import hashlib

	def surrogate(source_system, source_id, event_timestamp):
	raw = f"{source_system}\|{source_id}\|{event_timestamp}"
	return hashlib.md5(raw.encode()).hexdigest()[:16]

	# Same input -> same key, every time
	print(surrogate('stripe', 'evt_8a91b', '2026-04-25T14:33:08Z'))
	print(surrogate('stripe', 'evt_8a91b', '2026-04-25T14:33:08Z'))
	print(surrogate('stripe', 'evt_8a91c', '2026-04-25T14:33:08Z'))

The hash is deterministic in the input. Running the function twice on the same source row yields the same surrogate. Running it on a different source row yields a different one. The pipeline can use this surrogate as the MERGE key, the dedup key, or the partition key, and the property that running twice produces the same result holds because the keys are fixed by the input rather than generated at runtime.

Three signs the wrong key was chosen:

▸MERGE produces different results on identical input across runs
▸The destination accumulates near-duplicates that differ in one obscure column
▸Backfills create rows that exist alongside the rows from the original run rather than replacing them

✓Do

Prefer surrogate keys generated as hashes of stable source columns
Add a uniqueness assertion immediately after every MERGE
Document the key choice in the pipeline contract; future maintainers need to know what is stable

✗Don't

Use email or other mutable natural attributes as the primary key
Rely on auto-increment for surrogate keys that need to be reproducible across runs
Skip the uniqueness check because 'the warehouse does not enforce constraints anyway'

TIP

When a MERGE produces duplicates after a refactor, the cause is almost always the key. Check whether the source ever produces two rows with the same key in a single run before assuming the MERGE itself is wrong.

Reads Can Be Non-Idempotent Too

Daily Life

Interviews

Replace wall-clock reads with explicit time bound parameters and distinguish event time from processing time in every transform.

Most discussions of idempotency focus on writes. The hidden second half is reads. A pipeline that reads non-deterministic input cannot be idempotent in any useful sense, because the same logical run produces different output on different days. The most common offenders are SELECT NOW(), CURRENT_DATE, and any function that resolves to wall-clock time inside a transform. Other offenders include random number generators without seeds, environment variables that drift between runs, and any external service whose response varies over time without a corresponding input change. The fix is not subtle and it is not optional. Pipelines should accept explicit time bounds as parameters, not derive them from the current moment, and any other source of run-time variability needs the same treatment: pin it in the parameters or accept that the pipeline is non-idempotent in reads.

The NOW() Bug


	INSERT INTO daily_active_users
	SELECT
	user_id,
	COUNT(*) AS event_count
	FROM events
	WHERE event_timestamp >= DATEADD('day', - 1, NOW()) AND event_timestamp < NOW()
	GROUP BY user_id ;

Read this query at 11:30am on Tuesday and it returns Monday 11:30am through Tuesday 11:30am. Read it at 2:15pm and it returns Monday 2:15pm through Tuesday 2:15pm. The boundaries shift continuously. Running the same pipeline at 11:30 and again at 2:15 produces different results, neither of which corresponds to a clean calendar day. Backfilling the pipeline a week from now is impossible because NOW() will have advanced; there is no way to get the data the original run produced.

The Fix: Explicit Time Bounds


	INSERT OVERWRITE TABLE daily_active_users PARTITION(activity_date = '2026-04-25')
	SELECT
	user_id,
	COUNT(*) AS event_count
	FROM events
	WHERE event_timestamp >= '2026-04-25 00:00:00' AND event_timestamp < '2026-04-26 00:00:00'
	GROUP BY user_id

The fixed query takes the date as a parameter. The pipeline framework, the orchestrator, or the dbt model passes the value at run time. Running the query for April 25 today produces the same result as running it next week or next month. The query is now idempotent in the read direction, which is the prerequisite for idempotency in the write direction; INSERT OVERWRITE on top of a non-deterministic SELECT is still non-deterministic, with different rows being overwritten on each run.

Where NOW() Hides

Hidden NOW()	What It Looks Like	Why It Bites
Direct NOW() in the WHERE clause	WHERE event_ts >= NOW() - INTERVAL '1 day'	Window slides with execution time; backfill impossible
CURRENT_DATE in a JOIN condition	ON s.snapshot_date = CURRENT_DATE	Joins to today's snapshot regardless of which day is being processed
Default for a partition column	INSERT INTO t (dt, ...) VALUES (CURRENT_DATE, ...)	Partition column reflects run time, not data time
Implicit in source CDC stream cutoff	Reads up to whatever was in Kafka when the consumer connected	Backfill to last Tuesday cannot reproduce last Tuesday's offsets

The Two Time Concepts

Pipelines operate on two distinct time concepts. Event time is the moment when the event being recorded actually happened: when the order was placed, when the user clicked, when the sensor read. Processing time is the moment when the pipeline ran. The two are different even on the happy path; events that happened at 11:59pm may not arrive in the source until 12:03am. The two are very different in failure or backfill scenarios; an event that happened on April 25 may be reprocessed by a backfill that runs in May. Filtering by event time produces stable, idempotent results. Filtering by processing time does not.

•Filter by Processing Time

Boundaries shift with every execution
Backfill produces different results than the original run
Late-arriving data is silently included or excluded based on when the pipeline notices it
Cross-pipeline joins drift because each pipeline ran at a slightly different moment

✓Filter by Event Time

Boundaries are fixed by the parameters passed in
Backfill produces the same result as the original run for any historical day
Late-arriving data has a clear policy: included if the window is reopened, excluded otherwise
Cross-pipeline joins align on the event date both pipelines processed

How Modern Frameworks Pass the Parameters

	# An Airflow task that passes explicit time bounds via the templating system
	run_orders_aggregation = SnowflakeOperator(
	task_id='aggregate_daily_orders',
	sql=""" INSERT OVERWRITE TABLE daily_orders PARTITION (order_date = '{{ ds }}') SELECT order_id, customer_id, amount FROM staging_orders WHERE order_timestamp >= '{{ ds }} 00:00:00' AND order_timestamp < '{{ next_ds }} 00:00:00' """,
	schedule_interval='@daily',
	)

	# {{ ds }} is the logical execution date for this run, not the wall clock

Airflow templates provide a logical execution date that the orchestrator binds to each scheduled or backfilled run. dbt has the same idea via vars and the dbt_utils.dateadd macros. Spark structured streaming exposes event time as a first-class concept. Every modern framework supports explicit time bounds because every modern framework was built by people who had been burned by NOW(). The framework support is there; the discipline of using it has to come from the pipeline author.

The two time-bound rules:

▸Every transform that filters by time accepts the time bounds as parameters, never reads the wall clock
▸Partition columns and date filters use event time, not processing time
▸Defaults for development should be a fixed historical date, not CURRENT_DATE; defaults bleed into production

A pipeline that uses NOW() is not really running on April 25; it is running on whenever someone happens to invoke it. The two coincide on the happy path and diverge on every other path.

NOW() and CURRENT_DATE inside transforms make backfills produce different results than original runs.

Event time is the canonical filter; processing time is a hidden non-determinism.

Modern orchestrators (Airflow ds, Dagster partitions, dbt vars) pass explicit time bounds for free; using them is a discipline, not a tooling problem.

TIP

When auditing a pipeline for idempotency, search the SQL for NOW, CURRENT_DATE, CURRENT_TIMESTAMP, SYSDATE, and GETDATE. Each occurrence is a candidate bug; each one needs to be replaced with an explicit parameter or justified in writing.

Source

source

Spark job

job

MERGE / overwrite by key

target

Idempotency = replace, do not append. Write by MERGE or partition-overwrite keyed on a business key, so re-running the job (a retry or a backfill) produces the same rows instead of doubling them.

The Backfill Operation

Daily Life

Interviews

Run a backfill over a historical date range and explain why backfill only works on idempotent pipelines.

Backfill is the act of running a pipeline over historical date ranges, usually to fix data that was wrong or to populate a new pipeline with history it was not built to capture in real time. Backfill is the operational payoff of idempotency. A pipeline that is idempotent supports backfill almost for free: pass a different date range, run the pipeline, get the right answer. A pipeline that is not idempotent does not support backfill at all; running it on a historical date corrupts whatever data is currently there. The asymmetry is large enough that a single backfill request from finance or product can determine whether the pipelines it touches are operationally healthy or operationally hostile, and that determination shows up in retention numbers for the data engineering team that has to do the backfill.

Why Backfills Are Common

Reason for Backfill	Typical Frequency	Time Range
New downstream consumer needs historical data	Several times per quarter per active pipeline	30 days to 2 years
Bug fix changed how rows are computed; old runs need correction	A few times per year per pipeline	Days to months, depending on when the bug landed
Source corrected its data; downstream needs to reprocess	Occasional, depending on source quality	Days to weeks
Schema migration; old data needs to be reshaped to the new schema	Once per major migration	Full history of the affected tables

Backfills are not a rare emergency operation. They are an everyday part of operating a data platform. Any pipeline that has been running for more than a year has likely been backfilled at least once. The pipelines that are easy to backfill are pipelines that are pleasant to operate; the pipelines that fight backfill at every step are pipelines that the on-call rotation tries to avoid. Designing for backfill from the start is one of the markers that distinguishes mid-level engineering from senior engineering. The cost of designing for it is small at construction time and large at remediation time, which is the typical shape of operational investments and the typical reason they are skipped under deadline pressure.

How a Backfill Runs

	# A backfill driver in pseudocode
	# The pipeline's run() function takes an explicit date parameter
	from datetime import date, timedelta

	start = date(2026, 4, 1)
	end = date(2026, 4, 25)

	current = start
	while current <= end:
	pipeline.run(run_date=current)
	current += timedelta(days=1)

	# Each call to pipeline.run replaces the partition for that day.
	# The order of dates does not matter for correctness; it only affects ordering of writes.

A backfill is a loop over dates. Each iteration runs the same pipeline that runs in production, with a different date parameter. Because the pipeline is idempotent, each iteration produces the right answer for its date and replaces whatever was there before. The loop completes when every date has been processed. The result is the same as if the pipeline had run on every one of those days at the time. The backfill is, in a sense, time travel for the pipeline.

Orchestrator Support

Every modern orchestrator has first-class backfill support. Airflow has the airflow dags backfill command and the catchup setting. Dagster has partitioned jobs and the backfill UI. Prefect has a similar concept via deployments and date ranges. The orchestrator handles parallelism, dependency resolution across upstream and downstream tasks, and progress tracking. The orchestrator does not handle the requirement that the pipeline be idempotent; that requirement is the pipeline author's. With idempotency, backfill is a feature. Without it, backfill is a guarantee of corruption.

The four prerequisites for safe backfill:

▸The pipeline accepts an explicit date or date range as a parameter
▸Every transform filters by event time using that parameter, not by NOW()
▸Every write is partition overwrite, MERGE, or DELETE-then-INSERT in a transaction
▸Downstream consumers can tolerate the brief window where data is being rewritten

Forward BackfillBackward Backfill

Forward Backfill

Oldest to newest

Processes from the oldest missing date toward the present. Typical when populating a new pipeline with history. Dashboards see the natural progression as new dates appear.

Backward Backfill

Recent to past

Processes from a recent date back into the past. Typical when fixing a bug introduced at a known date. Today's data is right; yesterday's becomes right shortly after.

The Operational Cost of Backfill

Cost Dimension	What It Looks Like	Mitigation
Compute spike	A 365-day backfill is 365 normal runs in a short window	Throttle parallelism; use a separate compute pool to protect production
Source pressure	Backfill reads from upstream sources; sources may rate-limit	Coordinate with source owners; use cached raw data when available
Downstream visibility	Consumers see data being rewritten; aggregates may shift mid-day	Backfill into a shadow table; swap atomically when complete
Cost attribution	Backfill spend may not be attributable to the requesting team	Tag backfill runs with a project ID; bill back to the requester

Backfills are not free. A 90-day backfill of a heavy aggregation may cost more than a month of normal runs combined. The cost is rarely the blocker (the value of correct data usually exceeds the warehouse spend), but the cost should be visible. Tagging backfill runs and reporting their compute spend separately keeps the conversation honest and prevents the situation where a single backfill quietly doubles the data team's monthly bill.

✓Idempotent Pipeline: Backfill is Easy

Loop over dates; each iteration runs the production pipeline
Orchestrator backfill commands work out of the box
Result is identical to having run on each date originally
Total work: parameterize the dates, click run

•Non-Idempotent Pipeline: Backfill is Bespoke

Custom script per backfill, written from scratch each time
Orchestrator backfill cannot be used; data would corrupt
Each backfill carries human review of the corruption it might cause
Total work: weeks of careful one-off engineering per request

✓Do

Test the backfill path during initial pipeline development; one historical date is enough
Use the orchestrator's first-class backfill support rather than writing custom scripts
Tag backfill runs separately so their compute spend is visible and attributable

✗Don't

Treat backfill as a rare event the pipeline does not need to support; rare-event work always becomes routine work
Run unbounded backfills against production-shared compute; isolate them
Backfill across pipelines that share state without coordinating; one pipeline's backfill can break the next

TIP

When designing a new pipeline, the second test (after the two-run idempotency test) is the backfill test: can the pipeline be invoked for a date a month ago and produce the right answer. If not, the design is incomplete.

Refactor: From ETL to Idempotent

Daily Life

Interviews

Refactor a non-idempotent ETL into an idempotent one by adding a partition key, replacing INSERT with DELETE-then-INSERT, and parameterizing the time window.

The patterns are clearer when applied to a real refactor. The pipeline below is a real-shaped daily ETL that ingests payments from a Stripe-like API, joins them to customer accounts, and writes a daily payments fact table. The original version was written in a hurry and has every common idempotency bug at once. The refactored version applies the three patterns above: partition keys, MERGE on a business key, and explicit time bounds. The diff is the worked example. Refactors of this shape are common because most production pipelines were written under deadline pressure by engineers who had not yet been burned by the failure modes in this lesson; the bugs are not malicious or careless, they are the natural state of a pipeline that has not yet had the property added explicitly.

Before: The Original Pipeline

	# Original payments ETL
	# Three idempotency bugs hidden in 24 lines

	def run():
	today = datetime.now().date()
	payments = stripe_api.list_payments(
	created_after=datetime.now() - timedelta(days=1)
	)
	customers = postgres.fetch("SELECT id, email, country FROM customers")

	customer_map = {c['id']: c for c in customers}
	enriched = []
	for p in payments:
	cust = customer_map.get(p['customer_id'])
	if cust:
	enriched.append({
	**p,
	'customer_email': cust['email'],
	'customer_country': cust['country'],
	})

	for row in enriched:
	snowflake.execute(
	"INSERT INTO fct_payments VALUES (?, ?, ?, ?, ?, ?)",
	(row['id'], row['amount'], row['customer_id'], row['customer_email'],
	row['customer_country'], row['created_at']),
	)

Naming the Three Bugs

Bug	Where It Lives	Why It Breaks Idempotency
Reads NOW() to derive the time window	datetime.now() inside list_payments	Window slides with execution time; backfill impossible
INSERT-only writes	snowflake.execute("INSERT INTO ...")	Each retry adds rows on top; duplicates accumulate
No partition key on the destination	fct_payments has no day or hour column being used as a partition	Backfilling a historical day duplicates that day rather than replacing it

After: The Refactored Pipeline

	# Refactored payments ETL
	# Idempotent in reads, idempotent in writes, supports backfill

	def run(run_date):
	window_start = datetime.combine(run_date, time.min)
	window_end = window_start + timedelta(days=1)

	payments = stripe_api.list_payments(
	created_after=window_start,
	created_before=window_end,
	)
	customers = postgres.fetch("SELECT id, email, country FROM customers")
	customer_map = {c['id']: c for c in customers}

	enriched = [
	{
	**p,
	'payment_date': run_date.isoformat(),
	'customer_email': customer_map.get(p['customer_id'], {}).get('email'),
	'customer_country': customer_map.get(p['customer_id'], {}).get('country'),
	}
	for p in payments
	]

	snowflake.execute("BEGIN")
	snowflake.execute(
	"DELETE FROM fct_payments WHERE payment_date = ?",
	(run_date.isoformat(),),
	)
	snowflake.execute_many(
	"INSERT INTO fct_payments VALUES (?, ?, ?, ?, ?, ?, ?)",
	[(r['id'], r['amount'], r['customer_id'], r['customer_email'],
	r['customer_country'], r['created_at'], r['payment_date'])
	for r in enriched],
	)
	snowflake.execute("COMMIT")

What Changed, Line by Line

Change	Pattern Applied	Effect
run() now takes a run_date parameter	Explicit time bounds	Backfill becomes possible; the date is no longer derived from the wall clock
list_payments uses created_after and created_before from the parameter	Explicit time bounds	Window is fixed by parameter; same date produces same payments every run
fct_payments gains a payment_date column populated from run_date	Partition key	Destination is now logically partitioned; backfill replaces by date
INSERT replaced with DELETE-then-INSERT in a transaction	Pattern 3: DELETE-then-INSERT	Retries replace the partition; duplicates are structurally impossible
Customers fetched fresh each run	Acceptable	Customer table is small; fetching current state is correct for SCD type 1 dim joining

The Backfill Now Works

	# Backfill 30 days of payments
	from datetime import date, timedelta

	start = date(2026, 3, 26)
	end = date(2026, 4, 25)

	current = start
	run_count = 0
	while current <= end:
	# In the real refactored pipeline, this would call run(run_date=current)
	# which replaces the partition for that day
	run_count += 1
	current += timedelta(days=1)

	print(f"Backfill ran {run_count} idempotent days")
	print("Running it again would produce the identical result")

The backfill loop is the same shape from the previous section, applied to the refactored pipeline. Each iteration calls the same run function the production schedule calls, with a different date. Each call replaces the partition for that day. The end state is the same as if the pipeline had run on each of those 31 days at the time. The original pipeline could not be backfilled without manual cleanup; the refactored one supports it as a feature.

What Stayed the Same

The core logic did not change. The pipeline still pulls payments, joins customers, and writes a fact table. The same columns are populated, the same joins happen, the same business meaning is produced. Idempotency is not about changing what the pipeline does; it is about changing how it writes and how it bounds its reads. The refactor adds 14 lines and removes nothing. The behavioral change is enormous; the algorithmic change is small. This is the typical signature of an idempotency refactor: small surface change, large operational consequence. The same signature appears in HTTP idempotency refactors, in distributed-lock refactors, and in any other place where the property is being added explicitly to a system that originally did not have it.

•Original Pipeline

Reads NOW() to compute the time window
INSERT-only writes; retries duplicate
No partition key on the destination
Backfill requires a custom cleanup script per request

✓Refactored Pipeline

Reads explicit run_date parameter to compute the window
DELETE-then-INSERT in a transaction; retries replace
payment_date column functions as the logical partition
Backfill is a date-range loop over the same run() function

The refactor checklist:

▸Replace every NOW() and CURRENT_DATE with an explicit parameter
▸Add a partition column to the destination if there is not one already
▸Replace INSERT with one of the three idempotent write patterns
▸Wrap multi-statement writes in a transaction
▸Test the two-run case and the backfill case before declaring the refactor done

Parameterize the date; remove every NOW() and CURRENT_DATE.

Add the partition column; convert INSERT to one of the three idempotent write patterns.

Verify with the two-run test and the backfill test before declaring done.

✓Do

Refactor in small increments; one bug at a time, each verified by the two-run test
Add a uniqueness assertion immediately after every MERGE or DELETE-then-INSERT
Document the partition key and the time-bound parameter in the pipeline contract

✗Don't

Refactor everything at once; small reversible changes are easier to validate
Skip the backfill verification because 'the production schedule will catch any bug'
Hide the partition column from consumers; it is part of the contract, not an implementation detail

TIP

When inheriting a pipeline that is suspected of being non-idempotent, the fastest diagnosis is to run it twice on the same date in a development environment and compare the destination row counts. If they differ, the pipeline needs the refactor above.

❯❯❯PUTTING IT ALL TOGETHER

> A logistics company has 400 nightly batch pipelines accumulated over four years. Finance asks for a corrected revenue figure for Q1 after a billing system migration. Two thirds of the pipelines have hardcoded NOW() calls, INSERT-only writes, and no partition keys. The new head of data engineering is asked to deliver the Q1 correction this quarter and to prevent the next backfill request from becoming a three-week project.

Step one: triage the pipelines that affect the Q1 number. Apply the refactor pattern from this lesson to each: explicit time bounds, partition keys, idempotent writes. The targeted refactor delivers the Q1 correction in days rather than weeks.

Step two: extend the refactor to the rest of the pipelines as a structural project. Pick one of the three idempotent write patterns per destination: partition overwrite for time-partitioned curated tables, MERGE on business key for dimensions, DELETE-then-INSERT for operational targets without OVERWRITE support.

Step three: enforce uniqueness on every business key. Surrogate keys generated as hashes of stable source columns are reproducible across runs; natural keys with mutable real-world identity are not. The uniqueness assertion runs after every MERGE as a quality gate.

Step four: replace every NOW() and CURRENT_DATE with explicit time bounds passed by the orchestrator. Event time becomes the canonical filter; processing time is no longer a hidden dependency. (This is the read-side counterpart to partition overwrite from Lesson 5 beginner tier.)

Step five: use the orchestrator's first-class backfill support (Airflow's catchup, Dagster partitioned jobs) once the pipelines are idempotent. The next backfill request becomes a date-range loop, not a custom engineering project. (Builds on the DAG concept from Lesson 4 and the layered architecture from Lesson 1.)

KEY TAKEAWAYS

Three idempotent write patterns cover almost everything: partition overwrite for time-partitioned data, MERGE on business key for current-state tables, DELETE-then-INSERT in a transaction for everything else.

Pick the key that is stable: surrogate keys generated as hashes of source columns survive real-world changes; natural keys do not. Enforce uniqueness with an assertion after every MERGE.

Reads can be non-idempotent too: NOW() and CURRENT_DATE inside transforms break backfills. Pipelines accept explicit time bounds and filter by event time, not processing time.

Backfill is the operational payoff: an idempotent pipeline supports backfill as a date-range loop over the production code; a non-idempotent pipeline requires a custom script per request.

The refactor is small in code and large in property: parameterize the date, add a partition column, replace INSERT with one of the three patterns, wrap in a transaction. The behavioral change is enormous; the algorithmic change is minor.

Idempotent writes, business keys, and explicit time bounds turn pipelines into replayable systems

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 32 minutes
Challenges: 0 hands-on challenges

Topics covered: Three Idempotent Write Patterns, Choosing a Business Key, Reads Can Be Non-Idempotent Too, The Backfill Operation, Refactor: From ETL to Idempotent

Lesson Sections

Three Idempotent Write Patterns (concepts: paIdempotency)
Three write patterns cover the vast majority of idempotent batch pipelines. Partition overwrite replaces a slice of the destination table identified by a partition key. MERGE matches incoming rows to existing rows by a business key and updates or inserts as appropriate. DELETE-then-INSERT inside a transaction clears a logical slice and writes its replacement atomically. Each pattern has a niche, and a senior engineer reaches for the right one without thinking. Reaching for the wrong one produces
Choosing a Business Key (concepts: paIdempotency)
MERGE and deduplication both depend on a key that uniquely identifies each row. The right key seems obvious until the pipeline ingests its first edge case: an order with a null user_id, two events with the same timestamp, a CDC stream that emits one event for the row before the change and another for after, sharing the same primary key, a vendor that recycles IDs after a long enough interval, a soft delete that resurfaces the same logical row weeks later. Picking the wrong key turns an idempoten
Reads Can Be Non-Idempotent Too (concepts: paIdempotency)
Most discussions of idempotency focus on writes. The hidden second half is reads. A pipeline that reads non-deterministic input cannot be idempotent in any useful sense, because the same logical run produces different output on different days. The most common offenders are SELECT NOW(), CURRENT_DATE, and any function that resolves to wall-clock time inside a transform. Other offenders include random number generators without seeds, environment variables that drift between runs, and any external
The Backfill Operation (concepts: paBackfill)
Backfill is the act of running a pipeline over historical date ranges, usually to fix data that was wrong or to populate a new pipeline with history it was not built to capture in real time. Backfill is the operational payoff of idempotency. A pipeline that is idempotent supports backfill almost for free: pass a different date range, run the pipeline, get the right answer. A pipeline that is not idempotent does not support backfill at all; running it on a historical date corrupts whatever data i
Refactor: From ETL to Idempotent (concepts: paIdempotency)
The patterns are clearer when applied to a real refactor. The pipeline below is a real-shaped daily ETL that ingests payments from a Stripe-like API, joins them to customer accounts, and writes a daily payments fact table. The original version was written in a hurry and has every common idempotency bug at once. The refactored version applies the three patterns above: partition keys, MERGE on a business key, and explicit time bounds. The diff is the worked example. Refactors of this shape are com