What Does Idempotent Mean? Idempotency for Data Engineers
Idempotent means running an operation multiple times produces the same result as running it once. For data engineers, this isn't theoretical — it's the difference between a pipeline that recovers gracefully from failures and one that corrupts your data every time it retries.
What this guide actually says
Idempotent means running an operation N times produces the same result as running it once. For data pipelines, this isn't theoretical — it's the difference between a pipeline that recovers gracefully from failures and one that corrupts your data every time it retries. The four patterns that make pipelines idempotent: MERGE, DELETE+INSERT, partition overwrite, INSERT ON CONFLICT. The most common anti-pattern is a bare INSERT without deduplication.
Why idempotency matters
Four real-world scenarios where non-idempotent pipelines break. Each is preventable with the right write pattern.
Pipeline failure recovery
A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline isn't idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times.
Backfills
Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline isn't idempotent, each re-run appends to existing data instead of replacing it. Ninety days of duplicated rows. An idempotent pipeline overwrites or merges cleanly.
Scheduler retries
Airflow retries a failed task 3 times by default. If the task isn't idempotent, each retry appends partial results. After 3 retries, you might have tripled data for the rows that succeeded. Idempotent tasks can be retried safely because repeated execution doesn't change the outcome beyond the first successful run.
Concurrent execution
Two instances of the same pipeline run at the same time because of a scheduler bug or a manual trigger during an automated run. If the pipeline isn't idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.
Idempotent write patterns
Four patterns that make pipeline writes idempotent. Each ensures re-running with the same input produces the same target state.
MERGE / Upsert
Match on a unique key. If the row exists, update it. If it doesn't, insert it. Most common idempotent write pattern for dimension and entity tables. Re-running with the same source produces the same target state every time. Use for: dimension tables, entity tables, any table with unique identifiers where you want the latest version of each row.
DELETE + INSERT
Delete all rows for a given partition or key range, then insert the fresh data. The combination is idempotent because running it again deletes what was just inserted and re-inserts the same rows. Simpler than MERGE; works well for fact tables with natural partitions like date. Use for: fact tables partitioned by date, event tables with a clear time boundary, any table where replacing a chunk is easier than merging individual rows.
Partition overwrite
In Spark and Hive, INSERT OVERWRITE replaces an entire partition atomically. The old partition data is deleted and replaced with the new in one operation. The idempotent pattern for large-scale data lakes. Cleaner than DELETE + INSERT because the overwrite is atomic at the storage level. Use for: data lakes on S3/GCS/ADLS, Spark pipelines writing to Hive-partitioned tables, Databricks Delta Lake tables.
INSERT ON CONFLICT (PostgreSQL)
PostgreSQL's INSERT ... ON CONFLICT DO UPDATE is a single-statement upsert. Attempts an INSERT and, if a unique constraint is violated, updates the existing row instead. More concise than MERGE; same idempotent behavior in one statement. Use for: PostgreSQL-specific workloads, application-level upserts, small to medium batch loads.
MERGE / Upsert pattern
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);Match on unique key. Update if exists, insert if not. Re-running produces the same state.
DELETE + INSERT pattern
-- Idempotent daily load
BEGIN;
DELETE FROM fact_orders
WHERE order_date = '2026-04-11';
INSERT INTO fact_orders (order_id, order_date, customer_id, amount)
SELECT order_id, order_date, customer_id, amount
FROM staging_orders
WHERE order_date = '2026-04-11';
COMMIT;Replace an entire partition atomically inside a transaction.
Partition overwrite (Spark)
# Spark partition overwrite
df.write \
.mode("overwrite") \
.partitionBy("order_date") \
.saveAsTable("fact_orders")
# Only the partitions present in df are overwritten
# Other partitions remain untouched
# Re-running with the same data produces the same resultAtomic partition replacement at the storage level. Other partitions remain untouched.
INSERT ON CONFLICT (PostgreSQL)
INSERT INTO dim_product (product_id, name, category, updated_at)
VALUES ('SKU-001', 'Widget Pro', 'Hardware', NOW())
ON CONFLICT (product_id) DO UPDATE SET
name = EXCLUDED.name,
category = EXCLUDED.category,
updated_at = EXCLUDED.updated_at;
-- First run: inserts the row
-- Second run: updates the same row with identical values
-- Result is the same either waySingle-statement upsert. More concise than MERGE for application-level work.
Non-idempotent anti-patterns
Common in early-stage pipelines and technical debt. Recognizing them is the first step to fixing them.
Bare INSERT without deduplication
A plain INSERT INTO ... SELECT appends rows every time it runs. If the pipeline runs twice, data is doubled. Three times, tripled. The most common idempotency violation and the easiest to prevent. Fix: switch to MERGE, DELETE + INSERT, or INSERT ON CONFLICT. If using Spark, use .mode('overwrite') instead of .mode('append').
Incrementing a counter without a check
Updating a running total by adding to it (SET count = count + batch_size) is not idempotent. Each execution adds the increment again. The counter drifts further from the correct value with every re-run. Fix: replace the increment with an absolute value. Compute the correct total from source data and SET count = computed_total. Alternatively, use a ledger pattern where individual increments are rows and the total is always a SUM over the ledger.
Anti-pattern: bare INSERT
-- NOT idempotent
INSERT INTO fact_orders (order_id, order_date, amount)
SELECT order_id, order_date, amount
FROM staging_orders;
-- Running this twice doubles the dataAnti-pattern: counter increment
-- NOT idempotent
UPDATE metrics SET page_views = page_views + 1500
WHERE metric_date = '2026-04-11';
-- Each re-run adds another 1500
-- Idempotent alternative
UPDATE metrics SET page_views = 1500
WHERE metric_date = '2026-04-11';Idempotency interview questions
Three questions that test whether you understand idempotency beyond the definition.
What does idempotent mean in the context of data pipelines?
An idempotent pipeline produces the same output regardless of how many times it runs with the same input. Key phrase: 'same result.' Enables safe retries, backfills, failure recovery. Give a concrete example of a non-idempotent operation (bare INSERT) and its idempotent alternative (DELETE + INSERT or MERGE).
How would you make this pipeline idempotent? INSERT INTO fact_events SELECT * FROM staging_events WHERE event_date = :date.
Three common approaches: (1) DELETE + INSERT: delete fact_events where event_date = :date, then insert from staging. Wrap in a transaction. (2) MERGE on event_id (if a unique key exists). (3) Partition overwrite if using Spark or a data lake. Recommend DELETE + INSERT for this case because fact tables are typically partitioned by date and the pattern is simple and clear.
A pipeline has been running twice daily for a week due to a scheduler misconfiguration. Not idempotent. Fix the data.
Step 1: identify duplicates with ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY ingested_at) to find rows where rn > 1. Step 2: delete duplicates (keep earliest or latest depending on use case). Step 3: make the pipeline idempotent using one of the patterns above. Step 4: fix the scheduler. Step 5: add a data quality check that alerts on unexpected row count increases.
Idempotency FAQ
What does idempotent mean?+
Why must data pipelines be idempotent?+
What's the difference between idempotent and deterministic?+
Build bulletproof pipelines
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition