Concepts

Idempotent: What It Means for Data Engineers

Idempotent means running an operation multiple times produces the same result as running it once. For data engineers, this is not a theoretical concept. It is the difference between a pipeline that recovers gracefully from failures and one that corrupts your data every time it retries. This page covers what idempotency means, why pipelines must be idempotent, the patterns that achieve it, and the interview questions that test your understanding.

By the numbers: 40% of pipeline failures fixable, term coined in 1983, 172 L6/staff-level questions, 1,042 interviews analyzed.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Why Idempotency Matters

Four real-world scenarios where non-idempotent pipelines break. Each one is preventable with the right write pattern.

Pipeline Failure Recovery

A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline is not idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times. You re-run with confidence.

Backfills

Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline is not idempotent, each re-run appends to existing data instead of replacing it. Ninety days of duplicated rows. An idempotent pipeline overwrites or merges cleanly, so backfills work without manual cleanup.

Scheduler Retries

Airflow retries a failed task automatically when retries is configured (3 is a common setting; the default is 0). If the task is not idempotent, each retry appends partial results, so after a few retries the rows that succeeded each time are duplicated several times over. Idempotent tasks can be retried safely because repeated execution does not change the outcome beyond the first successful run.
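The retry-safety argument can be sketched in plain Python. Here `table` and `load_partition` are illustrative stand-ins, not a real scheduler or database API: the load writes by key instead of appending, so a retry after a mid-run failure converges to the same state as one clean run.

```python
# Sketch: retrying an idempotent load is safe because the final state
# after N attempts equals the state after one clean run.
# `table` and `load_partition` are illustrative, not a real API.

table = {}  # target "table": key -> row

def load_partition(rows, fail_after=None):
    """Idempotent load: keyed overwrite, so re-runs never duplicate."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated mid-run failure")
        table[row["id"]] = row  # upsert by key, not append

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

# First attempt fails halfway (row 1 was written); the retry completes it.
try:
    load_partition(rows, fail_after=1)
except RuntimeError:
    pass
load_partition(rows)  # retry

assert table == {1: rows[0], 2: rows[1]}  # same as a single clean run
```

A non-idempotent version of `load_partition` (one that appends to a list) would leave the partially written row duplicated after the retry.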

Concurrent Execution

Two instances of the same pipeline run at the same time because of a scheduler bug or a manual trigger during an automated run. If the pipeline is not idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.

Idempotent Write Patterns

Four patterns that make pipeline writes idempotent. Each one ensures that re-running with the same input produces the same target state.

MERGE / Upsert

Match on a unique key. If the row exists, update it. If it does not, insert it. This is the most common idempotent write pattern for dimension tables and entity tables. Re-running the same MERGE with the same source data produces the same target state every time.

When to use: Dimension tables, entity tables, any table with unique identifiers where you want the latest version of each row.

MERGE INTO dim_customer AS target
USING staging_customer AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    name = source.name,
    email = source.email,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

DELETE + INSERT

Delete all rows for a given partition or key range, then insert the fresh data. The combination is idempotent because running it again deletes what was just inserted and re-inserts the same rows. This pattern is simpler than MERGE and works well for fact tables with natural partitions like date.

When to use: Fact tables partitioned by date, event tables with a clear time boundary, any table where replacing a chunk is easier than merging individual rows.

-- Idempotent daily load
BEGIN;
  DELETE FROM fact_orders
  WHERE order_date = '2026-04-11';

  INSERT INTO fact_orders (order_id, order_date, customer_id, amount)
  SELECT order_id, order_date, customer_id, amount
  FROM staging_orders
  WHERE order_date = '2026-04-11';
COMMIT;
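The same pattern can be exercised end to end with Python's built-in sqlite3 module. The table names mirror the SQL above; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INT, order_date TEXT, customer_id INT, amount REAL)")
conn.execute("CREATE TABLE staging_orders (order_id INT, order_date TEXT, customer_id INT, amount REAL)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?, ?)",
    [(1, "2026-04-11", 100, 9.99), (2, "2026-04-11", 101, 25.00)],
)

def load_day(day):
    # DELETE + INSERT in one transaction: re-running replaces, never appends
    with conn:
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
        conn.execute(
            "INSERT INTO fact_orders SELECT * FROM staging_orders WHERE order_date = ?",
            (day,),
        )

load_day("2026-04-11")
load_day("2026-04-11")  # re-run: same final state

count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
assert count == 2  # not 4
```

The `with conn:` block makes sqlite3 commit both statements together, matching the BEGIN/COMMIT wrapper in the SQL version.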

Partition Overwrite

In Spark and Hive, INSERT OVERWRITE replaces an entire partition atomically. The old partition data is deleted and replaced with the new data in one operation. This is the idempotent pattern for large-scale data lakes. It is cleaner than DELETE + INSERT because the overwrite is atomic at the storage level.

When to use: Data lakes on S3/GCS/ADLS, Spark pipelines writing to Hive-partitioned tables, Databricks Delta Lake tables.

# Spark dynamic partition overwrite (PySpark)
# By default, mode("overwrite") replaces the whole table. Setting
# partitionOverwriteMode to "dynamic" restricts the overwrite to the
# partitions present in df.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write \
  .mode("overwrite") \
  .insertInto("fact_orders")  # table already partitioned by order_date

# Only the partitions present in df are overwritten
# Other partitions remain untouched
# Re-running with the same data produces the same result
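The semantics can be modeled in plain Python: think of the target as a mapping from partition key to the full set of rows for that partition, where a write replaces a partition wholesale. Everything here is an illustrative stand-in for the storage layer.

```python
# Conceptual model of partition overwrite: writing a partition replaces
# its contents wholesale, so re-running the same write is a no-op.
table = {
    "2026-04-10": [("o1", 10.0)],
    "2026-04-11": [("o2", 5.0), ("o2", 5.0)],  # bad data to replace
}

def overwrite_partition(table, day, rows):
    table[day] = list(rows)  # replace, never append

fresh = [("o2", 5.0), ("o3", 7.5)]
overwrite_partition(table, "2026-04-11", fresh)
overwrite_partition(table, "2026-04-11", fresh)  # re-run: identical state

assert table["2026-04-11"] == fresh
assert table["2026-04-10"] == [("o1", 10.0)]  # other partitions untouched
```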

INSERT ON CONFLICT (PostgreSQL)

PostgreSQL's INSERT ... ON CONFLICT DO UPDATE is a single-statement upsert. It attempts an INSERT and, if a unique constraint is violated, updates the existing row instead. This is more concise than MERGE and achieves the same idempotent behavior in one statement.

When to use: PostgreSQL-specific workloads, application-level upserts, small to medium batch loads.

INSERT INTO dim_product (product_id, name, category, updated_at)
VALUES ('SKU-001', 'Widget Pro', 'Hardware', NOW())
ON CONFLICT (product_id) DO UPDATE SET
  name = EXCLUDED.name,
  category = EXCLUDED.category,
  updated_at = EXCLUDED.updated_at;

-- First run: inserts the row
-- Second run: updates the same row with identical values
-- Result is the same either way
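SQLite shares PostgreSQL's ON CONFLICT syntax (since SQLite 3.24), so the behavior can be checked with nothing but the Python standard library. The product row is invented; the timestamp column is dropped to keep the check deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_product (product_id TEXT PRIMARY KEY, name TEXT, category TEXT)"
)

upsert = """
INSERT INTO dim_product (product_id, name, category)
VALUES (?, ?, ?)
ON CONFLICT (product_id) DO UPDATE SET
  name = excluded.name,
  category = excluded.category
"""

conn.execute(upsert, ("SKU-001", "Widget Pro", "Hardware"))  # first run: insert
conn.execute(upsert, ("SKU-001", "Widget Pro", "Hardware"))  # second run: update in place

rows = conn.execute("SELECT * FROM dim_product").fetchall()
assert rows == [("SKU-001", "Widget Pro", "Hardware")]  # one row either way
```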

Non-Idempotent Anti-Patterns

These patterns break idempotency. They are common in early-stage pipelines and are a frequent source of technical debt. Recognizing them is the first step to fixing them.

Bare INSERT without deduplication

A plain INSERT INTO ... SELECT appends rows every time it runs. If the pipeline runs twice, the data is doubled. If it runs three times, tripled. This is the most common idempotency violation and the easiest to prevent.

-- NOT idempotent
INSERT INTO fact_orders (order_id, order_date, amount)
SELECT order_id, order_date, amount
FROM staging_orders;
-- Running this twice doubles the data
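The doubling is easy to reproduce with sqlite3; the staged rows below are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INT, amount REAL)")
conn.execute("CREATE TABLE staging_orders (order_id INT, amount REAL)")
conn.execute("INSERT INTO staging_orders VALUES (1, 9.99), (2, 25.0)")

# The anti-pattern: a bare append that knows nothing about prior runs
append = "INSERT INTO fact_orders SELECT * FROM staging_orders"
conn.execute(append)
conn.execute(append)  # accidental re-run

count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
assert count == 4  # 2 staged rows became 4
```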

Fix: Switch to MERGE, DELETE + INSERT, or INSERT ON CONFLICT. If using Spark, use .mode('overwrite') instead of .mode('append').

Incrementing a counter without a check

Updating a running total by adding to it (SET count = count + batch_size) is not idempotent. Each execution adds the increment again. The counter drifts further from the correct value with every re-run.

-- NOT idempotent
UPDATE metrics SET page_views = page_views + 1500
WHERE metric_date = '2026-04-11';
-- Each re-run adds another 1500

-- Idempotent alternative
UPDATE metrics SET page_views = 1500
WHERE metric_date = '2026-04-11';

Fix: Replace the increment with an absolute value. Compute the correct total from source data and SET count = computed_total. Alternatively, use a ledger pattern where individual increments are rows in a table, and the total is always a SUM over the ledger.
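The ledger alternative can be sketched with sqlite3: each batch writes one row keyed by a batch id, so re-running the same batch collides on the key and is ignored instead of double-counted. The schema and batch ids are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_view_ledger ("
    "batch_id TEXT PRIMARY KEY, "  # one row per batch: re-runs collide here
    "metric_date TEXT, "
    "views INT)"
)

def record_batch(batch_id, day, views):
    # ON CONFLICT DO NOTHING makes recording a batch idempotent
    conn.execute(
        "INSERT INTO page_view_ledger VALUES (?, ?, ?) "
        "ON CONFLICT (batch_id) DO NOTHING",
        (batch_id, day, views),
    )

record_batch("2026-04-11-run1", "2026-04-11", 1500)
record_batch("2026-04-11-run1", "2026-04-11", 1500)  # retry: ignored
record_batch("2026-04-11-run2", "2026-04-11", 300)

total = conn.execute(
    "SELECT SUM(views) FROM page_view_ledger WHERE metric_date = ?",
    ("2026-04-11",),
).fetchone()[0]
assert total == 1800  # not 3300
```

The total is always a SUM over the ledger, so the counter can never drift from its underlying facts.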

3 Idempotency Interview Questions

These questions test whether you understand idempotency beyond the definition. Each one probes practical application and debugging skills.

Q1: What does idempotent mean in the context of data pipelines?

What they test:

Foundational understanding. The interviewer wants to hear that an idempotent pipeline produces the same output regardless of how many times it runs with the same input. The key phrase is 'same result.' Mention that this enables safe retries, backfills, and failure recovery. Give a concrete example of a non-idempotent operation (bare INSERT) and its idempotent alternative (DELETE + INSERT or MERGE).

Approach:

Define it in one sentence: 'Running the pipeline N times produces the same result as running it once.' Then give a practical example with a fact table load. Explain why it matters: retries, backfills, and concurrent execution all depend on idempotency.

Q2: How would you make this pipeline idempotent? It currently does INSERT INTO fact_events SELECT * FROM staging_events WHERE event_date = :date.

What they test:

Applied problem-solving. The current pipeline appends rows on every run. The interviewer wants you to propose a fix. There are three common approaches and the right one depends on the table structure and volume.

Approach:

Option 1: DELETE + INSERT. Delete fact_events where event_date = :date, then insert from staging. Wrap in a transaction. Option 2: MERGE on event_id (if a unique key exists). Option 3: Partition overwrite if using Spark or a data lake. Recommend DELETE + INSERT for this case because fact tables are typically partitioned by date and the pattern is simple and clear.

Q3: You discover that a pipeline has been running twice daily for a week due to a scheduler misconfiguration. The pipeline is not idempotent. How do you fix the data?

What they test:

Incident response and debugging. The interviewer wants to see your approach to identifying and removing duplicate data, then preventing the issue from recurring.

Approach:

Step 1: Identify the duplicates. Use ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY ingested_at) to find rows where rn > 1. Step 2: Delete the duplicates (keep the earliest or latest depending on the use case). Step 3: Make the pipeline idempotent using one of the patterns above. Step 4: Fix the scheduler to prevent double runs. Step 5: Add a data quality check that alerts on unexpected row count increases.
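Steps 1 and 2 can be rehearsed with sqlite3, which supports ROW_NUMBER. The natural key and ingest timestamps below are invented; this keeps the earliest copy of each event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_events (event_id INT, ingested_at TEXT)")
conn.executemany(
    "INSERT INTO fact_events VALUES (?, ?)",
    [  # the double-run left each event twice, at different ingest times
        (1, "2026-04-11 06:00"), (1, "2026-04-11 18:00"),
        (2, "2026-04-11 06:00"), (2, "2026-04-11 18:00"),
    ],
)

# Keep the earliest copy of each event (rn = 1), delete the rest
conn.execute("""
    DELETE FROM fact_events
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY event_id ORDER BY ingested_at
                   ) AS rn
            FROM fact_events
        )
        WHERE rn > 1
    )
""")

rows = conn.execute(
    "SELECT event_id, ingested_at FROM fact_events ORDER BY event_id"
).fetchall()
assert rows == [(1, "2026-04-11 06:00"), (2, "2026-04-11 06:00")]
```

In a warehouse without a hidden rowid, the same idea works by selecting the surviving rows into a new table, or by deleting on the full natural key plus the dedup timestamp.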

Idempotency FAQ

What does idempotent mean?

Idempotent means that performing an operation multiple times produces the same result as performing it once. In mathematics, f(f(x)) = f(x). In data engineering, an idempotent pipeline writes the same output regardless of how many times it runs with the same input. This property is critical because pipelines fail, get retried, and get re-run for backfills. Without idempotency, each re-run compounds errors by duplicating or corrupting data.

Why must data pipelines be idempotent?

Pipelines fail. Schedulers retry tasks automatically. Business requirements change and you need to backfill historical data. If a pipeline is not idempotent, retries produce duplicate rows, backfills corrupt existing data, and concurrent runs create inconsistencies. Idempotent pipelines can be safely re-run at any time without manual cleanup. This is not a nice-to-have; it is a requirement for any production data system.

What is the difference between idempotent and deterministic?

A deterministic function always produces the same output for the same input. An idempotent function produces the same result when applied multiple times. They are related but distinct. A pipeline that stamps rows with created_at = NOW() is not strictly deterministic, because the timestamps differ between runs, yet it can still be idempotent if it uses MERGE or DELETE + INSERT to prevent duplicates. Idempotency is about the final state of the target, not about every intermediate value.

Build Bulletproof Pipelines

Idempotency is a requirement for every production pipeline. Practice designing idempotent loads, handling failure recovery, and explaining your approach in mock interviews on DataDriven.