Concepts

Idempotent: What It Means for Data Engineers

Idempotent means running an operation multiple times produces the same result as running it once. For data engineers, this is not a theoretical concept. It is the difference between a pipeline that recovers gracefully from failures and one that corrupts your data every time it retries. This page covers what idempotency means, why pipelines must be idempotent, the patterns that achieve it, and the interview questions that test your understanding.

By the numbers: 40% of pipeline failures fixable, term coined in 1983, 172 L6/staff-level questions, 1,042 interviews analyzed.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Why Idempotency Matters

Four real-world scenarios where non-idempotent pipelines break. Each one is preventable with the right write pattern.

Pipeline Failure Recovery

A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline is not idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times. You re-run with confidence.

Backfills

Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline is not idempotent, each re-run appends to existing data instead of replacing it. Ninety days of duplicated rows. An idempotent pipeline overwrites or merges cleanly, so backfills work without manual cleanup.

Scheduler Retries

Airflow retries a failed task automatically when retries is configured (3 is a common setting; the default is 0). If the task is not idempotent, each retry appends partial results, so after a few retries the rows that succeeded each time are duplicated several times over. Idempotent tasks can be retried safely because repeated execution does not change the outcome beyond the first successful run.
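The retry-safety argument can be sketched in plain Python. Here `table` and `load_partition` are illustrative stand-ins, not a real scheduler or database API: the load writes by key instead of appending, so a retry after a mid-run failure converges to the same state as one clean run.

```python
# Sketch: retrying an idempotent load is safe because the final state
# after N attempts equals the state after one clean run.
# `table` and `load_partition` are illustrative, not a real API.

table = {}  # target "table": key -> row

def load_partition(rows, fail_after=None):
    """Idempotent load: keyed overwrite, so re-runs never duplicate."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated mid-run failure")
        table[row["id"]] = row  # upsert by key, not append

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

# First attempt fails halfway (row 1 was written); the retry completes it.
try:
    load_partition(rows, fail_after=1)
except RuntimeError:
    pass
load_partition(rows)  # retry

assert table == {1: rows[0], 2: rows[1]}  # same as a single clean run
```

A non-idempotent version of `load_partition` (one that appends to a list) would leave the partially written row duplicated after the retry.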

Concurrent Execution

Two instances of the same pipeline run at the same time because of a scheduler bug or a manual trigger during an automated run. If the pipeline is not idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.

Idempotent Write Patterns

Four patterns that make pipeline writes idempotent. Each one ensures that re-running with the same input produces the same target state.

MERGE / Upsert

Match on a unique key. If the row exists, update it. If it does not, insert it. This is the most common idempotent write pattern for dimension tables and entity tables. Re-running the same MERGE with the same source data produces the same target state every time.

When to use: Dimension tables, entity tables, any table with unique identifiers where you want the latest version of each row.

MERGE INTO dim_customer AS target
USING staging_customer AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    name = source.name,
    email = source.email,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

DELETE + INSERT

Delete all rows for a given partition or key range, then insert the fresh data. The combination is idempotent because running it again deletes what was just inserted and re-inserts the same rows. This pattern is simpler than MERGE and works well for fact tables with natural partitions like date.

When to use: Fact tables partitioned by date, event tables with a clear time boundary, any table where replacing a chunk is easier than merging individual rows.

-- Idempotent daily load
BEGIN;
  DELETE FROM fact_orders
  WHERE order_date = '2026-04-11';

  INSERT INTO fact_orders (order_id, order_date, customer_id, amount)
  SELECT order_id, order_date, customer_id, amount
  FROM staging_orders
  WHERE order_date = '2026-04-11';
COMMIT;
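The same pattern can be exercised end to end with Python's built-in sqlite3 module. The table names mirror the SQL above; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INT, order_date TEXT, customer_id INT, amount REAL)")
conn.execute("CREATE TABLE staging_orders (order_id INT, order_date TEXT, customer_id INT, amount REAL)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?, ?)",
    [(1, "2026-04-11", 100, 9.99), (2, "2026-04-11", 101, 25.00)],
)

def load_day(day):
    # DELETE + INSERT in one transaction: re-running replaces, never appends
    with conn:
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
        conn.execute(
            "INSERT INTO fact_orders SELECT * FROM staging_orders WHERE order_date = ?",
            (day,),
        )

load_day("2026-04-11")
load_day("2026-04-11")  # re-run: same final state

count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
assert count == 2  # not 4
```

The `with conn:` block makes sqlite3 commit both statements together, matching the BEGIN/COMMIT wrapper in the SQL version.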

Partition Overwrite

In Spark and Hive, INSERT OVERWRITE replaces an entire partition atomically. The old partition data is deleted and replaced with the new data in one operation. This is the idempotent pattern for large-scale data lakes. It is cleaner than DELETE + INSERT because the overwrite is atomic at the storage level.

When to use: Data lakes on S3/GCS/ADLS, Spark pipelines writing to Hive-partitioned tables, Databricks Delta Lake tables.

# Spark dynamic partition overwrite (PySpark)
# By default, mode("overwrite") replaces the whole table. Setting
# partitionOverwriteMode to "dynamic" restricts the overwrite to the
# partitions present in df.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write \
  .mode("overwrite") \
  .insertInto("fact_orders")  # table already partitioned by order_date

# Only the partitions present in df are overwritten
# Other partitions remain untouched
# Re-running with the same data produces the same result
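The semantics can be modeled in plain Python: think of the target as a mapping from partition key to the full set of rows for that partition, where a write replaces a partition wholesale. Everything here is an illustrative stand-in for the storage layer.

```python
# Conceptual model of partition overwrite: writing a partition replaces
# its contents wholesale, so re-running the same write is a no-op.
table = {
    "2026-04-10": [("o1", 10.0)],
    "2026-04-11": [("o2", 5.0), ("o2", 5.0)],  # bad data to replace
}

def overwrite_partition(table, day, rows):
    table[day] = list(rows)  # replace, never append

fresh = [("o2", 5.0), ("o3", 7.5)]
overwrite_partition(table, "2026-04-11", fresh)
overwrite_partition(table, "2026-04-11", fresh)  # re-run: identical state

assert table["2026-04-11"] == fresh
assert table["2026-04-10"] == [("o1", 10.0)]  # other partitions untouched
```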

INSERT ON CONFLICT (PostgreSQL)

PostgreSQL's INSERT ... ON CONFLICT DO UPDATE is a single-statement upsert. It attempts an INSERT and, if a unique constraint is violated, updates the existing row instead. This is more concise than MERGE and achieves the same idempotent behavior in one statement.

When to use: PostgreSQL-specific workloads, application-level upserts, small to medium batch loads.

INSERT INTO dim_product (product_id, name, category, updated_at)
VALUES ('SKU-001', 'Widget Pro', 'Hardware', NOW())
ON CONFLICT (product_id) DO UPDATE SET
  name = EXCLUDED.name,
  category = EXCLUDED.category,
  updated_at = EXCLUDED.updated_at;

-- First run: inserts the row
-- Second run: updates the same row with identical values
-- Result is the same either way
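SQLite shares PostgreSQL's ON CONFLICT syntax (since SQLite 3.24), so the behavior can be checked with nothing but the Python standard library. The product row is invented; the timestamp column is dropped to keep the check deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_product (product_id TEXT PRIMARY KEY, name TEXT, category TEXT)"
)

upsert = """
INSERT INTO dim_product (product_id, name, category)
VALUES (?, ?, ?)
ON CONFLICT (product_id) DO UPDATE SET
  name = excluded.name,
  category = excluded.category
"""

conn.execute(upsert, ("SKU-001", "Widget Pro", "Hardware"))  # first run: insert
conn.execute(upsert, ("SKU-001", "Widget Pro", "Hardware"))  # second run: update in place

rows = conn.execute("SELECT * FROM dim_product").fetchall()
assert rows == [("SKU-001", "Widget Pro", "Hardware")]  # one row either way
```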

Non-Idempotent Anti-Patterns

These patterns break idempotency. They are common in early-stage pipelines and are a frequent source of technical debt. Recognizing them is the first step to fixing them.

Bare INSERT without deduplication

A plain INSERT INTO ... SELECT appends rows every time it runs. If the pipeline runs twice, the data is doubled. If it runs three times, tripled. This is the most common idempotency violation and the easiest to prevent.

-- NOT idempotent
INSERT INTO fact_orders (order_id, order_date, amount)
SELECT order_id, order_date, amount
FROM staging_orders;
-- Running this twice doubles the data
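The doubling is easy to reproduce with sqlite3; the staged rows below are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INT, amount REAL)")
conn.execute("CREATE TABLE staging_orders (order_id INT, amount REAL)")
conn.execute("INSERT INTO staging_orders VALUES (1, 9.99), (2, 25.0)")

# The anti-pattern: a bare append that knows nothing about prior runs
append = "INSERT INTO fact_orders SELECT * FROM staging_orders"
conn.execute(append)
conn.execute(append)  # accidental re-run

count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
assert count == 4  # 2 staged rows became 4
```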

Fix: Switch to MERGE, DELETE + INSERT, or INSERT ON CONFLICT. If using Spark, use .mode('overwrite') instead of .mode('append').

Incrementing a counter without a check

Updating a running total by adding to it (SET count = count + batch_size) is not idempotent. Each execution adds the increment again. The counter drifts further from the correct value with every re-run.

-- NOT idempotent
UPDATE metrics SET page_views = page_views + 1500
WHERE metric_date = '2026-04-11';
-- Each re-run adds another 1500

-- Idempotent alternative
UPDATE metrics SET page_views = 1500
WHERE metric_date = '2026-04-11';

Fix: Replace the increment with an absolute value. Compute the correct total from source data and SET count = computed_total. Alternatively, use a ledger pattern where individual increments are rows in a table, and the total is always a SUM over the ledger.
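The ledger alternative can be sketched with sqlite3: each batch writes one row keyed by a batch id, so re-running the same batch collides on the key and is ignored instead of double-counted. The schema and batch ids are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_view_ledger ("
    "batch_id TEXT PRIMARY KEY, "  # one row per batch: re-runs collide here
    "metric_date TEXT, "
    "views INT)"
)

def record_batch(batch_id, day, views):
    # ON CONFLICT DO NOTHING makes recording a batch idempotent
    conn.execute(
        "INSERT INTO page_view_ledger VALUES (?, ?, ?) "
        "ON CONFLICT (batch_id) DO NOTHING",
        (batch_id, day, views),
    )

record_batch("2026-04-11-run1", "2026-04-11", 1500)
record_batch("2026-04-11-run1", "2026-04-11", 1500)  # retry: ignored
record_batch("2026-04-11-run2", "2026-04-11", 300)

total = conn.execute(
    "SELECT SUM(views) FROM page_view_ledger WHERE metric_date = ?",
    ("2026-04-11",),
).fetchone()[0]
assert total == 1800  # not 3300
```

The total is always a SUM over the ledger, so the counter can never drift from its underlying facts.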

3 Idempotency Interview Questions

These questions test whether you understand idempotency beyond the definition. Each one probes practical application and debugging skills.

Q1: What does idempotent mean in the context of data pipelines?

What they test:

Foundational understanding. The interviewer wants to hear that an idempotent pipeline produces the same output regardless of how many times it runs with the same input. The key phrase is 'same result.' Mention that this enables safe retries, backfills, and failure recovery. Give a concrete example of a non-idempotent operation (bare INSERT) and its idempotent alternative (DELETE + INSERT or MERGE).

Approach:

Define it in one sentence: 'Running the pipeline N times produces the same result as running it once.' Then give a practical example with a fact table load. Explain why it matters: retries, backfills, and concurrent execution all depend on idempotency.

Q2: How would you make this pipeline idempotent? It currently does INSERT INTO fact_events SELECT * FROM staging_events WHERE event_date = :date.

What they test:

Applied problem-solving. The current pipeline appends rows on every run. The interviewer wants you to propose a fix. There are three common approaches and the right one depends on the table structure and volume.

Approach:

Option 1: DELETE + INSERT. Delete fact_events where event_date = :date, then insert from staging. Wrap in a transaction. Option 2: MERGE on event_id (if a unique key exists). Option 3: Partition overwrite if using Spark or a data lake. Recommend DELETE + INSERT for this case because fact tables are typically partitioned by date and the pattern is simple and clear.

Q3: You discover that a pipeline has been running twice daily for a week due to a scheduler misconfiguration. The pipeline is not idempotent. How do you fix the data?

What they test:

Incident response and debugging. The interviewer wants to see your approach to identifying and removing duplicate data, then preventing the issue from recurring.

Approach:

Step 1: Identify the duplicates. Use ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY ingested_at) to find rows where rn > 1. Step 2: Delete the duplicates (keep the earliest or latest depending on the use case). Step 3: Make the pipeline idempotent using one of the patterns above. Step 4: Fix the scheduler to prevent double runs. Step 5: Add a data quality check that alerts on unexpected row count increases.
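Steps 1 and 2 can be rehearsed with sqlite3, which supports ROW_NUMBER. The natural key and ingest timestamps below are invented; this keeps the earliest copy of each event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_events (event_id INT, ingested_at TEXT)")
conn.executemany(
    "INSERT INTO fact_events VALUES (?, ?)",
    [  # the double-run left each event twice, at different ingest times
        (1, "2026-04-11 06:00"), (1, "2026-04-11 18:00"),
        (2, "2026-04-11 06:00"), (2, "2026-04-11 18:00"),
    ],
)

# Keep the earliest copy of each event (rn = 1), delete the rest
conn.execute("""
    DELETE FROM fact_events
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY event_id ORDER BY ingested_at
                   ) AS rn
            FROM fact_events
        )
        WHERE rn > 1
    )
""")

rows = conn.execute(
    "SELECT event_id, ingested_at FROM fact_events ORDER BY event_id"
).fetchall()
assert rows == [(1, "2026-04-11 06:00"), (2, "2026-04-11 06:00")]
```

In a warehouse without a hidden rowid, the same idea works by selecting the surviving rows into a new table, or by deleting on the full natural key plus the dedup timestamp.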

Idempotency FAQ

What does idempotent mean?

Idempotent means that performing an operation multiple times produces the same result as performing it once. In mathematics, f(f(x)) = f(x). In data engineering, an idempotent pipeline writes the same output regardless of how many times it runs with the same input. This property is critical because pipelines fail, get retried, and get re-run for backfills. Without idempotency, each re-run compounds errors by duplicating or corrupting data.

Why must data pipelines be idempotent?

Pipelines fail. Schedulers retry tasks automatically. Business requirements change and you need to backfill historical data. If a pipeline is not idempotent, retries produce duplicate rows, backfills corrupt existing data, and concurrent runs create inconsistencies. Idempotent pipelines can be safely re-run at any time without manual cleanup. This is not a nice-to-have; it is a requirement for any production data system.

What is the difference between idempotent and deterministic?

A deterministic function always produces the same output for the same input. An idempotent function produces the same result when applied multiple times. They are related but distinct. A pipeline that stamps rows with created_at = NOW() is not strictly deterministic, because the timestamps differ between runs, yet it can still be idempotent if it uses MERGE or DELETE + INSERT to prevent duplicates. Idempotency is about the final state of the target, not about every intermediate value.

Build Bulletproof Pipelines

Idempotency is a requirement for every production pipeline. Practice designing idempotent loads, handling failure recovery, and explaining your approach in mock interviews on DataDriven.