Idempotent means running an operation multiple times produces the same result as running it once. For data engineers, this is not a theoretical concept. It is the difference between a pipeline that recovers gracefully from failures and one that corrupts your data every time it retries. This page covers what idempotency means, why pipelines must be idempotent, the patterns that achieve it, and the interview questions that test your understanding.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Four real-world scenarios where non-idempotent pipelines break. Each one is preventable with the right write pattern.
A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline is not idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times. You re-run with confidence.
Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline is not idempotent, each re-run appends to existing data instead of replacing it. Ninety days of duplicated rows. An idempotent pipeline overwrites or merges cleanly, so backfills work without manual cleanup.
Airflow can retry a failed task automatically via the retries task parameter (the default is 0, but many deployments set it to 3 or more). If the task is not idempotent, each retry appends partial results. After three retries you might have several copies of the rows that succeeded before each failure. Idempotent tasks can be retried safely because repeated execution does not change the outcome beyond the first successful run.
Two instances of the same pipeline run at the same time because of a scheduler bug or a manual trigger during an automated run. If the pipeline is not idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.
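All four scenarios come down to the same contrast: an append-only write drifts with every re-run, while a keyed replacement converges. A minimal sketch of that contrast, using sqlite3 with illustrative table and column names (fact_orders, staging rows are assumptions, not from any real system):

```python
import sqlite3

# Toy fact table to show why re-runs need an idempotent write pattern.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, order_date TEXT, amount REAL)"
)

staging = [(1, "2026-04-11", 10.0), (2, "2026-04-11", 20.0)]

def load_append(conn, rows):
    # NOT idempotent: every run appends the same rows again.
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

def load_delete_insert(conn, rows, day):
    # Idempotent: delete the day's slice, then insert fresh data.
    conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

load_append(conn, staging)
load_append(conn, staging)  # simulated retry
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 4: doubled

load_delete_insert(conn, staging, "2026-04-11")
load_delete_insert(conn, staging, "2026-04-11")  # retry is now safe
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 2
```

The second run of load_delete_insert removes exactly what the first run wrote before re-inserting it, which is what makes retries, backfills, and double-triggered runs safe.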
Four patterns that make pipeline writes idempotent. Each one ensures that re-running with the same input produces the same target state.
Match on a unique key. If the row exists, update it. If it does not, insert it. This is the most common idempotent write pattern for dimension tables and entity tables. Re-running the same MERGE with the same source data produces the same target state every time.
When to use: Dimension tables, entity tables, any table with unique identifiers where you want the latest version of each row.
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
    CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);
Delete all rows for a given partition or key range, then insert the fresh data. The combination is idempotent because running it again deletes what was just inserted and re-inserts the same rows. This pattern is simpler than MERGE and works well for fact tables with natural partitions like date.
When to use: Fact tables partitioned by date, event tables with a clear time boundary, any table where replacing a chunk is easier than merging individual rows.
-- Idempotent daily load
BEGIN;
DELETE FROM fact_orders
WHERE order_date = '2026-04-11';
INSERT INTO fact_orders (order_id, order_date, customer_id, amount)
SELECT order_id, order_date, customer_id, amount
FROM staging_orders
WHERE order_date = '2026-04-11';
COMMIT;
In Spark and Hive, INSERT OVERWRITE replaces an entire partition in a single operation: the old partition data is dropped and replaced with the new data. This is the standard idempotent pattern for large-scale data lakes, and it is simpler than DELETE + INSERT because the replacement is one statement rather than two. Whether the overwrite is truly atomic depends on the table format: transactional formats like Delta Lake make it atomic, while plain Hive tables on object storage do not guarantee it.
When to use: Data lakes on S3/GCS/ADLS, Spark pipelines writing to Hive-partitioned tables, Databricks Delta Lake tables.
# Spark partition overwrite (dynamic mode)
# With the default "static" mode, mode("overwrite") replaces the
# whole table, so enable dynamic partition overwrite first:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write \
    .mode("overwrite") \
    .insertInto("fact_orders")  # table must already exist, partitioned by order_date
# Only the partitions present in df are overwritten
# Other partitions remain untouched
# Re-running with the same data produces the same result
PostgreSQL's INSERT ... ON CONFLICT DO UPDATE is a single-statement upsert. It attempts an INSERT and, if a unique constraint is violated, updates the existing row instead. This is more concise than MERGE and achieves the same idempotent behavior in one statement.
When to use: PostgreSQL-specific workloads, application-level upserts, small to medium batch loads.
INSERT INTO dim_product (product_id, name, category, updated_at)
VALUES ('SKU-001', 'Widget Pro', 'Hardware', NOW())
ON CONFLICT (product_id) DO UPDATE SET
name = EXCLUDED.name,
category = EXCLUDED.category,
updated_at = EXCLUDED.updated_at;
-- First run: inserts the row
-- Second run: updates the same row with identical values
-- Result is the same either way
These patterns break idempotency. They show up constantly in early-stage pipelines and linger as technical debt. Recognizing them is the first step to fixing them.
A plain INSERT INTO ... SELECT appends rows every time it runs. If the pipeline runs twice, the data is doubled. If it runs three times, tripled. This is the most common idempotency violation and the easiest to prevent.
-- NOT idempotent
INSERT INTO fact_orders (order_id, order_date, amount)
SELECT order_id, order_date, amount
FROM staging_orders;
-- Running this twice doubles the data
Fix: Switch to MERGE, DELETE + INSERT, or INSERT ON CONFLICT. If using Spark, use .mode('overwrite') instead of .mode('append').
Updating a running total by adding to it (SET count = count + batch_size) is not idempotent. Each execution adds the increment again. The counter drifts further from the correct value with every re-run.
-- NOT idempotent
UPDATE metrics SET page_views = page_views + 1500
WHERE metric_date = '2026-04-11';
-- Each re-run adds another 1500
-- Idempotent alternative
UPDATE metrics SET page_views = 1500
WHERE metric_date = '2026-04-11';
Fix: Replace the increment with an absolute value. Compute the correct total from source data and SET count = computed_total. Alternatively, use a ledger pattern where individual increments are rows in a table, and the total is always a SUM over the ledger.
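The ledger alternative can be sketched as follows. This is a minimal illustration, not a prescribed schema: the table name page_view_ledger and the batch_id key are assumptions. Each load run writes one row keyed by its batch id, upserted so a retry replaces the row instead of incrementing a counter, and the total is always a SUM over the ledger:

```python
import sqlite3

# Hedged sketch of the ledger pattern (illustrative names throughout).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_view_ledger (
        batch_id    TEXT PRIMARY KEY,  -- unique id of the load run
        metric_date TEXT,
        page_views  INTEGER
    )
""")

def record_batch(conn, batch_id, day, views):
    # Upsert on batch_id: re-running the same batch replaces its row
    # instead of incrementing a counter, so retries cannot drift.
    conn.execute(
        "INSERT INTO page_view_ledger (batch_id, metric_date, page_views) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (batch_id) DO UPDATE SET page_views = excluded.page_views",
        (batch_id, day, views),
    )

record_batch(conn, "2026-04-11-run1", "2026-04-11", 1500)
record_batch(conn, "2026-04-11-run1", "2026-04-11", 1500)  # retry: no drift

total = conn.execute(
    "SELECT SUM(page_views) FROM page_view_ledger WHERE metric_date = ?",
    ("2026-04-11",),
).fetchone()[0]
print(total)  # 1500, no matter how many times the batch is retried
```

The ledger also preserves an audit trail: each run's contribution stays visible as its own row instead of being folded into an opaque counter.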
These questions test whether you understand idempotency beyond the definition. Each one probes practical application and debugging skills.
What they test:
Foundational understanding. The interviewer wants to hear that an idempotent pipeline produces the same output regardless of how many times it runs with the same input. The key phrase is 'same result.' Mention that this enables safe retries, backfills, and failure recovery. Give a concrete example of a non-idempotent operation (bare INSERT) and its idempotent alternative (DELETE + INSERT or MERGE).
Approach:
Define it in one sentence: 'Running the pipeline N times produces the same result as running it once.' Then give a practical example with a fact table load. Explain why it matters: retries, backfills, and concurrent execution all depend on idempotency.
What they test:
Applied problem-solving. The current pipeline appends rows on every run. The interviewer wants you to propose a fix. There are three common approaches and the right one depends on the table structure and volume.
Approach:
Option 1: DELETE + INSERT. Delete fact_events where event_date = :date, then insert from staging. Wrap in a transaction. Option 2: MERGE on event_id (if a unique key exists). Option 3: Partition overwrite if using Spark or a data lake. Recommend DELETE + INSERT for this case because fact tables are typically partitioned by date and the pattern is simple and clear.
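Option 2 can be sketched with an upsert keyed on event_id. This is an illustrative sqlite3 sketch, not the interview's expected schema; the fact_events columns are assumptions:

```python
import sqlite3

# Hedged sketch: make an event load idempotent by upserting on a
# unique event_id (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_events (
        event_id   TEXT PRIMARY KEY,
        event_date TEXT,
        payload    TEXT
    )
""")

staging = [("e1", "2026-04-11", "click"), ("e2", "2026-04-11", "view")]

def load_events(conn, rows):
    # Re-running with the same staging rows rewrites the same target rows.
    conn.executemany(
        "INSERT INTO fact_events (event_id, event_date, payload) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (event_id) DO UPDATE SET "
        "event_date = excluded.event_date, payload = excluded.payload",
        rows,
    )

load_events(conn, staging)
load_events(conn, staging)  # re-run: same target state
print(conn.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0])  # 2
```

In the interview, the point to make is that the choice between this and DELETE + INSERT hinges on whether a reliable unique key exists and how large the reprocessed slice is.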
What they test:
Incident response and debugging. The interviewer wants to see your approach to identifying and removing duplicate data, then preventing the issue from recurring.
Approach:
Step 1: Identify the duplicates. Use ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY ingested_at) to find rows where rn > 1. Step 2: Delete the duplicates (keep the earliest or latest depending on the use case). Step 3: Make the pipeline idempotent using one of the patterns above. Step 4: Fix the scheduler to prevent double runs. Step 5: Add a data quality check that alerts on unexpected row count increases.
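Steps 1 and 2 of the cleanup can be sketched in sqlite3 (requires SQLite 3.25+ for window functions; table and column names are illustrative). Rank rows per natural key by ingestion time, then delete everything past the first copy:

```python
import sqlite3

# Hedged sketch of deduplication with ROW_NUMBER (illustrative names).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_events (
        event_id    TEXT,   -- natural key
        ingested_at TEXT,
        payload     TEXT
    )
""")
conn.executemany(
    "INSERT INTO fact_events VALUES (?, ?, ?)",
    [
        ("e1", "2026-04-11T00:00", "click"),
        ("e1", "2026-04-11T01:00", "click"),  # duplicate from a double run
        ("e2", "2026-04-11T00:00", "view"),
    ],
)

# Keep the earliest copy of each event_id, delete the rest.
conn.execute("""
    DELETE FROM fact_events
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY event_id
                       ORDER BY ingested_at
                   ) AS rn
            FROM fact_events
        )
        WHERE rn > 1
    )
""")
print(conn.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0])  # 2
```

Ordering by ingested_at descending instead would keep the latest copy; which one is correct depends on whether later loads can carry corrected values.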
Idempotency is a requirement for every production pipeline. Practice designing idempotent loads, handling failure recovery, and explaining your approach in mock interviews on DataDriven.