Pipeline Design
Idempotency means re-running a pipeline produces the same result. It is the single most important design property for production data pipelines because it makes failure recovery, backfills, and retries safe. Interviewers test it because it separates candidates who have operated real pipelines from those who have only built them.
Every pipeline fails. The question is what happens when you re-run it.
A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline is not idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times.
Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline is not idempotent, each re-run appends to existing data instead of replacing it. 90 days of duplicated rows. An idempotent pipeline overwrites or merges cleanly.
Airflow retries a failed task 3 times. If the task is not idempotent, each retry appends partial results. After 3 retries, you might have triple the data for the rows that succeeded. Idempotent tasks can be retried safely because repeated execution does not change the outcome.
Two instances of the same pipeline run at the same time (scheduler bug, manual trigger during an automated run). If the pipeline is not idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.
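The retry scenario can be reproduced in a few lines. A minimal sketch using Python's built-in sqlite3 (table names and rows are illustrative; the upsert syntax needs SQLite 3.24+), contrasting a blind append with a keyed upsert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = [(1, 9.99), (2, 25.00)]

# Non-idempotent: no key, plain INSERT. Each retry appends again.
conn.execute("CREATE TABLE orders_naive (order_id INTEGER, amount REAL)")
for _ in range(3):  # simulate three retries of the same task
    conn.executemany("INSERT INTO orders_naive VALUES (?, ?)", rows)

# Idempotent: upsert keyed on order_id. Retries are no-ops.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
for _ in range(3):
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

print(conn.execute("SELECT COUNT(*) FROM orders_naive").fetchone()[0])  # 6
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])        # 2
```

Same input, three runs: the append path has tripled the data while the upsert path holds steady at two rows.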
Four patterns achieve idempotency, each in a different way: MERGE (upsert), partition overwrite (delete + insert), append with versioning, and PostgreSQL's INSERT ... ON CONFLICT. Know all four so you can choose the right one for the use case.
Match on a unique key. If the row exists, update it. If it does not, insert it. This is the most common idempotent write pattern. It works for dimension tables, slowly changing dimensions, and any table with a natural or surrogate key.
When to use: Dimension tables, entity tables, any table where rows have unique identifiers and you want the latest version.
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);Watch out: MERGE in some engines is not atomic for concurrent executions. Two MERGE statements running simultaneously on the same target can produce duplicates if they both evaluate the WHEN NOT MATCHED clause for the same key. Use table-level locks or serialized execution to prevent this.
Delete all data in the target partition, then insert the new data. The partition is typically a date (daily, hourly). This approach is simple and reliable: each run completely replaces the partition, so re-runs produce exactly the same result. No key-matching logic needed.
When to use: Fact tables partitioned by date. Daily or hourly pipelines where you can afford to recompute the entire partition. This is the most common pattern for batch ELT in data warehouses.
-- Step 1: Delete the partition
DELETE FROM fact_orders
WHERE order_date = '2024-06-15';
-- Step 2: Insert fresh data
INSERT INTO fact_orders (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date
FROM staging_orders
WHERE order_date = '2024-06-15';
-- In some warehouses (BigQuery, Spark):
-- INSERT OVERWRITE fact_orders PARTITION (order_date = '2024-06-15')
-- SELECT ... FROM staging;
Watch out: The DELETE + INSERT is not atomic in most engines. If the pipeline fails between DELETE and INSERT, you lose data. Wrap the pair in a transaction, or use INSERT OVERWRITE, which is atomic. In BigQuery, use MERGE or the WRITE_TRUNCATE write disposition on the partition.
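A sketch of the transactional wrapper, using Python's built-in sqlite3 so it is runnable (schema and rows are illustrative; `with conn:` opens a transaction that commits on success and rolls back on error):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, order_date TEXT)")
conn.execute("INSERT INTO fact_orders VALUES (1, 10.0, '2024-06-15')")
conn.commit()

def overwrite_partition(rows, day):
    """Delete-then-insert for one date. The transaction makes the
    pair atomic: if the insert fails, the delete is rolled back."""
    try:
        with conn:
            conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
            conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    except Exception:
        pass  # the old partition contents survive the rollback

# Failing run: malformed row -> error -> rollback -> original row intact.
overwrite_partition([(2, 20.0)], "2024-06-15")
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 1

# Successful run, twice: same end state either way.
good = [(2, 20.0, "2024-06-15"), (3, 30.0, "2024-06-15")]
overwrite_partition(good, "2024-06-15")
overwrite_partition(good, "2024-06-15")
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 2
```

Without the transaction, the failing run would have deleted the original row and inserted nothing: the data-loss window the warning above describes.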
Instead of deleting data, mark it as superseded (soft delete) and insert new versions. Each record has a version or batch ID. Downstream queries filter to the latest version. This preserves history and avoids destructive operations.
When to use: When you need audit trails, when compliance requires you to keep all versions, or when the target system does not support DELETE efficiently (some data lakes).
-- Step 1: Mark existing records as superseded
UPDATE fact_orders
SET is_current = FALSE, superseded_at = CURRENT_TIMESTAMP
WHERE order_date = '2024-06-15'
AND is_current = TRUE;
-- Step 2: Insert new records as current
INSERT INTO fact_orders (order_id, customer_id, amount, order_date,
is_current, batch_id, loaded_at)
SELECT order_id, customer_id, amount, order_date,
TRUE, 'batch_20240615_v2', CURRENT_TIMESTAMP
FROM staging_orders
WHERE order_date = '2024-06-15';
Watch out: Downstream queries must always filter WHERE is_current = TRUE. If someone forgets the filter, they get duplicates. Enforce this with a view that wraps the table and applies the filter automatically.
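One way to enforce that filter, sketched in Python's built-in sqlite3 (names are illustrative): wrap the table in a view that applies is_current so readers cannot forget it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (
    order_id INTEGER, amount REAL, order_date TEXT,
    is_current INTEGER, batch_id TEXT
);
-- The view bakes in the filter so readers cannot forget it.
CREATE VIEW fact_orders_current AS
    SELECT order_id, amount, order_date
    FROM fact_orders WHERE is_current = 1;
""")

def load(day, rows, batch_id):
    """Supersede the current version of the partition, then append
    the new version; history is never deleted."""
    with conn:
        conn.execute(
            "UPDATE fact_orders SET is_current = 0 "
            "WHERE order_date = ? AND is_current = 1", (day,))
        conn.executemany(
            "INSERT INTO fact_orders VALUES (?, ?, ?, 1, ?)",
            [(oid, amt, day, batch_id) for oid, amt in rows])

load("2024-06-15", [(1, 10.0)], "v1")
load("2024-06-15", [(1, 12.0), (2, 20.0)], "v2")  # re-run with corrected data
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])          # 3: history kept
print(conn.execute("SELECT COUNT(*) FROM fact_orders_current").fetchone()[0])  # 2: latest only
```

Queries against fact_orders_current always see exactly one version; the base table keeps every batch for audits.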
INSERT ... ON CONFLICT is PostgreSQL's native upsert, and it was the only option before MERGE arrived in version 15. Specify the conflict target (usually a unique constraint) and what to do when a conflict is detected: update the existing row or do nothing.
When to use: PostgreSQL environments, especially when loading data incrementally from a staging table or an API.
INSERT INTO dim_product (product_id, name, price, updated_at)
SELECT product_id, name, price, NOW()
FROM staging_product
ON CONFLICT (product_id) DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price,
updated_at = EXCLUDED.updated_at;
Watch out: ON CONFLICT requires a unique constraint or unique index on the conflict columns. If no matching constraint exists, PostgreSQL rejects the statement with an error ("there is no unique or exclusion constraint matching the ON CONFLICT specification") rather than upserting. Verify the constraint exists before relying on ON CONFLICT.
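The same pattern runs in Python's built-in sqlite3, which shares the ON CONFLICT syntax (SQLite 3.24+; table names are illustrative). It also shows what happens when the target column has no unique constraint behind it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY is what ON CONFLICT(product_id) resolves against.
conn.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL)")

upsert = (
    "INSERT INTO dim_product (product_id, name, price) VALUES (?, ?, ?) "
    "ON CONFLICT(product_id) DO UPDATE SET name = excluded.name, price = excluded.price"
)
conn.execute(upsert, (1, "widget", 9.99))
conn.execute(upsert, (1, "widget", 10.99))  # re-run: update in place, no duplicate
print(conn.execute("SELECT COUNT(*), price FROM dim_product").fetchone())  # (1, 10.99)

# With no unique constraint on the target column, the statement is
# rejected outright instead of silently appending duplicates.
conn.execute("CREATE TABLE no_key (product_id INTEGER, name TEXT)")
err = None
try:
    conn.execute("INSERT INTO no_key VALUES (1, 'x') ON CONFLICT(product_id) DO NOTHING")
except sqlite3.OperationalError as exc:
    err = exc
print(err)  # the engine names the missing constraint
```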
Patterns that look correct but break on re-runs.
Problem: A pipeline runs INSERT INTO target SELECT ... FROM source. Re-running it appends the same rows again. After 3 runs, you have 3x the data.
Fix: Use MERGE or partition overwrite instead. If you must use INSERT, add a deduplication step: DELETE from target before INSERT, or use INSERT ... ON CONFLICT.
Problem: A pipeline skips rows where source.updated_at <= target.max_updated_at. This fails when the source clock drifts, when rows are updated without changing the timestamp, or when late-arriving data has old timestamps.
Fix: Use MERGE on the natural key instead of timestamp-based filtering. Timestamps are useful for optimization (reducing the scan window) but not reliable as the sole deduplication mechanism.
Problem: DELETE all rows, then INSERT new rows, without a transaction. If the INSERT fails, you have an empty table. Data loss.
Fix: Wrap DELETE + INSERT in a transaction. Or use INSERT OVERWRITE (atomic in BigQuery and Spark). Or use a temp table swap: insert into a temp table, then rename it to replace the original.
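The temp-table swap can be sketched in Python's built-in sqlite3 (names are illustrative). SQLite and PostgreSQL support transactional DDL, so the drop-and-rename pair can run inside one transaction; engines that auto-commit DDL need their own atomic rename instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; we manage transactions explicitly
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO fact_orders VALUES (1, 10.0)")

def rebuild(rows):
    """Build the replacement off to the side, then swap names in one
    transaction; a failure rolls back and the original table survives."""
    conn.execute("DROP TABLE IF EXISTS fact_orders_new")
    conn.execute("CREATE TABLE fact_orders_new (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO fact_orders_new VALUES (?, ?)", rows)
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE fact_orders")
        conn.execute("ALTER TABLE fact_orders_new RENAME TO fact_orders")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

rebuild([(2, 20.0), (3, 30.0)])
rebuild([(2, 20.0), (3, 30.0)])  # re-run: identical end state
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 2
```

Readers querying fact_orders see either the old table or the new one, never an empty table mid-rebuild.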
Problem: A pipeline updates three tables. The first two succeed, the third fails. On re-run, the pipeline processes all three again. The first two tables get duplicate updates.
Fix: Make each step independently idempotent. Use MERGE for each table so re-running a step that already succeeded is a no-op. Or use checkpointing: track which steps completed and skip them on retry.
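A minimal checkpointing sketch in Python's built-in sqlite3 (table and step names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints (run_id TEXT, step TEXT, PRIMARY KEY (run_id, step))")
executed = []

def run_step(run_id, step, fn):
    """Run fn once per (run_id, step); completed steps become no-ops,
    so retrying the whole pipeline resumes where it stopped."""
    if conn.execute("SELECT 1 FROM checkpoints WHERE run_id = ? AND step = ?",
                    (run_id, step)).fetchone():
        return
    fn()
    conn.execute("INSERT INTO checkpoints VALUES (?, ?)", (run_id, step))
    conn.commit()

run_step("2024-06-15", "load_orders", lambda: executed.append("orders"))
run_step("2024-06-15", "load_customers", lambda: executed.append("customers"))
# Retry after a failure elsewhere: neither completed step runs again.
run_step("2024-06-15", "load_orders", lambda: executed.append("orders"))
run_step("2024-06-15", "load_customers", lambda: executed.append("customers"))
print(executed)  # ['orders', 'customers']
```

The checkpoint must be committed in the same unit of work as the step's own writes (or each step must itself be idempotent), otherwise a crash between the write and the checkpoint reopens the duplicate-update problem.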
These test whether you can design, implement, and reason about idempotent pipelines.
Question: What does it mean for a pipeline to be idempotent?
What they test:
Basic definition. Re-running a pipeline with the same input produces the same output, regardless of how many times it runs. The state of the target is the same after 1 run or 100 runs.
Approach:
Give the definition, then explain why it matters: failure recovery, backfills, and retries all require idempotency. Without it, every pipeline failure risks data corruption.
Question: This pipeline runs INSERT INTO target SELECT FROM source. How do you make it safe to re-run?
What they test:
Practical application. The interviewer gives you a pipeline that does INSERT INTO target SELECT FROM source and asks you to make it safe for re-runs.
Approach:
Three options: (1) MERGE on the natural key. (2) DELETE + INSERT wrapped in a transaction. (3) Partition overwrite if the data is partitioned by date. Discuss which option fits the use case and why.
Question: When would you use MERGE versus INSERT OVERWRITE?
What they test:
Pattern selection. MERGE is row-level: match on key, update or insert. INSERT OVERWRITE is partition-level: replace all data in the partition.
Approach:
MERGE for dimension tables where you update individual rows. INSERT OVERWRITE for fact tables where you recompute entire partitions. MERGE preserves rows not in the source. INSERT OVERWRITE removes everything in the partition and replaces it.
Question: A pipeline ran twice and wrote duplicate data. How do you clean it up?
What they test:
Incident response and design awareness. The interviewer wants to see how you clean up AND how you prevent recurrence.
Approach:
If the pipeline was idempotent, there is nothing to fix. If it was not: identify duplicates using a unique key or batch_id, delete the extras (keep the earliest or latest based on loaded_at), then fix the pipeline to be idempotent so this cannot happen again.
Question: The transformation logic changed. How do you re-process the last 90 days?
What they test:
Whether your pipeline design supports safe backfills. An idempotent pipeline can be re-run for any date range. A non-idempotent pipeline creates duplicates.
Approach:
Parameterize the pipeline by date. Re-run it for each of the 90 dates. If the pipeline uses partition overwrite or MERGE, each re-run safely replaces the data for that date. Discuss parallelization: can you run 90 backfills concurrently, or do they need to run sequentially?
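The parameterized backfill can be sketched in Python's built-in sqlite3, with partition overwrite per day (names and the per-day load are stand-ins for the real extract/transform):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, order_date TEXT)")

def load_day(day):
    """Partition overwrite keyed by date: safe to re-run for any day."""
    with conn:
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
        # stand-in for the real extract/transform for that day
        conn.execute("INSERT INTO fact_orders VALUES (?, ?)", (1, day))

start = date(2024, 3, 18)  # illustrative start date
days = [(start + timedelta(days=i)).isoformat() for i in range(90)]
for d in days:   # the backfill
    load_day(d)
for d in days:   # an accidental second backfill: same end state
    load_day(d)
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 90
```

Because each run touches only its own date's partition, the 90 loads are independent and can usually run in parallel in a real orchestrator.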
Question: Write the MERGE to load a slowly changing dimension table.
What they test:
SQL MERGE syntax and SCD awareness. The interviewer may specify SCD Type 1 (overwrite) or Type 2 (add new row, expire old one).
Approach:
For Type 1: MERGE with UPDATE on match, INSERT on no match. For Type 2: when matched AND source values differ from target, expire the old row (set end_date, is_current = FALSE) and insert a new row. Standard MERGE does not allow INSERT in the MATCHED branch, so this takes two statements, or a single MERGE over a source unioned with itself so each changed row appears once as a match (to expire) and once as a non-match (to insert).
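The two-step Type 2 load can be sketched in Python's built-in sqlite3 (schema, emails, and dates are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, email TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER)""")

def scd2_load(customer_id, email, as_of):
    """Type 2 load in two steps: expire the current row if the
    tracked value changed, then insert the new version."""
    with conn:
        cur = conn.execute(
            "SELECT email FROM dim_customer "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,)).fetchone()
        if cur and cur[0] == email:
            return  # nothing changed: re-runs are no-ops
        if cur:
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_id = ? AND is_current = 1",
                (as_of, customer_id))
        conn.execute(
            "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, email, as_of))

scd2_load(1, "a@example.com", "2024-01-01")
scd2_load(1, "b@example.com", "2024-06-15")  # change: expire + insert
scd2_load(1, "b@example.com", "2024-06-15")  # re-run: no-op
print(conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])  # 2
```

The no-change check is what makes the load idempotent: without it, every re-run would expire the current row and insert an identical new version.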
Question: Can a MERGE itself produce duplicates?
What they test:
Deep understanding of MERGE semantics. In some engines, concurrent MERGE statements can produce duplicates when both evaluate the NOT MATCHED branch for the same key simultaneously.
Approach:
Discuss the race condition: two transactions both check for key = 123, both find it missing, both insert. The result: two rows for key = 123. Solutions: table-level locks, serialized execution, or using INSERT ... ON CONFLICT instead of MERGE in PostgreSQL.
Idempotency is the foundation of reliable data engineering. Every other design decision (monitoring, alerting, backfills) is easier when your pipelines are idempotent. Practice the patterns with real SQL.