Idempotent Data Pipelines: What Interviewers Want to Hear

Idempotency means re-running a pipeline with the same input produces the same result. It is the single most important design property for production data pipelines because it makes failure recovery, backfills, and retries safe. Interviewers test it because it separates candidates who have operated real pipelines from those who have only built them.

Why Idempotency Matters

Every pipeline fails. The question is what happens when you re-run it.

Pipeline Failure Recovery

A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline is not idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times.

Backfills

Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline is not idempotent, each re-run appends to existing data instead of replacing it. 90 days of duplicated rows. An idempotent pipeline overwrites or merges cleanly.

Reruns and Retries

Airflow retries a failed task 3 times. If the task is not idempotent, each attempt appends its partial results, so after 3 retries a row that succeeded on every attempt can appear four times. Idempotent tasks can be retried safely because repeated execution does not change the outcome.

Concurrent Execution

Two instances of the same pipeline run at the same time (scheduler bug, manual trigger during an automated run). If the pipeline is not idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.

Four Implementation Patterns

Each pattern achieves idempotency differently. Know all four so you can choose the right one for the use case.

MERGE / Upsert

Match on a unique key. If the row exists, update it. If it does not, insert it. This is the most common idempotent write pattern. It works for dimension tables, slowly changing dimensions, and any table with a natural or surrogate key.

When to use: Dimension tables, entity tables, any table where rows have unique identifiers and you want the latest version.

MERGE INTO dim_customer AS target
USING staging_customer AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    name = source.name,
    email = source.email,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

Watch out: MERGE in some engines is not atomic for concurrent executions. Two MERGE statements running simultaneously on the same target can produce duplicates if they both evaluate the WHEN NOT MATCHED clause for the same key. Use table-level locks or serialized execution to prevent this.

Partition Overwrite

Delete all data in the target partition, then insert the new data. The partition is typically a date (daily, hourly). This approach is simple and reliable: each run completely replaces the partition, so re-runs produce exactly the same result. No key-matching logic needed.

When to use: Fact tables partitioned by date. Daily or hourly pipelines where you can afford to recompute the entire partition. This is the most common pattern for batch ELT in data warehouses.

-- Step 1: Delete the partition
DELETE FROM fact_orders
WHERE order_date = '2024-06-15';

-- Step 2: Insert fresh data
INSERT INTO fact_orders (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date
FROM staging_orders
WHERE order_date = '2024-06-15';

-- In Spark and Hive, partition overwrite is a single statement:
-- INSERT OVERWRITE TABLE fact_orders PARTITION (order_date = '2024-06-15')
-- SELECT ... FROM staging_orders;

Watch out: The DELETE + INSERT pair is not atomic in most engines. If the pipeline fails between the DELETE and the INSERT, the partition is left empty. Wrap both statements in a transaction, or use INSERT OVERWRITE, which replaces the partition atomically. In BigQuery, use MERGE or the WRITE_TRUNCATE write disposition on the partition.
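The transaction-wrapped variant can be sketched in Python, with SQLite standing in for the warehouse (table and column names are illustrative). A run that fails mid-write rolls back and leaves the old partition untouched:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, order_date TEXT)")
con.execute("INSERT INTO fact_orders VALUES (1, 10.0, '2024-06-15')")
con.commit()

def overwrite_partition(rows, day):
    """Replace one date partition: DELETE + INSERT inside a single transaction."""
    try:
        with con:  # commits on success, rolls back on any exception
            con.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
            con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    except Exception:
        pass  # the rollback leaves the partition exactly as it was

# A failing run (malformed row) must not wipe the partition.
overwrite_partition([(2, 20.0, "2024-06-15"), "not a row"], "2024-06-15")
survivors = con.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(survivors)  # the original row is still there
```

Without the transaction, the DELETE would have committed before the INSERT failed, leaving the partition empty.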

Tombstone and Replace

Instead of deleting data, mark it as superseded (soft delete) and insert new versions. Each record has a version or batch ID. Downstream queries filter to the latest version. This preserves history and avoids destructive operations.

When to use: When you need audit trails, when compliance requires you to keep all versions, or when the target system does not support DELETE efficiently (some data lakes).

-- Step 1: Mark existing records as superseded
UPDATE fact_orders
SET is_current = FALSE, superseded_at = CURRENT_TIMESTAMP
WHERE order_date = '2024-06-15'
  AND is_current = TRUE;

-- Step 2: Insert new records as current
INSERT INTO fact_orders (order_id, customer_id, amount, order_date,
                         is_current, batch_id, loaded_at)
SELECT order_id, customer_id, amount, order_date,
       TRUE, 'batch_20240615_v2', CURRENT_TIMESTAMP
FROM staging_orders
WHERE order_date = '2024-06-15';

Watch out: Downstream queries must always filter WHERE is_current = TRUE. If someone forgets the filter, they get duplicates. Enforce this with a view that wraps the table and applies the filter automatically.
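A minimal sketch of the view-based enforcement, again using SQLite as a stand-in (names are illustrative). Downstream users query the view, so the `is_current` filter cannot be forgotten:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_orders (order_id INTEGER, amount REAL, is_current INTEGER);
-- Two versions of order 1: only the latest is current.
INSERT INTO fact_orders VALUES (1, 10.0, 0), (1, 12.0, 1);
-- The view is the only object downstream queries touch.
CREATE VIEW fact_orders_current AS
  SELECT order_id, amount FROM fact_orders WHERE is_current = 1;
""")
rows = con.execute("SELECT order_id, amount FROM fact_orders_current").fetchall()
print(rows)  # only the current version is visible
```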

INSERT ... ON CONFLICT (PostgreSQL Upsert)

PostgreSQL's native upsert, available since 9.5 and therefore long before MERGE arrived in v15. Specify the conflict target (usually the columns of a unique constraint) and what to do when a conflict is detected: update the existing row or do nothing.

When to use: PostgreSQL environments, especially when loading data incrementally from a staging table or an API.

INSERT INTO dim_product (product_id, name, price, updated_at)
SELECT product_id, name, price, NOW()
FROM staging_product
ON CONFLICT (product_id) DO UPDATE SET
  name = EXCLUDED.name,
  price = EXCLUDED.price,
  updated_at = EXCLUDED.updated_at;

Watch out: ON CONFLICT (column) requires a unique constraint or unique index on the conflict columns; if none exists, PostgreSQL rejects the statement with an error. A bare ON CONFLICT DO NOTHING (no conflict target) is riskier: with no unique constraint in place, nothing ever conflicts, every insert succeeds, and you get duplicates. Always verify the constraint exists before relying on ON CONFLICT.

Idempotency Anti-Patterns

Patterns that look correct but break on re-runs.

INSERT without dedup

Problem: A pipeline runs INSERT INTO target SELECT ... FROM source. Re-running it appends the same rows again. After 3 runs, you have 3x the data.

Fix: Use MERGE or partition overwrite instead. If you must use INSERT, add a deduplication step: DELETE from target before INSERT, or use INSERT ... ON CONFLICT.
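The upsert fix can be demonstrated with SQLite (which supports `INSERT ... ON CONFLICT`; table names are illustrative). Running the same load twice leaves the same final state:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")

def load(rows):
    # ON CONFLICT on the primary key turns a re-run into an update, not an append.
    con.executemany(
        "INSERT INTO dim_product VALUES (?, ?) "
        "ON CONFLICT (product_id) DO UPDATE SET name = excluded.name",
        rows,
    )
    con.commit()

batch = [(1, "widget"), (2, "gadget")]
load(batch)
load(batch)  # re-run: same input, same final state
count = con.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
print(count)  # 2, not 4
```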

Relying on source timestamps for dedup

Problem: A pipeline skips rows where source.updated_at <= target.max_updated_at. This fails when the source clock drifts, when rows are updated without changing the timestamp, or when late-arriving data has old timestamps.

Fix: Use MERGE on the natural key instead of timestamp-based filtering. Timestamps are useful for optimization (reducing the scan window) but not reliable as the sole deduplication mechanism.
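A small SQLite demo of the failure mode (schema and values are illustrative): a row whose value changes without a timestamp bump is invisible to the timestamp filter, but a key-based upsert still reconciles it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE source (id INTEGER, val TEXT, updated_at TEXT);
CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT, updated_at TEXT);
INSERT INTO source VALUES (1, 'old', '2024-06-01');
INSERT INTO target VALUES (1, 'old', '2024-06-01');
-- The row changes but its timestamp does not (a common upstream bug).
UPDATE source SET val = 'new';
""")

# Timestamp-based filter: finds nothing newer, silently misses the change.
missed = con.execute("""
    SELECT COUNT(*) FROM source s JOIN target t ON s.id = t.id
    WHERE s.updated_at > t.updated_at
""").fetchone()[0]

# Key-based upsert: always reconciles on the key, picks up the change.
# (SQLite needs the WHERE clause when combining INSERT...SELECT with ON CONFLICT.)
con.execute("""
    INSERT INTO target SELECT * FROM source WHERE true
    ON CONFLICT (id) DO UPDATE SET val = excluded.val,
                                   updated_at = excluded.updated_at
""")
val = con.execute("SELECT val FROM target WHERE id = 1").fetchone()[0]
print(missed, val)  # 0 'new'
```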

Non-atomic delete-then-insert

Problem: DELETE all rows, then INSERT new rows, without a transaction. If the INSERT fails, you have an empty table. Data loss.

Fix: Wrap DELETE + INSERT in a transaction. Or use INSERT OVERWRITE (atomic in Spark and Hive). Or use a temp table swap: insert into a new table, then rename it to replace the original.
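The table-swap variant, sketched with SQLite (names are illustrative). The replacement is built off to the side, so readers see the old table until the swap, and the swap itself is a single transaction:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
con.execute("INSERT INTO fact_orders VALUES (1, 10.0)")

# Build the replacement separately; a failure here leaves the live table intact.
con.execute("CREATE TABLE fact_orders_new (order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO fact_orders_new VALUES (?, ?)", [(1, 11.0), (2, 20.0)])

# Swap in one transaction: drop the old table, rename the new one into place.
with con:
    con.execute("DROP TABLE fact_orders")
    con.execute("ALTER TABLE fact_orders_new RENAME TO fact_orders")

rows = con.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
print(rows)  # [(1, 11.0), (2, 20.0)]
```

In warehouses that support it, an atomic RENAME or table swap gives the same effect without readers ever seeing a half-written table.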

Ignoring partial failure in multi-step pipelines

Problem: A pipeline updates three tables. The first two succeed, the third fails. On re-run, the pipeline processes all three again. The first two tables get duplicate updates.

Fix: Make each step independently idempotent. Use MERGE for each table so re-running a step that already succeeded is a no-op. Or use checkpointing: track which steps completed and skip them on retry.
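The checkpointing idea can be sketched as a completion table consulted before each step (SQLite stand-in; `run_id`/`step` naming is illustrative). On retry, already-completed steps become no-ops:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE checkpoints (run_id TEXT, step TEXT, PRIMARY KEY (run_id, step))")
executed = []

def run_step(run_id, step, action):
    """Run a step once per run_id; completed steps are skipped on retry."""
    done = con.execute(
        "SELECT 1 FROM checkpoints WHERE run_id = ? AND step = ?", (run_id, step)
    ).fetchone()
    if done:
        return
    action()
    con.execute("INSERT INTO checkpoints VALUES (?, ?)", (run_id, step))
    con.commit()

# First attempt: steps a and b complete, then the pipeline dies before c.
run_step("2024-06-15", "a", lambda: executed.append("a"))
run_step("2024-06-15", "b", lambda: executed.append("b"))

# Retry: a and b are skipped, only c actually runs.
run_step("2024-06-15", "a", lambda: executed.append("a"))
run_step("2024-06-15", "b", lambda: executed.append("b"))
run_step("2024-06-15", "c", lambda: executed.append("c"))
print(executed)  # ['a', 'b', 'c']
```

Note the checkpoint is written only after the step succeeds, so a crash mid-step still re-runs that step; the step itself must therefore also be idempotent.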

7 Idempotency Interview Questions

These test whether you can design, implement, and reason about idempotent pipelines.

Q1: What does idempotency mean in the context of data pipelines?

What they test:

Basic definition. Re-running a pipeline with the same input produces the same output, regardless of how many times it runs. The state of the target is the same after 1 run or 100 runs.

Approach:

Give the definition, then explain why it matters: failure recovery, backfills, and retries all require idempotency. Without it, every pipeline failure risks data corruption.

Q2: How would you make this INSERT-based pipeline idempotent?

What they test:

Practical application. The interviewer gives you a pipeline that does INSERT INTO target SELECT FROM source and asks you to make it safe for re-runs.

Approach:

Three options: (1) MERGE on the natural key. (2) DELETE + INSERT wrapped in a transaction. (3) Partition overwrite if the data is partitioned by date. Discuss which option fits the use case and why.

Q3: Explain the difference between MERGE and INSERT OVERWRITE. When would you use each?

What they test:

Pattern selection. MERGE is row-level: match on key, update or insert. INSERT OVERWRITE is partition-level: replace all data in the partition.

Approach:

MERGE for dimension tables where you update individual rows. INSERT OVERWRITE for fact tables where you recompute entire partitions. MERGE preserves rows not in the source. INSERT OVERWRITE removes everything in the partition and replaces it.

Q4: Your pipeline ran three times due to a scheduler bug. How do you fix the duplicated data?

What they test:

Incident response and design awareness. The interviewer wants to see how you clean up AND how you prevent recurrence.

Approach:

If the pipeline was idempotent, there is nothing to fix. If it was not: identify duplicates using a unique key or batch_id, delete the extras (keep the earliest or latest based on loaded_at), then fix the pipeline to be idempotent so this cannot happen again.
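The cleanup step can be sketched with SQLite (values are illustrative): keep one copy per key, delete the rest. Here insertion order matches `loaded_at`, so keeping the highest rowid keeps the latest load:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, loaded_at TEXT)")
# A scheduler bug ran the load three times for order 1.
con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", [
    (1, 10.0, "01:00"), (1, 10.0, "02:00"), (1, 10.0, "03:00"),
    (2, 20.0, "01:00"),
])
con.commit()

# Keep one row per key (here: the last-loaded copy), delete the extras.
con.execute("""
    DELETE FROM fact_orders
    WHERE rowid NOT IN (SELECT MAX(rowid) FROM fact_orders GROUP BY order_id)
""")
con.commit()
remaining = con.execute(
    "SELECT order_id, loaded_at FROM fact_orders ORDER BY order_id").fetchall()
print(remaining)  # [(1, '03:00'), (2, '01:00')]
```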

Q5: How do you backfill 90 days of data in a production pipeline?

What they test:

Whether your pipeline design supports safe backfills. An idempotent pipeline can be re-run for any date range. A non-idempotent pipeline creates duplicates.

Approach:

Parameterize the pipeline by date. Re-run it for each of the 90 dates. If the pipeline uses partition overwrite or MERGE, each re-run safely replaces the data for that date. Discuss parallelization: can you run 90 backfills concurrently, or do they need to run sequentially?
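The date-parameterized backfill loop can be sketched with SQLite (schema is illustrative). Because each run overwrites its partition, re-running the whole backfill is harmless:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE staging_orders (order_id INTEGER, amount REAL, order_date TEXT);
CREATE TABLE fact_orders (order_id INTEGER, amount REAL, order_date TEXT);
INSERT INTO staging_orders VALUES
  (1, 10.0, '2024-06-14'), (2, 20.0, '2024-06-14'), (3, 30.0, '2024-06-15');
""")

def run_for_date(day):
    """One idempotent daily run: overwrite the date partition."""
    with con:
        con.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
        con.execute(
            "INSERT INTO fact_orders SELECT * FROM staging_orders WHERE order_date = ?",
            (day,))

dates = ["2024-06-14", "2024-06-15"]
for day in dates:   # the backfill
    run_for_date(day)
for day in dates:   # accidentally backfilling again is harmless
    run_for_date(day)
total = con.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(total)  # 3, no duplicates
```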

Q6: Write a MERGE statement that handles both inserts and updates for a slowly changing dimension.

What they test:

SQL MERGE syntax and SCD awareness. The interviewer may specify SCD Type 1 (overwrite) or Type 2 (add new row, expire old one).

Approach:

For Type 1: MERGE with UPDATE on match, INSERT on no match. For Type 2: when matched AND source values differ from target, expire the old row (set end_date, is_current = FALSE) and insert a new row. Standard MERGE cannot INSERT in the MATCHED branch, so this takes two statements, or a MERGE whose source unions in the changed rows so the new versions arrive through the NOT MATCHED branch.
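The two-statement Type 2 variant can be sketched with SQLite, which has no MERGE (schema and dates are illustrative). Step 1 expires current rows whose attributes changed; step 2 inserts the new versions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER, email TEXT,
                           is_current INTEGER, valid_from TEXT, valid_to TEXT);
CREATE TABLE staging_customer (customer_id INTEGER, email TEXT);
INSERT INTO dim_customer VALUES (1, 'old@x.com', 1, '2024-01-01', NULL);
INSERT INTO staging_customer VALUES (1, 'new@x.com');
""")

with con:
    # Step 1: expire current rows whose attributes changed.
    con.execute("""
        UPDATE dim_customer SET is_current = 0, valid_to = '2024-06-15'
        WHERE is_current = 1 AND customer_id IN (
            SELECT s.customer_id FROM staging_customer s
            JOIN dim_customer d ON d.customer_id = s.customer_id
            WHERE d.is_current = 1 AND d.email <> s.email)
    """)
    # Step 2: insert the new version as the current row where none exists.
    con.execute("""
        INSERT INTO dim_customer
        SELECT s.customer_id, s.email, 1, '2024-06-15', NULL
        FROM staging_customer s
        WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                          WHERE d.customer_id = s.customer_id AND d.is_current = 1)
    """)

current = con.execute(
    "SELECT email FROM dim_customer WHERE customer_id = 1 AND is_current = 1"
).fetchone()[0]
print(current)  # new@x.com
```

Re-running both statements is a no-op (the current row now matches staging, so nothing is expired and nothing is inserted), which makes the load idempotent.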

Q7: What are the risks of using MERGE in a concurrent environment?

What they test:

Deep understanding of MERGE semantics. In some engines, concurrent MERGE statements can produce duplicates when both evaluate the NOT MATCHED branch for the same key simultaneously.

Approach:

Discuss the race condition: two transactions both check for key = 123, both find it missing, both insert. The result: two rows for key = 123. Solutions: table-level locks, serialized execution, or using INSERT ... ON CONFLICT instead of MERGE in PostgreSQL.

Idempotency FAQ

What does idempotent mean in data engineering?

An idempotent pipeline produces the same result regardless of how many times it runs with the same input. Running it once or ten times leaves the target in the same state. This property makes failure recovery, backfills, and retries safe.

Is MERGE the best way to achieve idempotency?

MERGE is the most common approach for row-level idempotency (dimension tables, entity tables). Partition overwrite is simpler and more appropriate for fact tables partitioned by date. The best approach depends on your table structure and update pattern.

Does every pipeline need to be idempotent?

In production, yes. Every pipeline will fail eventually. When it does, you need to re-run it safely. Non-idempotent pipelines create data quality issues on every failure and make backfills dangerous. The upfront cost of making a pipeline idempotent is small compared to the cost of cleaning up duplicates in production.

How does idempotency relate to exactly-once processing?

They are related but different. Exactly-once processing means each event is processed exactly one time. Idempotency means processing an event more than once produces the same result as processing it once. In practice, exactly-once is achieved through at-least-once delivery plus idempotent processing. The idempotent sink guarantees that duplicate deliveries do not corrupt the output.

What is the difference between idempotent and deterministic?

A deterministic pipeline produces the same output given the same input. An idempotent pipeline produces the same target state regardless of how many times it runs. A pipeline can be deterministic but not idempotent (if it appends results on each run). And a pipeline can be idempotent but not deterministic (if it uses CURRENT_TIMESTAMP, each run produces different timestamps but the same logical state via MERGE).

Build Pipelines That Survive Failure

Idempotency is the foundation of reliable data engineering. Every other design decision (monitoring, alerting, backfills) is easier when your pipelines are idempotent. Practice the patterns with real SQL.