Pipeline Design
Idempotency means re-running a pipeline produces the same result. It is the single most important design property for production data pipelines because it makes failure recovery, backfills, and retries safe. Interviewers test it because it separates candidates who have operated real pipelines from those who have only built them.
Every pipeline fails. The question is what happens when you re-run it.
A pipeline fails halfway through. Some rows were written, others were not. You fix the bug and re-run. If the pipeline is not idempotent, the already-written rows get duplicated. Now your data is wrong in a different way. An idempotent pipeline produces the same result whether it runs once or ten times.
Your stakeholder needs data re-processed for the last 90 days because the transformation logic changed. You re-run the pipeline for each day. If the pipeline is not idempotent, each re-run appends to existing data instead of replacing it. 90 days of duplicated rows. An idempotent pipeline overwrites or merges cleanly.
Airflow retries a failed task 3 times. If the task is not idempotent, each retry appends partial results. After 3 retries, you might have triple the data for the rows that succeeded. Idempotent tasks can be retried safely because repeated execution does not change the outcome.
Two instances of the same pipeline run at the same time (scheduler bug, manual trigger during an automated run). If the pipeline is not idempotent, both instances write their data, producing duplicates. An idempotent pipeline with proper locking or MERGE semantics handles this gracefully.
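The retry scenario can be reproduced in a few lines. A minimal sketch using Python's built-in sqlite3 (table names and rows are illustrative; the upsert syntax needs SQLite 3.24+), contrasting a blind append with a keyed upsert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = [(1, 9.99), (2, 25.00)]

# Non-idempotent: no key, plain INSERT. Each retry appends again.
conn.execute("CREATE TABLE orders_naive (order_id INTEGER, amount REAL)")
for _ in range(3):  # simulate three retries of the same task
    conn.executemany("INSERT INTO orders_naive VALUES (?, ?)", rows)

# Idempotent: upsert keyed on order_id. Retries are no-ops.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
for _ in range(3):
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

print(conn.execute("SELECT COUNT(*) FROM orders_naive").fetchone()[0])  # 6
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])        # 2
```

Same input, three runs: the append path has tripled the data while the upsert path holds steady at two rows.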
Four patterns achieve idempotency, each in a different way: MERGE (upsert), partition overwrite (delete + insert), append with versioning, and PostgreSQL's INSERT ... ON CONFLICT. Know all four so you can choose the right one for the use case.
Match on a unique key. If the row exists, update it. If it does not, insert it. This is the most common idempotent write pattern. It works for dimension tables, slowly changing dimensions, and any table with a natural or surrogate key.
When to use: Dimension tables, entity tables, any table where rows have unique identifiers and you want the latest version.
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);Watch out: MERGE in some engines is not atomic for concurrent executions. Two MERGE statements running simultaneously on the same target can produce duplicates if they both evaluate the WHEN NOT MATCHED clause for the same key. Use table-level locks or serialized execution to prevent this.
Delete all data in the target partition, then insert the new data. The partition is typically a date (daily, hourly). This approach is simple and reliable: each run completely replaces the partition, so re-runs produce exactly the same result. No key-matching logic needed.
When to use: Fact tables partitioned by date. Daily or hourly pipelines where you can afford to recompute the entire partition. This is the most common pattern for batch ELT in data warehouses.
-- Step 1: Delete the partition
DELETE FROM fact_orders
WHERE order_date = '2024-06-15';
-- Step 2: Insert fresh data
INSERT INTO fact_orders (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date
FROM staging_orders
WHERE order_date = '2024-06-15';
-- In some warehouses (BigQuery, Spark):
-- INSERT OVERWRITE fact_orders PARTITION (order_date = '2024-06-15')
-- SELECT ... FROM staging;
Watch out: The DELETE + INSERT is not atomic in most engines. If the pipeline fails between DELETE and INSERT, you lose data. Wrap the pair in a transaction, or use INSERT OVERWRITE, which is atomic. In BigQuery, use MERGE or the WRITE_TRUNCATE write disposition on the partition.
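A sketch of the transactional wrapper, using Python's built-in sqlite3 so it is runnable (schema and rows are illustrative; `with conn:` opens a transaction that commits on success and rolls back on error):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, order_date TEXT)")
conn.execute("INSERT INTO fact_orders VALUES (1, 10.0, '2024-06-15')")
conn.commit()

def overwrite_partition(rows, day):
    """Delete-then-insert for one date. The transaction makes the
    pair atomic: if the insert fails, the delete is rolled back."""
    try:
        with conn:
            conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
            conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    except Exception:
        pass  # the old partition contents survive the rollback

# Failing run: malformed row -> error -> rollback -> original row intact.
overwrite_partition([(2, 20.0)], "2024-06-15")
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 1

# Successful run, twice: same end state either way.
good = [(2, 20.0, "2024-06-15"), (3, 30.0, "2024-06-15")]
overwrite_partition(good, "2024-06-15")
overwrite_partition(good, "2024-06-15")
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 2
```

Without the transaction, the failing run would have deleted the original row and inserted nothing: the data-loss window the warning above describes.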
Instead of deleting data, mark it as superseded (soft delete) and insert new versions. Each record has a version or batch ID. Downstream queries filter to the latest version. This preserves history and avoids destructive operations.
When to use: When you need audit trails, when compliance requires you to keep all versions, or when the target system does not support DELETE efficiently (some data lakes).
-- Step 1: Mark existing records as superseded
UPDATE fact_orders
SET is_current = FALSE, superseded_at = CURRENT_TIMESTAMP
WHERE order_date = '2024-06-15'
AND is_current = TRUE;
-- Step 2: Insert new records as current
INSERT INTO fact_orders (order_id, customer_id, amount, order_date,
is_current, batch_id, loaded_at)
SELECT order_id, customer_id, amount, order_date,
TRUE, 'batch_20240615_v2', CURRENT_TIMESTAMP
FROM staging_orders
WHERE order_date = '2024-06-15';
Watch out: Downstream queries must always filter WHERE is_current = TRUE. If someone forgets the filter, they get duplicates. Enforce this with a view that wraps the table and applies the filter automatically.
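One way to enforce that filter, sketched in Python's built-in sqlite3 (names are illustrative): wrap the table in a view that applies is_current so readers cannot forget it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (
    order_id INTEGER, amount REAL, order_date TEXT,
    is_current INTEGER, batch_id TEXT
);
-- The view bakes in the filter so readers cannot forget it.
CREATE VIEW fact_orders_current AS
    SELECT order_id, amount, order_date
    FROM fact_orders WHERE is_current = 1;
""")

def load(day, rows, batch_id):
    """Supersede the current version of the partition, then append
    the new version; history is never deleted."""
    with conn:
        conn.execute(
            "UPDATE fact_orders SET is_current = 0 "
            "WHERE order_date = ? AND is_current = 1", (day,))
        conn.executemany(
            "INSERT INTO fact_orders VALUES (?, ?, ?, 1, ?)",
            [(oid, amt, day, batch_id) for oid, amt in rows])

load("2024-06-15", [(1, 10.0)], "v1")
load("2024-06-15", [(1, 12.0), (2, 20.0)], "v2")  # re-run with corrected data
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])          # 3: history kept
print(conn.execute("SELECT COUNT(*) FROM fact_orders_current").fetchone()[0])  # 2: latest only
```

Queries against fact_orders_current always see exactly one version; the base table keeps every batch for audits.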
INSERT ... ON CONFLICT is PostgreSQL's native upsert, and it was the only option before MERGE arrived in version 15. Specify the conflict target (usually a unique constraint) and what to do when a conflict is detected: update the existing row or do nothing.
When to use: PostgreSQL environments, especially when loading data incrementally from a staging table or an API.
INSERT INTO dim_product (product_id, name, price, updated_at)
SELECT product_id, name, price, NOW()
FROM staging_product
ON CONFLICT (product_id) DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price,
updated_at = EXCLUDED.updated_at;
Watch out: ON CONFLICT requires a unique constraint or unique index on the conflict columns. If no matching constraint exists, PostgreSQL rejects the statement with an error ("there is no unique or exclusion constraint matching the ON CONFLICT specification") rather than upserting. Verify the constraint exists before relying on ON CONFLICT.
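The same pattern runs in Python's built-in sqlite3, which shares the ON CONFLICT syntax (SQLite 3.24+; table names are illustrative). It also shows what happens when the target column has no unique constraint behind it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY is what ON CONFLICT(product_id) resolves against.
conn.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL)")

upsert = (
    "INSERT INTO dim_product (product_id, name, price) VALUES (?, ?, ?) "
    "ON CONFLICT(product_id) DO UPDATE SET name = excluded.name, price = excluded.price"
)
conn.execute(upsert, (1, "widget", 9.99))
conn.execute(upsert, (1, "widget", 10.99))  # re-run: update in place, no duplicate
print(conn.execute("SELECT COUNT(*), price FROM dim_product").fetchone())  # (1, 10.99)

# With no unique constraint on the target column, the statement is
# rejected outright instead of silently appending duplicates.
conn.execute("CREATE TABLE no_key (product_id INTEGER, name TEXT)")
err = None
try:
    conn.execute("INSERT INTO no_key VALUES (1, 'x') ON CONFLICT(product_id) DO NOTHING")
except sqlite3.OperationalError as exc:
    err = exc
print(err)  # the engine names the missing constraint
```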
Patterns that look correct but break on re-runs.
Problem: A pipeline runs INSERT INTO target SELECT ... FROM source. Re-running it appends the same rows again. After 3 runs, you have 3x the data.
Fix: Use MERGE or partition overwrite instead. If you must use INSERT, add a deduplication step: DELETE from target before INSERT, or use INSERT ... ON CONFLICT.
Problem: A pipeline skips rows where source.updated_at <= target.max_updated_at. This fails when the source clock drifts, when rows are updated without changing the timestamp, or when late-arriving data has old timestamps.
Fix: Use MERGE on the natural key instead of timestamp-based filtering. Timestamps are useful for optimization (reducing the scan window) but not reliable as the sole deduplication mechanism.
Problem: DELETE all rows, then INSERT new rows, without a transaction. If the INSERT fails, you have an empty table. Data loss.
Fix: Wrap DELETE + INSERT in a transaction. Or use INSERT OVERWRITE (atomic in BigQuery and Spark). Or use a temp table swap: insert into a temp table, then rename it to replace the original.
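The temp-table swap can be sketched in Python's built-in sqlite3 (names are illustrative). SQLite and PostgreSQL support transactional DDL, so the drop-and-rename pair can run inside one transaction; engines that auto-commit DDL need their own atomic rename instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; we manage transactions explicitly
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO fact_orders VALUES (1, 10.0)")

def rebuild(rows):
    """Build the replacement off to the side, then swap names in one
    transaction; a failure rolls back and the original table survives."""
    conn.execute("DROP TABLE IF EXISTS fact_orders_new")
    conn.execute("CREATE TABLE fact_orders_new (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO fact_orders_new VALUES (?, ?)", rows)
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE fact_orders")
        conn.execute("ALTER TABLE fact_orders_new RENAME TO fact_orders")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

rebuild([(2, 20.0), (3, 30.0)])
rebuild([(2, 20.0), (3, 30.0)])  # re-run: identical end state
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 2
```

Readers querying fact_orders see either the old table or the new one, never an empty table mid-rebuild.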
Problem: A pipeline updates three tables. The first two succeed, the third fails. On re-run, the pipeline processes all three again. The first two tables get duplicate updates.
Fix: Make each step independently idempotent. Use MERGE for each table so re-running a step that already succeeded is a no-op. Or use checkpointing: track which steps completed and skip them on retry.
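A minimal checkpointing sketch in Python's built-in sqlite3 (table and step names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints (run_id TEXT, step TEXT, PRIMARY KEY (run_id, step))")
executed = []

def run_step(run_id, step, fn):
    """Run fn once per (run_id, step); completed steps become no-ops,
    so retrying the whole pipeline resumes where it stopped."""
    if conn.execute("SELECT 1 FROM checkpoints WHERE run_id = ? AND step = ?",
                    (run_id, step)).fetchone():
        return
    fn()
    conn.execute("INSERT INTO checkpoints VALUES (?, ?)", (run_id, step))
    conn.commit()

run_step("2024-06-15", "load_orders", lambda: executed.append("orders"))
run_step("2024-06-15", "load_customers", lambda: executed.append("customers"))
# Retry after a failure elsewhere: neither completed step runs again.
run_step("2024-06-15", "load_orders", lambda: executed.append("orders"))
run_step("2024-06-15", "load_customers", lambda: executed.append("customers"))
print(executed)  # ['orders', 'customers']
```

The checkpoint must be committed in the same unit of work as the step's own writes (or each step must itself be idempotent), otherwise a crash between the write and the checkpoint reopens the duplicate-update problem.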
These test whether you can design, implement, and reason about idempotent pipelines.
Question: What does it mean for a pipeline to be idempotent?
What they test:
Basic definition. Re-running a pipeline with the same input produces the same output, regardless of how many times it runs. The state of the target is the same after 1 run or 100 runs.
Approach:
Give the definition, then explain why it matters: failure recovery, backfills, and retries all require idempotency. Without it, every pipeline failure risks data corruption.
Question: This pipeline runs INSERT INTO target SELECT FROM source. How do you make it safe to re-run?
What they test:
Practical application. The interviewer gives you a pipeline that does INSERT INTO target SELECT FROM source and asks you to make it safe for re-runs.
Approach:
Three options: (1) MERGE on the natural key. (2) DELETE + INSERT wrapped in a transaction. (3) Partition overwrite if the data is partitioned by date. Discuss which option fits the use case and why.
Question: When would you use MERGE versus INSERT OVERWRITE?
What they test:
Pattern selection. MERGE is row-level: match on key, update or insert. INSERT OVERWRITE is partition-level: replace all data in the partition.
Approach:
MERGE for dimension tables where you update individual rows. INSERT OVERWRITE for fact tables where you recompute entire partitions. MERGE preserves rows not in the source. INSERT OVERWRITE removes everything in the partition and replaces it.
Question: A pipeline ran twice and wrote duplicate data. How do you clean it up?
What they test:
Incident response and design awareness. The interviewer wants to see how you clean up AND how you prevent recurrence.
Approach:
If the pipeline was idempotent, there is nothing to fix. If it was not: identify duplicates using a unique key or batch_id, delete the extras (keep the earliest or latest based on loaded_at), then fix the pipeline to be idempotent so this cannot happen again.
Question: The transformation logic changed. How do you re-process the last 90 days?
What they test:
Whether your pipeline design supports safe backfills. An idempotent pipeline can be re-run for any date range. A non-idempotent pipeline creates duplicates.
Approach:
Parameterize the pipeline by date. Re-run it for each of the 90 dates. If the pipeline uses partition overwrite or MERGE, each re-run safely replaces the data for that date. Discuss parallelization: can you run 90 backfills concurrently, or do they need to run sequentially?
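The parameterized backfill can be sketched in Python's built-in sqlite3, with partition overwrite per day (names and the per-day load are stand-ins for the real extract/transform):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, order_date TEXT)")

def load_day(day):
    """Partition overwrite keyed by date: safe to re-run for any day."""
    with conn:
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
        # stand-in for the real extract/transform for that day
        conn.execute("INSERT INTO fact_orders VALUES (?, ?)", (1, day))

start = date(2024, 3, 18)  # illustrative start date
days = [(start + timedelta(days=i)).isoformat() for i in range(90)]
for d in days:   # the backfill
    load_day(d)
for d in days:   # an accidental second backfill: same end state
    load_day(d)
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 90
```

Because each run touches only its own date's partition, the 90 loads are independent and can usually run in parallel in a real orchestrator.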
Question: Write the MERGE to load a slowly changing dimension table.
What they test:
SQL MERGE syntax and SCD awareness. The interviewer may specify SCD Type 1 (overwrite) or Type 2 (add new row, expire old one).
Approach:
For Type 1: MERGE with UPDATE on match, INSERT on no match. For Type 2: when matched AND source values differ from target, expire the old row (set end_date, is_current = FALSE) and insert a new row. Standard MERGE does not allow INSERT in the MATCHED branch, so this takes two statements, or a single MERGE over a source unioned with itself so each changed row appears once as a match (to expire) and once as a non-match (to insert).
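The two-step Type 2 load can be sketched in Python's built-in sqlite3 (schema, emails, and dates are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, email TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER)""")

def scd2_load(customer_id, email, as_of):
    """Type 2 load in two steps: expire the current row if the
    tracked value changed, then insert the new version."""
    with conn:
        cur = conn.execute(
            "SELECT email FROM dim_customer "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,)).fetchone()
        if cur and cur[0] == email:
            return  # nothing changed: re-runs are no-ops
        if cur:
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_id = ? AND is_current = 1",
                (as_of, customer_id))
        conn.execute(
            "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, email, as_of))

scd2_load(1, "a@example.com", "2024-01-01")
scd2_load(1, "b@example.com", "2024-06-15")  # change: expire + insert
scd2_load(1, "b@example.com", "2024-06-15")  # re-run: no-op
print(conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0])  # 2
```

The no-change check is what makes the load idempotent: without it, every re-run would expire the current row and insert an identical new version.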
Question: Can a MERGE itself produce duplicates?
What they test:
Deep understanding of MERGE semantics. In some engines, concurrent MERGE statements can produce duplicates when both evaluate the NOT MATCHED branch for the same key simultaneously.
Approach:
Discuss the race condition: two transactions both check for key = 123, both find it missing, both insert. The result: two rows for key = 123. Solutions: table-level locks, serialized execution, or using INSERT ... ON CONFLICT instead of MERGE in PostgreSQL.
Idempotency is the foundation of reliable data engineering. Every other design decision (monitoring, alerting, backfills) is easier when your pipelines are idempotent. Practice the patterns with real SQL.