
Data Pipeline Backfill: Idempotent Replay at Scale

Backfill is re-running a pipeline over historical data. It only works if the pipeline is idempotent, partitioned by date, and keyed so replays don't duplicate rows. Every senior DE is expected to design one from scratch.

Idempotent backfill requires a staging landing zone, a partitioned target, and an orchestrator that can re-run one date at a time.

Overview

What Pipeline Backfill is and why it matters

A backfill re-runs a pipeline over a historical date range. Reasons: fixing a bug that produced wrong data, populating a new column across history, extending coverage to a new dimension, or recovering from a failed production run.

The only way to make backfill safe is idempotency: running the pipeline twice for the same date produces the same output. That constraint drives every downstream design decision: how you partition the destination, how you key rows, how you structure the orchestrator, and how you handle late-arriving data.
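The idempotency constraint can be sketched without any engine at all: a Python dict keyed by (run_date, customer_id) stands in for the partitioned table, and replaying a date overwrites its partition instead of appending. This is illustrative only; names like rebuild_partition are not from any real API.

```python
# (run_date, customer_id) -> (orders, revenue); stand-in for the fact table.
fact = {}

def rebuild_partition(run_date, staged_orders):
    """Idempotent rebuild: clear the partition, then re-aggregate it."""
    # "DELETE": drop every existing row for this partition.
    for key in [k for k in fact if k[0] == run_date]:
        del fact[key]
    # "INSERT": aggregate the staged rows for this date only.
    for customer_id, amount in staged_orders:
        orders, revenue = fact.get((run_date, customer_id), (0, 0.0))
        fact[(run_date, customer_id)] = (orders + 1, revenue + amount)

staged = [("c1", 10.0), ("c1", 5.0), ("c2", 7.0)]
rebuild_partition("2025-01-01", staged)
first = dict(fact)
rebuild_partition("2025-01-01", staged)  # replay the same date
assert fact == first                      # no duplicates, no drift
```

Running the rebuild a second time for the same date leaves the table byte-for-byte identical, which is exactly the property backfill depends on.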

Syntax

Pipeline Backfill reference

Idempotent: Re-running produces the same result. No duplicates, no drift. Prerequisite for backfill.
Partition by date: Destination table is partitioned on a run_date (or event_date) column. Replay scopes to a single partition.
DELETE + INSERT pattern: Delete the target partition, then insert from source. Simple and universal across engines.
MERGE / UPSERT: Engine-native alternative to DELETE + INSERT. Keyed by (partition, natural_key). Atomic in most engines.
Watermark / late data: The decision boundary: after N days, a partition is finalized and no longer backfilled. Common choice: 7 days.

Patterns

How Pipeline Backfill shows up in practice

DELETE + INSERT per partition

The universal backfill pattern. Works in every SQL engine. The partition column must be the filter key.

DELETE FROM fact_daily_orders
WHERE run_date = :DATE;

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date = :DATE
GROUP BY order_date, customer_id;

Interview note: The two statements must run in a single transaction, or the destination partition is briefly empty. Snowflake, BigQuery, and Databricks all support this pattern natively.
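The transactional DELETE + INSERT can be sketched against in-memory SQLite as a stand-in for the warehouse; in Python's sqlite3, using the connection as a context manager scopes both statements to one transaction, so readers never observe the empty partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_daily_orders
                (run_date TEXT, customer_id TEXT, orders INT, revenue REAL)""")
conn.execute("""CREATE TABLE stg_orders
                (order_date TEXT, customer_id TEXT, amount REAL)""")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                 [("2025-01-01", "c1", 10.0), ("2025-01-01", "c1", 5.0)])

def rebuild_partition(run_date):
    # Both statements commit together: one transaction per partition rebuild.
    with conn:
        conn.execute("DELETE FROM fact_daily_orders WHERE run_date = ?",
                     (run_date,))
        conn.execute("""INSERT INTO fact_daily_orders
                        SELECT order_date, customer_id, COUNT(*), SUM(amount)
                        FROM stg_orders WHERE order_date = ?
                        GROUP BY order_date, customer_id""", (run_date,))

rebuild_partition("2025-01-01")
rebuild_partition("2025-01-01")  # replay: the partition is unchanged
```

Warehouse engines expose the same idea through explicit BEGIN/COMMIT or atomic partition-overwrite primitives; the transaction boundary is what matters.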

MERGE with composite key

Single-statement alternative. Atomic. The merge key is (partition_column, natural_key); this is what makes replay idempotent.

MERGE INTO fact_daily_orders tgt
USING (
  SELECT order_date, customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
  FROM stg_orders
  WHERE order_date = :DATE
  GROUP BY order_date, customer_id
) src
ON tgt.run_date = src.order_date AND tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET orders = src.orders, revenue = src.revenue
WHEN NOT MATCHED THEN INSERT (run_date, customer_id, orders, revenue)
VALUES (src.order_date, src.customer_id, src.orders, src.revenue)

Interview note: MERGE is cleaner, but rows that existed in the target and are now absent from the source are NOT deleted. For strict partition-replace semantics, DELETE + INSERT is safer.
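The deletion caveat is easy to show with plain Python dicts standing in for a partition's rows: an upsert leaves the stale key behind, a replace does not.

```python
target = {"c1": 100.0, "c2": 50.0}  # existing partition rows
source = {"c1": 90.0}               # "c2" disappeared upstream

# MERGE-style upsert: updates/inserts matched keys, leaves "c2" stale.
merged = dict(target)
merged.update(source)

# DELETE + INSERT: strict partition replace, "c2" is gone.
replaced = dict(source)

assert "c2" in merged        # stale row survives a MERGE
assert "c2" not in replaced  # replace gives exact source semantics
```

This is why a MERGE-based backfill that must also remove vanished rows needs an extra WHEN NOT MATCHED BY SOURCE clause (where the engine supports it) or a scoped pre-delete.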

Parallel partition backfill (Airflow)

Airflow's per-task date parameterization makes partition-level parallelism the default. Each task handles one date; the DAG can run N in parallel.

from airflow.decorators import task, dag
from datetime import datetime

@dag(schedule="@daily", catchup=True, max_active_runs=4,
     start_date=datetime(2025, 1, 1))
def daily_orders_pipeline():

    @task
    def rebuild_partition(run_date: str):
        # Each task is scoped to exactly one partition.
        # Re-runnable without side effects.
        execute_sql(f"DELETE FROM fact_daily_orders WHERE run_date = '{run_date}'")
        execute_sql(f"INSERT INTO fact_daily_orders ... WHERE order_date = '{run_date}'")

    rebuild_partition("{{ ds }}")

daily_orders_pipeline()

Interview note: max_active_runs controls backfill parallelism. Too high and you overload the warehouse; too low and a six-month backfill takes days. A common choice is 4-8.
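Outside Airflow, the same bounded parallelism can be sketched with a thread pool, where max_workers plays the role of max_active_runs; rebuild_partition here is a placeholder for the per-partition DELETE + INSERT.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

done = []

def rebuild_partition(run_date):
    done.append(run_date)  # stand-in for rebuilding one partition

# One month of daily partitions to backfill.
start, end = date(2025, 1, 1), date(2025, 1, 31)
dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# At most 4 partitions in flight, mirroring max_active_runs=4.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(rebuild_partition, dates))

assert sorted(done) == dates  # every partition rebuilt exactly once
```

The pool size is the same trade-off the note describes: it caps concurrent load on the warehouse while keeping the wall-clock time of the backfill bounded.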

Late-arriving data with a watermark

Some data arrives hours or days after the event. The pipeline must draw a line: rebuild partitions within a watermark window (e.g. the last 7 days), and freeze older partitions as immutable.

DELETE FROM fact_daily_orders
WHERE run_date >= CURRENT_DATE - INTERVAL '7 days';

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY order_date, customer_id;

Interview note: The watermark is a business decision: how long do we wait for stragglers? Stating it explicitly in your design shows you've shipped production pipelines.
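The watermark check itself is small; this sketch pins "today" so the example is deterministic, and the 7-day value is the document's illustrative choice, not a standard.

```python
from datetime import date, timedelta

WATERMARK_DAYS = 7
today = date(2025, 6, 15)  # pinned for a deterministic example

def is_open(partition_date):
    """Partitions inside the watermark window still absorb late data;
    older partitions are frozen as immutable."""
    return partition_date >= today - timedelta(days=WATERMARK_DAYS)

assert is_open(date(2025, 6, 10))      # within 7 days: rebuilt daily
assert not is_open(date(2025, 6, 1))   # older: frozen
```

An orchestrator would call something like is_open per partition to decide whether a date is eligible for the rolling rebuild or must go through an explicit, manually approved backfill.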

Interview questions

Pipeline Backfill interview questions

Q1. Design a daily aggregation pipeline that can be backfilled for any historical date. Walk me through the design.

What they test: The full production skillset, covering partitioning, idempotency, orchestration, and late-data handling in one answer.

Approach: 1) Partition the destination by run_date. 2) Write an idempotent transform: DELETE WHERE run_date = :date, then INSERT filtered by the same date. 3) Orchestrate in Airflow with catchup=True so backfills use the same tasks. 4) Define a watermark (e.g. 7 days) for late-arriving data. 5) Emit data quality checks per partition: row count, null rate, duplicate check.
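The data quality step can be sketched as a per-partition check function; the row shape and thresholds here are illustrative assumptions, not a fixed framework.

```python
def check_partition(rows, run_date, min_rows=1):
    """Per-partition DQ checks: row count, null rate, duplicate keys.
    rows are (run_date, customer_id, orders, revenue) tuples."""
    part = [r for r in rows if r[0] == run_date]
    assert len(part) >= min_rows, "row count too low"
    assert all(r[1] is not None for r in part), "null customer_id"
    keys = [(r[0], r[1]) for r in part]
    assert len(keys) == len(set(keys)), "duplicate (date, customer) keys"
    return True

rows = [("2025-01-01", "c1", 2, 15.0), ("2025-01-01", "c2", 1, 7.0)]
assert check_partition(rows, "2025-01-01")
```

Wiring a check like this as the final task of each partition's run means a bad backfill fails loudly at the offending date instead of silently polluting downstream tables.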

Q2. A bug shipped yesterday caused the daily aggregation to double-count revenue. How do you fix history?

What they test: Real incident response. The answer separates engineers who have shipped pipelines from those who haven't.

Approach: 1) Identify the affected date range. 2) Deploy the fix. 3) Trigger a backfill scoped to the affected dates; the idempotent DELETE + INSERT pattern wipes the bad rows and writes correct ones. 4) Re-run downstream dependencies (dashboards, downstream pipelines). 5) Post-mortem: why did the bug pass QA? Add a data quality check that would have caught it.

Q3. What does it mean for a pipeline to be idempotent? Why does backfill require it?

What they test: Conceptual grounding. Candidates who cannot define idempotency cannot design backfill-safe pipelines.

Approach: Idempotent means running the pipeline N times for the same input produces the same output as running it once. Backfill reruns partitions; without idempotency, each rerun produces duplicates or drift. Techniques: scope writes to a partition, use DELETE + INSERT or MERGE keyed by (partition, natural_key), and keep transforms deterministic.

Q4. Your backfill for the last 6 months crashed halfway. How do you recover without redoing completed work?

What they test: Orchestration awareness. Partition-level granularity should let you resume from the failure point.

Approach: If each partition is an independent Airflow task, completed partitions are marked success and skipped on restart. Find the last failed partition, fix whatever broke (usually warehouse capacity or a source data issue), and resume. If partitions are not independent, refactor: large batch runs that can't resume are a design smell.
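Outside an orchestrator, the resume behavior reduces to tracking completed partitions; this sketch is illustrative and mirrors, rather than uses, Airflow's success-state bookkeeping.

```python
completed = set()  # stand-in for the orchestrator's success markers

def backfill(dates, fail_on=None):
    for d in dates:
        if d in completed:
            continue               # already rebuilt: skip on resume
        if d == fail_on:
            raise RuntimeError(f"partition {d} failed")
        completed.add(d)           # stand-in for the partition rebuild

dates = ["2025-01-01", "2025-01-02", "2025-01-03"]
try:
    backfill(dates, fail_on="2025-01-03")  # crash on the last partition
except RuntimeError:
    pass
backfill(dates)                    # resume: only the failed date reruns
assert completed == set(dates)
```

In production the completed-set must live somewhere durable (the orchestrator's metadata DB, or a control table), or a crash loses the resume point along with the run.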

Q5. How do you handle late-arriving data in a partitioned pipeline?

What they test: Watermark / rolling-window knowledge. A senior answer includes the business trade-off.

Approach: Define a watermark: how many days after a partition's date do we still accept updates? Typical: 3-7 days. Within the window, rebuild partitions daily. Outside the window, freeze the partition as immutable. Late data beyond the window goes into a correction partition or is dropped with an alert.

Q6. A stakeholder asks for a new column added to a table spanning 3 years of history. Walk me through the backfill.

What they test: Schema evolution and backfill combined. Watch for whether the candidate alters the table or rebuilds it.

Approach: 1) Add the column with a default (NULL or a sensible default). 2) Modify the transformation to populate the new column for the current date. 3) Trigger a backfill for the historical range. If the source data needed to compute the column doesn't exist historically, negotiate the scope: populate from the earliest date where source data exists, and document the gap.
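A minimal sketch of the add-then-backfill sequence, using in-memory SQLite as a stand-in engine; the table shape and the 42.0 revenue value are placeholders for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_daily_orders (run_date TEXT, orders INT)")
conn.execute("INSERT INTO fact_daily_orders VALUES ('2025-01-01', 3)")

# 1) Add the column; existing history reads as NULL until backfilled.
conn.execute("ALTER TABLE fact_daily_orders ADD COLUMN revenue REAL")
assert conn.execute("SELECT revenue FROM fact_daily_orders").fetchone() == (None,)

# 3) Backfill the historical range (here: one partition, placeholder value).
conn.execute("""UPDATE fact_daily_orders SET revenue = 42.0
                WHERE run_date = '2025-01-01'""")
row = conn.execute("SELECT revenue FROM fact_daily_orders").fetchone()
assert row == (42.0,)
```

In a real warehouse the backfill step is the pipeline's own idempotent partition rebuild rerun over the historical range, not an ad hoc UPDATE, so the same code path produces current and historical values.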

Common mistakes

What breaks in practice

Using INSERT without DELETE

If you only INSERT on backfill, the second run produces duplicate rows. The target partition must be cleared first or the MERGE must have a proper key.

Wrong

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date = :DATE
GROUP BY order_date, customer_id;

Right

DELETE FROM fact_daily_orders
WHERE run_date = :DATE;

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date = :DATE
GROUP BY order_date, customer_id;

Partitioning by ingest_date instead of event_date

If the destination partition is ingest_date (when we wrote), backfilling doesn't rewrite history; it creates new rows with today's ingest_date. Queries break.

Wrong

INSERT INTO fact_daily_orders
SELECT
  CURRENT_DATE AS run_date,
  ...
FROM stg_orders
WHERE order_date = :DATE;

Right

INSERT INTO fact_daily_orders
SELECT
  :DATE AS run_date,
  ...
FROM stg_orders
WHERE order_date = :DATE;

Non-transactional DELETE + INSERT

Between the DELETE and the INSERT, the target partition is empty. Dashboards querying mid-backfill show zero.

Wrong

DELETE FROM fact_daily_orders
WHERE run_date = :DATE;
INSERT INTO fact_daily_orders ...;

Right

BEGIN;
DELETE FROM fact_daily_orders
WHERE run_date = :DATE;
INSERT INTO fact_daily_orders ...;
COMMIT;

FAQ

Common questions

What is a backfill in a data pipeline?
A backfill is re-running a pipeline over a historical date range. Reasons include fixing a bug, populating a new column, or recovering from a failure. Safe backfill requires idempotent pipelines.
How do I backfill without creating duplicate rows?
Make the pipeline idempotent. The simplest pattern: partition the destination by date, then on each run, DELETE the target partition and INSERT the fresh results. Alternatively, use MERGE keyed by (partition_date, natural_key).
How does Airflow handle backfills?
Airflow's catchup=True triggers a DAG run for every scheduled interval between start_date and now. Each run is parameterized with its logical date. Backfill uses the same task definitions as a normal run, which is why idempotency matters.
What is a watermark in a backfill context?
A watermark is the cutoff age after which a partition is considered final. Before the watermark, the pipeline still rewrites the partition on each run to absorb late data. After the watermark, the partition is immutable.
Should I use DELETE + INSERT or MERGE for backfill?
DELETE + INSERT is simpler and gives strict partition-replace semantics. MERGE is a single atomic statement and is faster when most rows are unchanged, but it does not delete rows that disappeared in source unless you add explicit logic.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.