
Data Pipeline Backfill: Idempotent Replay at Scale

Backfill is re-running a pipeline over historical data. It only works if the pipeline is idempotent, partitioned by date, and keyed so replays don't duplicate rows. Every senior DE is expected to design one from scratch.

Idempotent backfill requires a staging landing zone, a partitioned target, and an orchestrator that can re-run one date at a time.

Overview

What Pipeline Backfill is and why it matters

A backfill re-runs a pipeline over a historical date range. Reasons: fixing a bug that produced wrong data, populating a new column across history, extending coverage to a new dimension, or recovering from a failed production run.

The only way to make backfill safe is idempotency: running the pipeline twice for the same date produces the same output. That constraint drives every downstream design decision: how you partition the destination, how you key rows, how you structure the orchestrator, and how you handle late-arriving data.
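The idempotency constraint can be sketched without any engine at all: a Python dict keyed by (run_date, customer_id) stands in for the partitioned table, and replaying a date overwrites its partition instead of appending. This is illustrative only; names like rebuild_partition are not from any real API.

```python
# (run_date, customer_id) -> (orders, revenue); stand-in for the fact table.
fact = {}

def rebuild_partition(run_date, staged_orders):
    """Idempotent rebuild: clear the partition, then re-aggregate it."""
    # "DELETE": drop every existing row for this partition.
    for key in [k for k in fact if k[0] == run_date]:
        del fact[key]
    # "INSERT": aggregate the staged rows for this date only.
    for customer_id, amount in staged_orders:
        orders, revenue = fact.get((run_date, customer_id), (0, 0.0))
        fact[(run_date, customer_id)] = (orders + 1, revenue + amount)

staged = [("c1", 10.0), ("c1", 5.0), ("c2", 7.0)]
rebuild_partition("2025-01-01", staged)
first = dict(fact)
rebuild_partition("2025-01-01", staged)  # replay the same date
assert fact == first                      # no duplicates, no drift
```

Running the rebuild a second time for the same date leaves the table byte-for-byte identical, which is exactly the property backfill depends on.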

Syntax

Pipeline Backfill reference

Idempotent: Re-running produces the same result. No duplicates, no drift. Prerequisite for backfill.
Partition by date: Destination table is partitioned on a run_date (or event_date) column. Replay scopes to a single partition.
DELETE + INSERT pattern: Delete the target partition, then insert from source. Simple and universal across engines.
MERGE / UPSERT: Engine-native alternative to DELETE + INSERT. Keyed by (partition, natural_key). Atomic in most engines.
Watermark / late data: The decision boundary: after N days, a partition is finalized and no longer backfilled. Common choice: 7 days.

Patterns

How Pipeline Backfill shows up in practice

DELETE + INSERT per partition

The universal backfill pattern. Works in every SQL engine. The partition column must be the filter key.

DELETE FROM fact_daily_orders
WHERE run_date = :DATE;

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date = :DATE
GROUP BY order_date, customer_id;

Interview note: The two statements must run in a single transaction, or the destination partition is briefly empty. Snowflake, BigQuery, and Databricks all support this pattern natively.
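The transactional DELETE + INSERT can be sketched against in-memory SQLite as a stand-in for the warehouse; in Python's sqlite3, using the connection as a context manager scopes both statements to one transaction, so readers never observe the empty partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_daily_orders
                (run_date TEXT, customer_id TEXT, orders INT, revenue REAL)""")
conn.execute("""CREATE TABLE stg_orders
                (order_date TEXT, customer_id TEXT, amount REAL)""")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                 [("2025-01-01", "c1", 10.0), ("2025-01-01", "c1", 5.0)])

def rebuild_partition(run_date):
    # Both statements commit together: one transaction per partition rebuild.
    with conn:
        conn.execute("DELETE FROM fact_daily_orders WHERE run_date = ?",
                     (run_date,))
        conn.execute("""INSERT INTO fact_daily_orders
                        SELECT order_date, customer_id, COUNT(*), SUM(amount)
                        FROM stg_orders WHERE order_date = ?
                        GROUP BY order_date, customer_id""", (run_date,))

rebuild_partition("2025-01-01")
rebuild_partition("2025-01-01")  # replay: the partition is unchanged
```

Warehouse engines expose the same idea through explicit BEGIN/COMMIT or atomic partition-overwrite primitives; the transaction boundary is what matters.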

MERGE with composite key

Single-statement alternative. Atomic. The merge key is (partition_column, natural_key); this is what makes replay idempotent.

MERGE INTO fact_daily_orders tgt
USING (
  SELECT order_date, customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
  FROM stg_orders
  WHERE order_date = :DATE
  GROUP BY order_date, customer_id
) src
ON tgt.run_date = src.order_date AND tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET orders = src.orders, revenue = src.revenue
WHEN NOT MATCHED THEN INSERT (run_date, customer_id, orders, revenue)
VALUES (src.order_date, src.customer_id, src.orders, src.revenue)

Interview note: MERGE is cleaner, but rows that existed in the target and are now absent from the source are NOT deleted. For strict partition-replace semantics, DELETE + INSERT is safer.
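The deletion caveat is easy to show with plain Python dicts standing in for a partition's rows: an upsert leaves the stale key behind, a replace does not.

```python
target = {"c1": 100.0, "c2": 50.0}  # existing partition rows
source = {"c1": 90.0}               # "c2" disappeared upstream

# MERGE-style upsert: updates/inserts matched keys, leaves "c2" stale.
merged = dict(target)
merged.update(source)

# DELETE + INSERT: strict partition replace, "c2" is gone.
replaced = dict(source)

assert "c2" in merged        # stale row survives a MERGE
assert "c2" not in replaced  # replace gives exact source semantics
```

This is why a MERGE-based backfill that must also remove vanished rows needs an extra WHEN NOT MATCHED BY SOURCE clause (where the engine supports it) or a scoped pre-delete.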

Parallel partition backfill (Airflow)

Airflow's per-task date parameterization makes partition-level parallelism the default. Each task handles one date; the DAG can run N in parallel.

from airflow.decorators import task, dag
from datetime import datetime

@dag(schedule="@daily", catchup=True, max_active_runs=4,
     start_date=datetime(2025, 1, 1))
def daily_orders_pipeline():

    @task
    def rebuild_partition(run_date: str):
        # Each task is scoped to exactly one partition.
        # Re-runnable without side effects.
        execute_sql(f"DELETE FROM fact_daily_orders WHERE run_date = '{run_date}'")
        execute_sql(f"INSERT INTO fact_daily_orders ... WHERE order_date = '{run_date}'")

    rebuild_partition("{{ ds }}")

daily_orders_pipeline()

Interview note: max_active_runs controls backfill parallelism. Too high and you overload the warehouse; too low and a six-month backfill takes days. A common choice is 4-8.
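Outside Airflow, the same bounded parallelism can be sketched with a thread pool, where max_workers plays the role of max_active_runs; rebuild_partition here is a placeholder for the per-partition DELETE + INSERT.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

done = []

def rebuild_partition(run_date):
    done.append(run_date)  # stand-in for rebuilding one partition

# One month of daily partitions to backfill.
start, end = date(2025, 1, 1), date(2025, 1, 31)
dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# At most 4 partitions in flight, mirroring max_active_runs=4.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(rebuild_partition, dates))

assert sorted(done) == dates  # every partition rebuilt exactly once
```

The pool size is the same trade-off the note describes: it caps concurrent load on the warehouse while keeping the wall-clock time of the backfill bounded.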

Late-arriving data with a watermark

Some data arrives hours or days after the event. The pipeline must draw a line: rebuild partitions within a watermark window (e.g. the last 7 days), and freeze older partitions as immutable.

DELETE FROM fact_daily_orders
WHERE run_date >= CURRENT_DATE - INTERVAL '7 days';

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY order_date, customer_id;

Interview note: The watermark is a business decision: how long do we wait for stragglers? Stating it explicitly in your design shows you've shipped production pipelines.
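The watermark check itself is small; this sketch pins "today" so the example is deterministic, and the 7-day value is the document's illustrative choice, not a standard.

```python
from datetime import date, timedelta

WATERMARK_DAYS = 7
today = date(2025, 6, 15)  # pinned for a deterministic example

def is_open(partition_date):
    """Partitions inside the watermark window still absorb late data;
    older partitions are frozen as immutable."""
    return partition_date >= today - timedelta(days=WATERMARK_DAYS)

assert is_open(date(2025, 6, 10))      # within 7 days: rebuilt daily
assert not is_open(date(2025, 6, 1))   # older: frozen
```

An orchestrator would call something like is_open per partition to decide whether a date is eligible for the rolling rebuild or must go through an explicit, manually approved backfill.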

Interview questions

Pipeline Backfill interview questions

Q1. Design a daily aggregation pipeline that can be backfilled for any historical date. Walk me through the design.

What they test: The full production skillset, covering partitioning, idempotency, orchestration, and late-data handling in one answer.

Approach: 1) Partition the destination by run_date. 2) Write an idempotent transform: DELETE WHERE run_date = :date, then INSERT filtered by the same date. 3) Orchestrate in Airflow with catchup=True so backfills use the same tasks. 4) Define a watermark (e.g. 7 days) for late-arriving data. 5) Emit data quality checks per partition: row count, null rate, duplicate check.
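The data quality step can be sketched as a per-partition check function; the row shape and thresholds here are illustrative assumptions, not a fixed framework.

```python
def check_partition(rows, run_date, min_rows=1):
    """Per-partition DQ checks: row count, null rate, duplicate keys.
    rows are (run_date, customer_id, orders, revenue) tuples."""
    part = [r for r in rows if r[0] == run_date]
    assert len(part) >= min_rows, "row count too low"
    assert all(r[1] is not None for r in part), "null customer_id"
    keys = [(r[0], r[1]) for r in part]
    assert len(keys) == len(set(keys)), "duplicate (date, customer) keys"
    return True

rows = [("2025-01-01", "c1", 2, 15.0), ("2025-01-01", "c2", 1, 7.0)]
assert check_partition(rows, "2025-01-01")
```

Wiring a check like this as the final task of each partition's run means a bad backfill fails loudly at the offending date instead of silently polluting downstream tables.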

Q2. A bug shipped yesterday caused the daily aggregation to double-count revenue. How do you fix history?

What they test: Real incident response. The answer separates engineers who have shipped pipelines from those who haven't.

Approach: 1) Identify the affected date range. 2) Deploy the fix. 3) Trigger a backfill scoped to the affected dates; the idempotent DELETE + INSERT pattern wipes the bad rows and writes correct ones. 4) Re-run downstream dependencies (dashboards, downstream pipelines). 5) Post-mortem: why did the bug pass QA? Add a data quality check that would have caught it.

Q3. What does it mean for a pipeline to be idempotent? Why does backfill require it?

What they test: Conceptual grounding. Candidates who cannot define idempotency cannot design backfill-safe pipelines.

Approach: Idempotent means running the pipeline N times for the same input produces the same output as running it once. Backfill reruns partitions; without idempotency, each rerun produces duplicates or drift. Techniques: scope writes to a partition, use DELETE + INSERT or MERGE keyed by (partition, natural_key), and keep transforms deterministic.

Q4. Your backfill for the last 6 months crashed halfway. How do you recover without redoing completed work?

What they test: Orchestration awareness. Partition-level granularity should let you resume from the failure point.

Approach: If each partition is an independent Airflow task, completed partitions are marked success and skipped on restart. Find the last failed partition, fix whatever broke (usually warehouse capacity or a source data issue), and resume. If partitions are not independent, refactor: large batch runs that can't resume are a design smell.
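Outside an orchestrator, the resume behavior reduces to tracking completed partitions; this sketch is illustrative and mirrors, rather than uses, Airflow's success-state bookkeeping.

```python
completed = set()  # stand-in for the orchestrator's success markers

def backfill(dates, fail_on=None):
    for d in dates:
        if d in completed:
            continue               # already rebuilt: skip on resume
        if d == fail_on:
            raise RuntimeError(f"partition {d} failed")
        completed.add(d)           # stand-in for the partition rebuild

dates = ["2025-01-01", "2025-01-02", "2025-01-03"]
try:
    backfill(dates, fail_on="2025-01-03")  # crash on the last partition
except RuntimeError:
    pass
backfill(dates)                    # resume: only the failed date reruns
assert completed == set(dates)
```

In production the completed-set must live somewhere durable (the orchestrator's metadata DB, or a control table), or a crash loses the resume point along with the run.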

Q5. How do you handle late-arriving data in a partitioned pipeline?

What they test: Watermark / rolling-window knowledge. A senior answer includes the business trade-off.

Approach: Define a watermark: how many days after a partition's date do we still accept updates? Typical: 3-7 days. Within the window, rebuild partitions daily. Outside the window, freeze the partition as immutable. Late data beyond the window goes into a correction partition or is dropped with an alert.

Q6. A stakeholder asks for a new column added to a table spanning 3 years of history. Walk me through the backfill.

What they test: Schema evolution and backfill combined. Watch for whether the candidate alters the table or rebuilds it.

Approach: 1) Add the column with a default (NULL or a sensible default). 2) Modify the transformation to populate the new column for the current date. 3) Trigger a backfill for the historical range. If the source data needed to compute the column doesn't exist historically, negotiate the scope: populate from the earliest date where source data exists, and document the gap.
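A minimal sketch of the add-then-backfill sequence, using in-memory SQLite as a stand-in engine; the table shape and the 42.0 revenue value are placeholders for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_daily_orders (run_date TEXT, orders INT)")
conn.execute("INSERT INTO fact_daily_orders VALUES ('2025-01-01', 3)")

# 1) Add the column; existing history reads as NULL until backfilled.
conn.execute("ALTER TABLE fact_daily_orders ADD COLUMN revenue REAL")
assert conn.execute("SELECT revenue FROM fact_daily_orders").fetchone() == (None,)

# 3) Backfill the historical range (here: one partition, placeholder value).
conn.execute("""UPDATE fact_daily_orders SET revenue = 42.0
                WHERE run_date = '2025-01-01'""")
row = conn.execute("SELECT revenue FROM fact_daily_orders").fetchone()
assert row == (42.0,)
```

In a real warehouse the backfill step is the pipeline's own idempotent partition rebuild rerun over the historical range, not an ad hoc UPDATE, so the same code path produces current and historical values.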

Common mistakes

What breaks in practice

Using INSERT without DELETE

If you only INSERT on backfill, the second run produces duplicate rows. The target partition must be cleared first or the MERGE must have a proper key.

Wrong

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date = :DATE
GROUP BY order_date, customer_id;

Right

DELETE FROM fact_daily_orders
WHERE run_date = :DATE;

INSERT INTO fact_daily_orders (run_date, customer_id, orders, revenue)
SELECT
  order_date,
  customer_id,
  COUNT(*),
  SUM(amount)
FROM stg_orders
WHERE order_date = :DATE
GROUP BY order_date, customer_id;

Partitioning by ingest_date instead of event_date

If the destination partition is ingest_date (when we wrote), backfilling doesn't rewrite history; it creates new rows with today's ingest_date. Queries break.

Wrong

INSERT INTO fact_daily_orders
SELECT
  CURRENT_DATE AS run_date,
  ...
FROM stg_orders
WHERE order_date = :DATE;

Right

INSERT INTO fact_daily_orders
SELECT
  :DATE AS run_date,
  ...
FROM stg_orders
WHERE order_date = :DATE;

Non-transactional DELETE + INSERT

Between the DELETE and the INSERT, the target partition is empty. Dashboards querying mid-backfill show zero.

Wrong

DELETE FROM fact_daily_orders
WHERE run_date = :DATE;
INSERT INTO fact_daily_orders ...;

Right

BEGIN;
DELETE FROM fact_daily_orders
WHERE run_date = :DATE;
INSERT INTO fact_daily_orders ...;
COMMIT;

FAQ

Common questions

What is a backfill in a data pipeline?
A backfill is re-running a pipeline over a historical date range. Reasons include fixing a bug, populating a new column, or recovering from a failure. Safe backfill requires idempotent pipelines.
How do I backfill without creating duplicate rows?
Make the pipeline idempotent. The simplest pattern: partition the destination by date, then on each run, DELETE the target partition and INSERT the fresh results. Alternatively, use MERGE keyed by (partition_date, natural_key).
How does Airflow handle backfills?
Airflow's catchup=True triggers a DAG run for every scheduled interval between start_date and now. Each run is parameterized with its logical date. Backfill uses the same task definitions as a normal run, which is why idempotency matters.
What is a watermark in a backfill context?
A watermark is the cutoff age after which a partition is considered final. Before the watermark, the pipeline still rewrites the partition on each run to absorb late data. After the watermark, the partition is immutable.
Should I use DELETE + INSERT or MERGE for backfill?
DELETE + INSERT is simpler and gives strict partition-replace semantics. MERGE is a single atomic statement and is faster when most rows are unchanged, but it does not delete rows that disappeared in source unless you add explicit logic.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.