Data Pipeline Backfill: Idempotent Replay at Scale
Backfill is re-running a pipeline over historical data. It only works if the pipeline is idempotent, partitioned by date, and keyed so replays don't duplicate rows. Every senior DE is expected to design one from scratch.
An idempotent backfill design needs three pieces: a staging landing zone, a partitioned target table, and an orchestrator that can re-run one date at a time.
Overview
What Pipeline Backfill is and why it matters
A backfill re-runs a pipeline over a historical date range. Reasons: fixing a bug that produced wrong data, populating a new column across history, extending coverage to a new dimension, or recovering from a failed production run.
The only way to make backfill safe is idempotency: running the pipeline twice for the same date produces the same output. That constraint drives every downstream design decision: how you partition the destination, how you key rows, how you structure the orchestrator, and how you handle late-arriving data.
Syntax
Pipeline Backfill reference
| Token | Meaning |
|---|---|
| Idempotent | Re-running produces the same result. No duplicates, no drift. Prerequisite for backfill. |
| Partition by date | Destination table is partitioned on a run_date (or event_date) column. Replay scopes to a single partition. |
| DELETE + INSERT pattern | Delete the target partition, then insert from source. Simple and universal across engines. |
| MERGE / UPSERT | Engine-native alternative to DELETE + INSERT. Keyed by (partition, natural_key). Atomic in most engines. |
| Watermark / late data | The decision boundary: after N days, a partition is finalized and no longer backfilled. Common choice: 7 days. |
Patterns
How Pipeline Backfill shows up in practice
DELETE + INSERT per partition
The universal backfill pattern. Works in every SQL engine. The partition column must be the filter key.
Interview note: The two statements must run in a single transaction, or the destination partition is briefly empty. Snowflake, BigQuery, and Databricks all support this pattern natively.
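A minimal sketch of the pattern, using Python's stdlib sqlite3 as a stand-in for a warehouse engine (the `events` and `daily_agg` tables and their columns are illustrative assumptions, not from the text):

```python
import sqlite3

# Illustrative schema -- table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_date TEXT, amount REAL);
    CREATE TABLE daily_agg (run_date TEXT, total REAL);
    INSERT INTO events VALUES ('2024-01-01', 10.0), ('2024-01-01', 5.0);
""")

def backfill_partition(conn, run_date):
    # Both statements inside one transaction: readers never see an empty
    # partition, and re-running for the same date is idempotent.
    with conn:  # sqlite3 connection as context manager = atomic commit/rollback
        conn.execute("DELETE FROM daily_agg WHERE run_date = ?", (run_date,))
        conn.execute(
            """INSERT INTO daily_agg
               SELECT event_date, SUM(amount) FROM events
               WHERE event_date = ? GROUP BY event_date""",
            (run_date,),
        )

backfill_partition(conn, "2024-01-01")
backfill_partition(conn, "2024-01-01")  # replay: still exactly one row
rows = conn.execute("SELECT * FROM daily_agg").fetchall()
print(rows)  # [('2024-01-01', 15.0)]
```

The partition column is the only filter key in both statements, which is what scopes the replay to one date.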
MERGE with composite key
Single-statement alternative. Atomic. The merge key is (partition_column, natural_key); this composite key is what makes replay idempotent.
Interview note: MERGE is cleaner, but rows that existed in the target and are now absent from source are NOT deleted. For strict partition-replace semantics, DELETE + INSERT is safer.
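SQLite has no MERGE statement, so this sketch uses its upsert syntax (`INSERT ... ON CONFLICT ... DO UPDATE`) as a stand-in for engine-native MERGE; table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_agg (
        run_date TEXT, natural_key TEXT, total REAL,
        PRIMARY KEY (run_date, natural_key)  -- the composite merge key
    )
""")

def merge_rows(conn, rows):
    # Upsert keyed by (run_date, natural_key): replaying the same input
    # updates in place instead of duplicating.
    with conn:
        conn.executemany(
            """INSERT INTO daily_agg (run_date, natural_key, total)
               VALUES (?, ?, ?)
               ON CONFLICT(run_date, natural_key)
               DO UPDATE SET total = excluded.total""",
            rows,
        )

merge_rows(conn, [("2024-01-01", "us", 100.0), ("2024-01-01", "eu", 40.0)])
merge_rows(conn, [("2024-01-01", "us", 90.0)])  # replay with a corrected value
result = conn.execute(
    "SELECT natural_key, total FROM daily_agg ORDER BY natural_key"
).fetchall()
print(result)  # [('eu', 40.0), ('us', 90.0)]
# Note: 'eu' survives even though the replay no longer mentions it --
# MERGE alone does not give partition-replace semantics.
```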
Parallel partition backfill (Airflow)
Airflow's per-task date parameterization makes partition-level parallelism the default. Each task handles one date; the DAG can run N in parallel.
Interview note: max_active_runs controls backfill parallelism. Too high and you overload the warehouse; too low and a 6-month backfill takes days. A common choice is 4-8.
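A plain-Python sketch of the same idea, with `ThreadPoolExecutor` standing in for Airflow's scheduler and `max_workers` for `max_active_runs` (this is not real Airflow code; `backfill_one_partition` is a hypothetical placeholder for the per-date transform):

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def backfill_one_partition(run_date):
    # Hypothetical placeholder for the idempotent per-date transform.
    return run_date

def backfill_range(start, end, max_active_runs=4):
    # One unit of work per date; the bounded pool caps how many
    # partitions are in flight at once, like max_active_runs in Airflow.
    dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    with ThreadPoolExecutor(max_workers=max_active_runs) as pool:
        return list(pool.map(backfill_one_partition, dates))

done = backfill_range(date(2024, 1, 1), date(2024, 1, 10))
print(len(done))  # 10 partitions processed, at most 4 concurrently
```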
Late-arriving data with a watermark
Some data arrives hours or days after the event. The standard policy: keep rebuilding partitions inside a watermark window (e.g. the last 7 days), and freeze older partitions as immutable.
Interview note: The watermark is a business decision: how long do we wait for stragglers? Stating it explicitly in your design shows you've shipped production pipelines.
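The watermark rule reduces to a small decision function. A sketch, using the 7-day window from the text (function names are illustrative):

```python
from datetime import date, timedelta

def partitions_to_rebuild(today, watermark_days=7):
    # Inside the window: rebuilt on every run to absorb late data.
    return [today - timedelta(days=i) for i in range(watermark_days)]

def is_frozen(partition_date, today, watermark_days=7):
    # Outside the window: the partition is final and never backfilled.
    return (today - partition_date).days >= watermark_days

today = date(2024, 6, 15)
print(is_frozen(date(2024, 6, 10), today))  # False: still inside the window
print(is_frozen(date(2024, 6, 1), today))   # True: immutable
```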
Interview questions
Pipeline Backfill interview questions
Q1. Design a daily aggregation pipeline that can be backfilled for any historical date. Walk me through the design.
What they test: The full production skillset: partitioning, idempotency, orchestration, and late-data handling in one answer.
Approach: 1) Partition the destination by run_date. 2) Write an idempotent transform: DELETE WHERE run_date = :date, then INSERT filtered by the same date. 3) Orchestrate in Airflow with catchup=True so backfills use the same task. 4) Define a watermark (e.g. 7 days) for late-arriving data. 5) Emit data quality checks per partition: row count, null rate, duplicate check.
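Step 5 can be sketched as a per-partition check function. A SQLite stand-in; the table, columns, and thresholds are illustrative assumptions:

```python
import sqlite3

def partition_checks(conn, table, run_date):
    # Row count, null rate, and duplicate-key count for one partition.
    n = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE run_date = ?", (run_date,)
    ).fetchone()[0]
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE run_date = ? AND total IS NULL",
        (run_date,),
    ).fetchone()[0]
    dups = conn.execute(
        f"""SELECT COUNT(*) FROM (
              SELECT natural_key FROM {table} WHERE run_date = ?
              GROUP BY natural_key HAVING COUNT(*) > 1)""",
        (run_date,),
    ).fetchone()[0]
    return {"row_count": n, "null_rate": (nulls / n) if n else 0.0,
            "duplicates": dups}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_agg (run_date TEXT, natural_key TEXT, total REAL);
    INSERT INTO daily_agg VALUES ('2024-01-01', 'us', 100.0),
                                 ('2024-01-01', 'eu', NULL);
""")
checks = partition_checks(conn, "daily_agg", "2024-01-01")
print(checks)  # {'row_count': 2, 'null_rate': 0.5, 'duplicates': 0}
```

In production these results would gate the pipeline: fail the run (and skip publishing) when a check breaches its threshold.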
Q2. A bug shipped yesterday caused the daily aggregation to double-count revenue. How do you fix history?
What they test: Real incident response. The answer separates engineers who have shipped pipelines from those who haven't.
Approach: 1) Identify the affected date range. 2) Deploy the fix. 3) Trigger a backfill scoped to the affected dates. The idempotent DELETE+INSERT pattern wipes the bad rows and writes correct ones. 4) Re-run downstream dependencies (dashboards, downstream pipelines). 5) Post-mortem: why did the bug pass QA? Add a data quality check that would have caught it.
Q3. What does it mean for a pipeline to be idempotent? Why does backfill require it?
What they test: Conceptual grounding. Candidates who cannot define idempotent cannot design backfill-safe pipelines.
Approach: Idempotent means running the pipeline N times for the same input produces the same output as running it once. Backfill reruns partitions; without idempotency, each rerun produces duplicates or drift. Techniques: scope writes to a partition, DELETE+INSERT or MERGE keyed by (partition, natural_key), deterministic transforms.
Q4. Your backfill for the last 6 months crashed halfway. How do you recover without redoing completed work?
What they test: Orchestration awareness. Partition-level granularity should let you resume from the failure point.
Approach: If each partition is an independent Airflow task, completed partitions are marked success and skipped on restart. Find the last failed partition, fix whatever broke (usually warehouse capacity or a source data issue), and resume. If partitions are not independent, refactor: large batch runs that can't resume are a design smell.
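The resume logic can be sketched engine-agnostically: record completed partitions and skip them on restart. This is a simplified stand-in for the orchestrator's success-state tracking, not Airflow internals:

```python
from datetime import date, timedelta

def backfill_range(dates, completed, run_partition):
    # 'completed' persists across attempts (in Airflow, task state plays
    # this role); only unfinished partitions are run.
    for d in dates:
        if d in completed:
            continue  # already succeeded on a previous attempt
        run_partition(d)
        completed.add(d)
    return completed

dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(5)]
completed = set(dates[:3])          # first attempt crashed after 3 days
ran = []
backfill_range(dates, completed, ran.append)
print(ran)  # only the two remaining partitions are re-run
```

Because each partition transform is idempotent, re-running a partition that did complete would also be harmless; the checkpoint just saves the wasted work.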
Q5. How do you handle late-arriving data in a partitioned pipeline?
What they test: Watermark / rolling window knowledge. A senior answer includes the business trade-off.
Approach: Define a watermark: how many days after a partition's date do we still accept updates? Typical: 3-7 days. Within the window, rebuild partitions daily. Outside the window, freeze the partition as immutable. Late data beyond the window goes into a correction partition or is dropped with an alert.
Q6. A stakeholder asks for a new column added to a table spanning 3 years of history. Walk me through the backfill.
What they test: Schema evolution + backfill combined. Watch for whether the candidate alters the table or rebuilds it.
Approach: 1) Add the column with a default (NULL or a sensible default). 2) Modify the transformation to populate the new column for the current date. 3) Trigger a backfill for the historical range. If the source data needed to compute the column doesn't exist historically, negotiate the scope: populate from the earliest date where source data exists, and document the gap.
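A minimal SQLite sketch of steps 1 and 3 (table, column, and the backfilled value are illustrative assumptions; in production the backfill would recompute the column from source data, not take a literal):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_agg (run_date TEXT, total REAL);
    INSERT INTO daily_agg VALUES ('2024-01-01', 15.0), ('2024-01-02', 7.0);
""")

# Step 1: add the column -- existing rows get the NULL default.
conn.execute("ALTER TABLE daily_agg ADD COLUMN order_count INTEGER")

def backfill_column(conn, run_date, value):
    # Step 3: populate one historical partition at a time.
    with conn:
        conn.execute(
            "UPDATE daily_agg SET order_count = ? WHERE run_date = ?",
            (value, run_date),
        )

backfill_column(conn, "2024-01-01", 3)
result = conn.execute("SELECT * FROM daily_agg ORDER BY run_date").fetchall()
print(result)  # [('2024-01-01', 15.0, 3), ('2024-01-02', 7.0, None)]
```

The un-backfilled partition keeps its NULL, which is exactly the documented gap the approach describes.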
Common mistakes
What breaks in practice
Using INSERT without DELETE
If you only INSERT on backfill, the second run produces duplicate rows. The target partition must be cleared first or the MERGE must have a proper key.
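A runnable sketch of both sides, using SQLite (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_date TEXT, amount REAL);
    CREATE TABLE daily_agg (run_date TEXT, total REAL);
    INSERT INTO events VALUES ('2024-01-01', 10.0);
""")
INSERT_SQL = """INSERT INTO daily_agg
                SELECT event_date, SUM(amount) FROM events
                WHERE event_date = ? GROUP BY event_date"""

# Wrong: INSERT only -- each replay adds another copy of the partition.
for _ in range(2):
    conn.execute(INSERT_SQL, ("2024-01-01",))
after_insert_only = conn.execute(
    "SELECT COUNT(*) FROM daily_agg").fetchone()[0]
print(after_insert_only)  # 2 -- the replay duplicated the row

# Right: clear the partition first, then insert, inside one transaction.
for _ in range(2):
    with conn:
        conn.execute("DELETE FROM daily_agg WHERE run_date = ?",
                     ("2024-01-01",))
        conn.execute(INSERT_SQL, ("2024-01-01",))
after_delete_insert = conn.execute(
    "SELECT COUNT(*) FROM daily_agg").fetchone()[0]
print(after_delete_insert)  # 1 -- replays are now idempotent
```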
Partitioning by ingest_date instead of event_date
If the destination partition is ingest_date (when we wrote), backfilling doesn't rewrite history; it creates new rows with today's ingest_date. Queries break.
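A sketch of the failure mode (SQLite stand-in; table and column names are illustrative assumptions):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_date TEXT, amount REAL);
    CREATE TABLE agg_by_ingest (ingest_date TEXT, total REAL);
    CREATE TABLE agg_by_event (event_date TEXT, total REAL);
    INSERT INTO events VALUES ('2024-01-01', 10.0);
""")
today = date(2024, 6, 15).isoformat()

# Wrong: partition on when the pipeline ran. Replaying 2024-01-01 lands
# in the 2024-06-15 partition -- history is never rewritten.
conn.execute(
    """INSERT INTO agg_by_ingest
       SELECT ?, SUM(amount) FROM events WHERE event_date = '2024-01-01'""",
    (today,),
)
ingest_rows = conn.execute(
    "SELECT ingest_date FROM agg_by_ingest").fetchall()
print(ingest_rows)  # [('2024-06-15',)]

# Right: partition on when the event happened, so replay targets the
# historical partition being corrected.
with conn:
    conn.execute("DELETE FROM agg_by_event WHERE event_date = '2024-01-01'")
    conn.execute(
        """INSERT INTO agg_by_event
           SELECT event_date, SUM(amount) FROM events
           WHERE event_date = '2024-01-01' GROUP BY event_date""")
event_rows = conn.execute(
    "SELECT event_date FROM agg_by_event").fetchall()
print(event_rows)  # [('2024-01-01',)]
```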
Non-transactional DELETE + INSERT
Between the DELETE and the INSERT, the target partition is empty. Dashboards querying mid-backfill show zero.
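A sketch of the mid-backfill gap, with the "concurrent reader" simulated inline between the two statements (SQLite in autocommit mode; names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.executescript("""
    CREATE TABLE daily_agg (run_date TEXT, total REAL);
    INSERT INTO daily_agg VALUES ('2024-01-01', 15.0);
""")

# Wrong: two autocommitted statements with a visible gap between them.
conn.execute("DELETE FROM daily_agg WHERE run_date = '2024-01-01'")
mid = conn.execute(
    "SELECT COUNT(*) FROM daily_agg WHERE run_date = '2024-01-01'"
).fetchone()[0]
print(mid)  # 0 -- a dashboard querying right now shows zero revenue
conn.execute("INSERT INTO daily_agg VALUES ('2024-01-01', 15.0)")

# Right: both statements in one transaction, so readers see either the
# old partition or the new one, never the empty gap.
conn.execute("BEGIN")
conn.execute("DELETE FROM daily_agg WHERE run_date = '2024-01-01'")
conn.execute("INSERT INTO daily_agg VALUES ('2024-01-01', 15.0)")
conn.execute("COMMIT")
final = conn.execute("SELECT COUNT(*) FROM daily_agg").fetchone()[0]
print(final)  # 1
```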
FAQ
Common questions
- What is a backfill in a data pipeline?
- A backfill is re-running a pipeline over a historical date range. Reasons include fixing a bug, populating a new column, or recovering from a failure. Safe backfill requires idempotent pipelines.
- How do I backfill without creating duplicate rows?
- Make the pipeline idempotent. The simplest pattern: partition the destination by date, then on each run, DELETE the target partition and INSERT the fresh results. Alternatively, use MERGE keyed by (partition_date, natural_key).
- How does Airflow handle backfills?
- Airflow's catchup=True triggers a DAG run for every scheduled interval between start_date and now. Each run is parameterized with its logical date. Backfill uses the same task definitions as a normal run, which is why idempotency matters.
- What is a watermark in a backfill context?
- A watermark is the cutoff age after which a partition is considered final. Before the watermark, the pipeline still rewrites the partition on each run to absorb late data. After the watermark, the partition is immutable.
- Should I use DELETE + INSERT or MERGE for backfill?
- DELETE + INSERT is simpler and gives strict partition-replace semantics. MERGE is a single atomic statement and is faster when most rows are unchanged, but it does not delete rows that disappeared in source unless you add explicit logic.