The Incremental Loading Question

The universal follow-up to every pipeline design question

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: Full Refresh or Incremental?, How Do You Capture Changes?, What About Schema Changes?, How Do You Backfill?, How Do You Track History?

Lesson Sections

Full Refresh or Incremental? (concepts: paFullVsIncremental)
The interviewer is not asking you to pick one. They are testing whether you can reason about the tradeoff. The trap is saying "incremental, obviously" without acknowledging that every pipeline starts as a full refresh and that full refresh is correct until it is not.
The Cost Crossover
Your answer should frame this as a cost function. Full refresh cost grows linearly with source table size. Incremental has fixed overhead - change detection, merge logic, state tracking - plus variable cost that g
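The cost framing above can be sketched as a small comparison. This is a minimal illustration, not the lesson's own model: the function names, the default cost constants, and the assumption that incremental variable cost scales with the number of changed rows are all hypothetical, since the preview text is cut off before it says what the variable cost depends on.

```python
def full_refresh_cost(total_rows: int, cost_per_row: float) -> float:
    """Full refresh reprocesses every row, so cost is linear in table size."""
    return total_rows * cost_per_row


def incremental_cost(
    changed_rows: int, fixed_overhead: float, cost_per_changed_row: float
) -> float:
    """Fixed overhead (change detection, merge logic, state tracking)
    plus a variable part. ASSUMPTION: the variable part scales with
    the number of changed rows."""
    return fixed_overhead + changed_rows * cost_per_changed_row


def cheaper_strategy(
    total_rows: int,
    changed_rows: int,
    cost_per_row: float = 1.0,            # hypothetical unit cost
    fixed_overhead: float = 50_000.0,     # hypothetical overhead
    cost_per_changed_row: float = 3.0,    # hypothetical merge cost
) -> str:
    """Return which strategy is cheaper under the assumed cost model."""
    full = full_refresh_cost(total_rows, cost_per_row)
    inc = incremental_cost(changed_rows, fixed_overhead, cost_per_changed_row)
    return "incremental" if inc < full else "full_refresh"


print(cheaper_strategy(total_rows=10_000, changed_rows=500))
print(cheaper_strategy(total_rows=10_000_000, changed_rows=100_000))
```

Under these made-up constants the small table favors full refresh and the large one favors incremental, which is the crossover the answer is meant to articulate.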
How Do You Capture Changes? (concepts: paCdc)
This is the CDC question. The interviewer wants to hear that you know exactly two families of solutions - query-based and log-based - and can articulate when each one wins. The trap is defaulting to timestamp CDC without acknowledging its blind spots.
Timestamp-Based CDC
The simplest CDC approach and the one most candidates reach for first. Every source table has an updated_at column, and your pipeline queries WHERE updated_at > last_watermark. This works if and only if every modification to every r
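A watermark query of the kind described above might look like the following sketch. It uses Python's built-in sqlite3 module with an in-memory database; the customers table, its columns other than updated_at, the sample rows, and the function names are assumptions made for illustration only.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build a hypothetical source table with an updated_at column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "Ada", "2024-01-01T00:00:00"),
            (2, "Grace", "2024-01-02T00:00:00"),
            (3, "Edsger", "2024-01-03T00:00:00"),
        ],
    )
    return conn


def fetch_changes(conn: sqlite3.Connection, last_watermark: str):
    """Return rows modified after the watermark and the new watermark.

    ISO-8601 timestamps stored as text compare correctly as strings.
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only when something was actually read.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = demo_connection()
changes, watermark = fetch_changes(conn, "2024-01-01T00:00:00")
print(changes, watermark)
```

One commonly cited blind spot is visible in this shape of query: a row that is hard-deleted from the source simply stops existing, so it never shows up in the result set and the target is never told about the delete.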
What About Schema Changes? (concepts: paSchemaEvolution)
Your incremental pipeline is running smoothly. Then on Tuesday at 2 AM, the backend team adds a new column to the source table. Your pipeline either crashes, silently drops the new column, or - if you are lucky - handles it gracefully. The interviewer who asks "what about schema changes?" is testing whether you have experienced this pain. Your answer should prove that you have.
The Schema Change Taxonomy
Not all schema changes are equal. The interviewer wants to hear you classify them by impact
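Detecting a change before the load runs is one way to avoid the crash-or-silently-drop outcomes described above. The sketch below compares a source schema with a target schema. The three buckets it reports (added, removed, retyped) are this example's own assumption, not the lesson's taxonomy, which the preview does not show; the function name and the sample schemas are likewise hypothetical.

```python
from __future__ import annotations


def diff_schema(
    source: dict[str, str], target: dict[str, str]
) -> dict[str, list[str]]:
    """Compare column-name -> type mappings and bucket the differences.

    added:   columns present in the source but not yet in the target
    removed: columns the target still has but the source dropped
    retyped: columns present in both whose declared type differs
    """
    added = sorted(c for c in source if c not in target)
    removed = sorted(c for c in target if c not in source)
    retyped = sorted(
        c for c in source if c in target and source[c] != target[c]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas: the backend team added `age` and widened `email`.
source_schema = {"id": "int", "email": "text", "age": "int"}
target_schema = {"id": "int", "email": "varchar", "name": "text"}
print(diff_schema(source_schema, target_schema))
```

A pipeline could run a check like this before each load and decide per bucket whether to proceed, alter the target, or stop and alert.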
How Do You Backfill? (concepts: paBackfill, paDagOrchestration)
Backfill is the question that exposes whether your pipelines are production-grade or demo-ware. A new column is added, a bug corrupted three weeks of data, a new downstream model needs historical features - all of these require reprocessing historical data through a pipeline designed to only look forward. The interviewer wants to hear that you design for backfill from day one, not bolt it on after the first incident.
Partition-Level Backfill
Your answer should start here: partition-level backfil
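The preview stops before it explains what partition-level backfill involves, so the following is only a sketch of one common reading: the target is split into daily partitions, and a backfill recomputes and overwrites each partition in a date range as a whole. The dict standing in for the target table, the function names, and the compute_partition callback are all hypothetical.

```python
from datetime import date, timedelta
from typing import Callable


def partitions_between(start: date, end: date) -> list[date]:
    """List every daily partition key from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def backfill(
    store: dict[date, list],
    start: date,
    end: date,
    compute_partition: Callable[[date], list],
) -> None:
    """Recompute each partition in the range and overwrite it wholesale.

    Replacing the whole partition (rather than appending to it) makes the
    operation idempotent: running the same backfill twice leaves the
    store in the same state.
    """
    for day in partitions_between(start, end):
        store[day] = compute_partition(day)


# Hypothetical target and transform, for illustration only.
target: dict[date, list] = {}
backfill(target, date(2024, 1, 1), date(2024, 1, 3), lambda d: [d.isoformat()])
print(sorted(target))
```

Because the regular daily run can call the same function with start equal to end, the forward-looking pipeline and the backfill share one code path in this design.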
How Do You Track History? (concepts: paScdPipeline)
A MERGE statement overwrites the target row with the latest values. The interviewer will ask: "But what if you need to know what the values were last week?" Overwriting destroys history. This is the SCD question, and your answer needs to be specific enough to prove you have implemented one. Vague hand-waving about "keeping old versions" is a red flag.
SCD Type 2 in Pipelines
SCD Type 2 keeps every version of a row. When a customer changes their address, you don't update the existing row - you cl
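The idea of keeping every version of a row can be sketched in memory as follows. This assumes the common SCD Type 2 shape in which the current version is end-dated and a new version is appended; the preview is cut off before it spells out the exact steps. The CustomerVersion class, the valid_from and valid_to column names, and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CustomerVersion:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(
    history: list[CustomerVersion],
    customer_id: int,
    new_address: str,
    change_date: date,
) -> list[CustomerVersion]:
    """End-date the current version (if any) and append the new one."""
    current = next(
        (v for v in history
         if v.customer_id == customer_id and v.valid_to is None),
        None,
    )
    if current is not None:
        if current.address == new_address:
            return history  # nothing changed, so no new version
        current.valid_to = change_date
    history.append(CustomerVersion(customer_id, new_address, change_date))
    return history


def address_as_of(
    history: list[CustomerVersion], customer_id: int, as_of: date
) -> Optional[str]:
    """Answer 'what was the value on this date?' from the version rows."""
    for v in history:
        if (v.customer_id == customer_id
                and v.valid_from <= as_of
                and (v.valid_to is None or as_of < v.valid_to)):
            return v.address
    return None


history: list[CustomerVersion] = []
apply_change(history, 1, "1 Old St", date(2024, 1, 1))
apply_change(history, 1, "2 New Ave", date(2024, 2, 1))
print(address_as_of(history, 1, date(2024, 1, 15)))
```

The half-open interval (valid_from inclusive, valid_to exclusive) is a design choice in this sketch; it keeps exactly one version valid on the change date itself.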