The Incremental Loading Question

The universal follow-up to every pipeline design question

Challenges: 0 hands-on challenges

Lesson Sections

Full Refresh or Incremental? (concepts: paFullVsIncremental)
At scale, the answer is never purely one or the other. The interviewer wants to hear you design a hybrid strategy: incremental daily loads for speed, periodic full refreshes for correctness. If you only say "incremental," they will probe until you admit the failure modes.

The Hybrid Pattern
Incremental loads accumulate drift. A missed CDC event, a race condition in the source system, a timezone bug that shifts one hour of data into the wrong partition - these errors compound silently. A weekly f
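
The hybrid strategy described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names, the choice of Sunday as the refresh day, and the use of per-partition row counts as the drift signal are all assumptions made for the example.

```python
from datetime import date


def choose_load_mode(run_date: date, full_refresh_weekday: int = 6) -> str:
    """Return 'full' on the scheduled refresh day, 'incremental' otherwise.

    full_refresh_weekday uses date.weekday() numbering (Monday=0, Sunday=6).
    The weekly cadence is an assumption for this sketch.
    """
    if run_date.weekday() == full_refresh_weekday:
        return "full"
    return "incremental"


def drifted_partitions(source_counts: dict, target_counts: dict) -> list:
    """List partitions whose row counts differ between source and target.

    A count mismatch is only one possible drift signal; checksums or
    column-level aggregates would catch more cases.
    """
    keys = source_counts.keys() | target_counts.keys()
    return sorted(
        k for k in keys if source_counts.get(k, 0) != target_counts.get(k, 0)
    )


if __name__ == "__main__":
    print(choose_load_mode(date(2024, 1, 7)))   # a Sunday
    print(choose_load_mode(date(2024, 1, 8)))   # a Monday
    print(drifted_partitions(
        {"2024-01-01": 10, "2024-01-02": 5},
        {"2024-01-01": 10, "2024-01-02": 4},
    ))
```

In an interview answer, the drift check is the part worth dwelling on: it gives you evidence for how often the full refresh is actually needed.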
How Do You Capture Changes? (concepts: paCdc)
The CDC question gets harder when the source is not a database with a WAL. APIs, third-party SaaS platforms, file drops, and event streams all require different change-detection strategies. The interviewer wants to hear that your CDC toolkit extends beyond Debezium.

CDC for Non-Database Sources
A third-party API has no write-ahead log. You cannot attach Debezium to Salesforce. Instead, you poll the API's list endpoint with a modified_since parameter and diff the results against your last snapsho
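
The poll-and-diff approach described above can be sketched as follows. This is a hedged illustration: the record shape, the `id` and `modified_at` field names, and both function names are hypothetical, and the actual HTTP call is left out because it depends on the API in question. Timestamps are assumed to be ISO-8601 strings in a single consistent format, so they compare correctly as plain strings.

```python
def diff_snapshot(previous: dict, polled: list, key: str = "id"):
    """Split polled records into inserts and updates.

    previous maps primary key -> last known record.
    polled is the list returned by the (hypothetical) modified_since query.
    """
    inserts, updates = [], []
    for record in polled:
        old = previous.get(record[key])
        if old is None:
            inserts.append(record)      # never seen before
        elif old != record:
            updates.append(record)      # seen before, content changed
    return inserts, updates


def next_watermark(polled: list, current: str, field: str = "modified_at") -> str:
    """Advance the modified_since watermark to the newest timestamp seen."""
    return max([current] + [r[field] for r in polled])


if __name__ == "__main__":
    previous = {1: {"id": 1, "name": "a", "modified_at": "2024-01-01T00:00:00Z"}}
    polled = [
        {"id": 1, "name": "b", "modified_at": "2024-01-02T00:00:00Z"},
        {"id": 2, "name": "c", "modified_at": "2024-01-03T00:00:00Z"},
    ]
    inserts, updates = diff_snapshot(previous, polled)
    print(inserts, updates)
    print(next_watermark(polled, "2024-01-01T00:00:00Z"))
```

One limitation to raise unprompted: a modified_since poll returns only records that still exist, so hard deletes are invisible to this diff unless the API exposes deleted records some other way.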
What About Schema Changes? (concepts: paSchemaEvolution, paDependencyMgmt)
Schema evolution is not a one-off problem you solve - it is a platform capability you build. The question shifts from 'how do I handle a schema change' to 'how do I make schema changes a routine, safe operation across hundreds of pipelines and dozens of teams.' The interviewer wants to hear platform thinking.

Breaking vs Non-Breaking: The Taxonomy
The last row is the most insidious. If a column called revenue changes from gross revenue to net revenue, the schema looks identical but every downstr
How Do You Backfill? (concepts: paBackfill, paDagOrchestration)
At scale, backfill is not an emergency operation - it is a first-class pipeline operation with its own scheduling, budgeting, and monitoring. The real question is not 'can you backfill?' but 'can you backfill 400 TB within a $2,000 compute budget without disrupting production workloads?' Your answer should demonstrate that you think about backfill as an engineering problem, not a fire drill.

Petabyte-Scale Backfill Architecture
A naive backfill of a 400 TB table reprocesses everything at once -
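
To make the budgeting framing concrete, here is a rough sketch of planning a backfill as bounded chunks with a cost check up front. Everything beyond the 400 TB and $2,000 figures is an assumption for illustration: the function name, the 25 TB chunk size, and the flat $4 per TB price are hypothetical, and real compute pricing is rarely linear in data volume.

```python
import math


def plan_backfill(total_tb: float, chunk_tb: float,
                  cost_per_tb_usd: float, budget_usd: float) -> list:
    """Split a backfill into chunks and refuse plans that exceed the budget.

    cost_per_tb_usd is a hypothetical flat rate used only for this sketch.
    """
    estimated = total_tb * cost_per_tb_usd
    if estimated > budget_usd:
        raise ValueError(
            f"estimated cost ${estimated:,.0f} exceeds budget ${budget_usd:,.0f}"
        )

    chunks = []
    remaining = total_tb
    for index in range(math.ceil(total_tb / chunk_tb)):
        size = min(chunk_tb, remaining)
        chunks.append({
            "chunk": index,
            "size_tb": size,
            "cost_usd": size * cost_per_tb_usd,
        })
        remaining -= size
    return chunks


if __name__ == "__main__":
    plan = plan_backfill(total_tb=400, chunk_tb=25,
                         cost_per_tb_usd=4.0, budget_usd=2000)
    print(len(plan), sum(c["cost_usd"] for c in plan))
```

Chunking also gives the scheduler something to throttle: each chunk can be run in an off-peak window or paused if production workloads need the capacity.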
How Do You Track History? (concepts: paScdPipeline)
Standard SCD Type 2 tracks what the data looked like at each point in time. But the follow-up is: what happens when you discover the data was wrong? A customer's address was entered incorrectly in January, and legal sends a correction in March. Do you update the January record? Insert a new version backdated to January? This is where bi-temporal modeling enters the picture - and where most candidates get stuck.

Late Corrections
A late correction is a change that should have been applied in the p
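
The address example above can be modeled with two time axes: when a fact was true in the real world, and when the warehouse learned it. The sketch below is one possible shape for that, not a standard schema. The class, field, and function names are hypothetical, as are the specific dates and addresses. The key property is that a correction never overwrites the original row, so you can still answer "what did we believe in February?"

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AddressVersion:
    customer_id: int
    address: str
    valid_from: date                      # when the fact became true in reality
    recorded_at: date                     # when the warehouse learned about it
    superseded_at: Optional[date] = None  # when a correction replaced this row


def apply_correction(history: List[AddressVersion], customer_id: int,
                     valid_from: date, address: str, recorded_at: date) -> None:
    """Close the currently believed row and append the corrected one."""
    for row in history:
        if (row.customer_id == customer_id
                and row.valid_from == valid_from
                and row.superseded_at is None):
            row.superseded_at = recorded_at
    history.append(AddressVersion(customer_id, address, valid_from, recorded_at))


def address_as_known(history: List[AddressVersion], customer_id: int,
                     valid_on: date, known_on: date) -> Optional[str]:
    """Address valid on valid_on, according to what was known on known_on."""
    candidates = [
        r for r in history
        if r.customer_id == customer_id
        and r.valid_from <= valid_on
        and r.recorded_at <= known_on
        and (r.superseded_at is None or r.superseded_at > known_on)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.valid_from).address


if __name__ == "__main__":
    history = [AddressVersion(1, "12 Wrong St", date(2024, 1, 10), date(2024, 1, 10))]
    apply_correction(history, 1, date(2024, 1, 10), "12 Right St", date(2024, 3, 5))
    print(address_as_known(history, 1, date(2024, 2, 1), known_on=date(2024, 2, 1)))
    print(address_as_known(history, 1, date(2024, 2, 1), known_on=date(2024, 3, 5)))
```

The two queries at the end ask about the same real-world date but differ in knowledge date, which is the distinction a single-axis SCD Type 2 table cannot express.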