Answer the incremental loading question that follows every pipeline design
Topics covered: Full vs Incremental Loading, Change Data Capture, Slowly Changing Dimensions, Schema Evolution, Backfilling
What They Want to Hear: 'A full refresh drops the entire table and reloads it from scratch. An incremental load processes only the rows that changed since the last run. I default to incremental because it is faster and cheaper, but I run a full refresh weekly as a safety net.' That is the answer: two strategies, a default choice, and a safety net.
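To make the trade-off concrete, here is a minimal sketch in plain Python. It assumes each source row carries an `updated_at` timestamp and that the target is a dict keyed by `id`; the names `full_refresh`, `incremental_load`, and the watermark handling are illustrative, not taken from any specific tool.

```python
from datetime import datetime


def full_refresh(source_rows):
    """Rebuild the target from scratch: discard everything, reload all rows."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_load(target, source_rows, watermark):
    """Upsert only rows modified after the watermark; return the new watermark."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


source = [
    {"id": 1, "name": "Ada", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Grace", "updated_at": datetime(2024, 1, 2)},
]
target = full_refresh(source)
watermark = max(row["updated_at"] for row in source)

# Before the next run, row 1 is updated and row 2 is hard-deleted at the source.
source = [{"id": 1, "name": "Ada L.", "updated_at": datetime(2024, 1, 3)}]

watermark = incremental_load(target, source, watermark)
ids_after_incremental = sorted(target)   # the deleted row 2 is still in the target

target = full_refresh(source)            # the periodic safety net
ids_after_full_refresh = sorted(target)  # now the target matches the source
```

The incremental run picks up the update cheaply but never notices the delete. That gap is what the weekly full refresh is there to close.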
What They Want to Hear: 'CDC (change data capture) reads the database's own change log to capture every insert, update, and delete. Unlike timestamp-based incremental loading, CDC catches deletes and does not miss intermediate changes to a row that was modified more than once between runs.' That is the answer. CDC solves the two biggest weaknesses of basic incremental loading: missed deletes and missed in-between changes.
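A small sketch of what applying a change log looks like. The event shape used here (`lsn`, `op`, `key`, `row`) is a simplified, hypothetical format for illustration; real CDC tools emit richer, tool-specific events.

```python
def apply_changes(target, events):
    """Replay change events in log order against a dict keyed by primary key."""
    for event in sorted(events, key=lambda e: e["lsn"]):
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:  # "insert" or "update": both become an upsert
            target[event["key"]] = event["row"]
    return target


# Hypothetical change log: order 1 changes three times, order 2 is created then deleted.
events = [
    {"lsn": 1, "op": "insert", "key": 1, "row": {"status": "new"}},
    {"lsn": 2, "op": "update", "key": 1, "row": {"status": "paid"}},
    {"lsn": 3, "op": "update", "key": 1, "row": {"status": "shipped"}},
    {"lsn": 4, "op": "insert", "key": 2, "row": {"status": "new"}},
    {"lsn": 5, "op": "delete", "key": 2, "row": None},
]

# Every intermediate state of order 1 is visible in the log.
status_history = [e["row"]["status"] for e in events if e["key"] == 1]

target = apply_changes({}, events)
```

A timestamp-based load that ran once after all five events would see only order 1 in its final state. It would never learn that order 2 existed, or that order 1 passed through `paid`.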
What They Want to Hear: 'SCD Type 2 keeps a full history of changes. When a value changes, I close the current row by setting its end date and insert a new row with the updated value and a new start date. This means I can always answer: what was the customer's address when they placed that order last March?' That is the answer. SCD Type 2 is about preserving history so you can join facts to the dimension values that were true at the time.
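The close-and-insert step can be sketched as follows. This is a minimal in-memory version, assuming a dimension stored as a list of dicts with `start_date` and `end_date` columns, where `end_date` of `None` marks the current row and validity is the half-open interval from start up to (not including) end. The function names and the customer data are made up for illustration.

```python
from datetime import date


def apply_scd2(dimension, customer_id, new_address, change_date):
    """Close the current row for the customer and append a new current row."""
    for row in dimension:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            if row["address"] == new_address:
                return  # nothing changed, keep the current row open
            row["end_date"] = change_date
    dimension.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": change_date,
        "end_date": None,
    })


def address_as_of(dimension, customer_id, as_of):
    """Return the address that was valid on the given date, or None."""
    for row in dimension:
        if (row["customer_id"] == customer_id
                and row["start_date"] <= as_of
                and (row["end_date"] is None or as_of < row["end_date"])):
            return row["address"]
    return None


dim_customer = []
apply_scd2(dim_customer, 42, "12 Oak St", date(2023, 1, 1))
apply_scd2(dim_customer, 42, "9 Elm Ave", date(2024, 6, 1))

address_last_march = address_as_of(dim_customer, 42, date(2024, 3, 15))
address_today = address_as_of(dim_customer, 42, date(2024, 7, 1))
```

The lookup in `address_as_of` is the same condition you would put in the join between a fact table's order date and the dimension's validity range.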
What They Want to Hear: 'I classify schema changes as additive or breaking. Adding a new column is additive and should be handled automatically. Renaming or removing a column is breaking and requires a migration plan. My pipeline detects schema drift on each run and either auto-adapts for additive changes or alerts the team for breaking ones.' That is the framework: additive vs breaking, auto-handle vs alert.
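One way to sketch the drift check is to compare the column names the pipeline expects against the names it observes on the current run. This version looks at names only, and `classify_drift` is a hypothetical helper, not part of any library.

```python
def classify_drift(expected_columns, observed_columns):
    """Classify a schema change as 'unchanged', 'additive', or 'breaking'."""
    expected = set(expected_columns)
    observed = set(observed_columns)
    added = sorted(observed - expected)
    removed = sorted(expected - observed)
    if removed:
        # A rename shows up as one removed column plus one added column.
        return "breaking", {"added": added, "removed": removed}
    if added:
        return "additive", {"added": added, "removed": removed}
    return "unchanged", {"added": added, "removed": removed}


expected = ["id", "email"]

new_column = classify_drift(expected, ["id", "email", "phone"])
renamed_column = classify_drift(expected, ["id", "email_address"])
no_change = classify_drift(expected, ["id", "email"])
```

In a pipeline, the `additive` branch would typically add the new column to the target and carry on, while the `breaking` branch would stop the load and alert the team.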
What They Want to Hear: 'Backfilling means reprocessing historical data, usually because a bug corrupted it or a pipeline was down. I backfill by re-running the pipeline for a specific date range. The pipeline must be idempotent so that re-running it produces the same result as running it once. I process one partition at a time to avoid overloading the system.' That is the answer. Backfill = re-run, idempotency = safety, partition-by-partition = control.
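A minimal sketch of an idempotent, partition-by-partition backfill. It assumes daily partitions and a warehouse modeled as a dict keyed by date; `process_partition` and `backfill` are illustrative names. The key design choice is that each partition is overwritten rather than appended to.

```python
from datetime import date, timedelta


def process_partition(warehouse, source_rows, day):
    """Recompute one day's total and overwrite that partition (never append)."""
    warehouse[day] = sum(row["amount"] for row in source_rows if row["day"] == day)


def backfill(warehouse, source_rows, start, end):
    """Re-run the pipeline one daily partition at a time over [start, end]."""
    day = start
    while day <= end:
        process_partition(warehouse, source_rows, day)
        day += timedelta(days=1)


source = [
    {"day": date(2024, 3, 1), "amount": 10},
    {"day": date(2024, 3, 1), "amount": 5},
    {"day": date(2024, 3, 2), "amount": 7},
]

# Suppose a bug wrote a wrong total for March 1.
warehouse = {date(2024, 3, 1): 999}

backfill(warehouse, source, date(2024, 3, 1), date(2024, 3, 3))
after_first_run = dict(warehouse)

backfill(warehouse, source, date(2024, 3, 1), date(2024, 3, 3))
after_second_run = dict(warehouse)
```

Because each run replaces the partition, the second backfill leaves the warehouse exactly as the first one did. An append-based version would have doubled every total.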