Keeping Data Fresh
Master the incremental loading patterns that interviewers probe hardest
- Category
- Pipeline Architecture
- Difficulty
- Beginner
- Duration
- 25 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Merge Strategies, CDC Patterns, SCD in Pipelines, Schema Migration, Partition-Level Backfill
Lesson Sections
- Merge Strategies (concepts: paFullVsIncremental)
What They Want to Hear: 'I pick the merge strategy based on table size and access pattern. For tables under 100M rows, MERGE/UPSERT on the primary key is straightforward and correct. For larger tables, I use partition REPLACE: delete the entire partition for the date range, then insert fresh data. This avoids the row-level matching that makes MERGE slow at scale.' This is the answer that shows you have hit the performance wall and solved it.
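The two strategies can be sketched in Python against in-memory tables. This is an illustrative sketch, not a warehouse API: the dict-based "tables", function names, and column names are assumptions made for the example.

```python
from collections import defaultdict

def merge_upsert(target: dict, rows: list, key: str) -> None:
    """Row-level MERGE/UPSERT: match each incoming row on its primary key,
    update it if the key exists, insert it otherwise."""
    for row in rows:
        target[row[key]] = row  # one match-and-write per row

def partition_replace(partitions: dict, fresh_rows: list, part_col: str) -> None:
    """Partition REPLACE: group fresh rows by partition, then swap each
    affected partition wholesale -- no row-level matching at all."""
    fresh = defaultdict(list)
    for row in fresh_rows:
        fresh[row[part_col]].append(row)
    for part, rows in fresh.items():
        partitions[part] = rows  # delete-then-insert in one step
```

The design difference is visible in the loops: `merge_upsert` does work proportional to the number of rows matched, while `partition_replace` only ever touches whole partitions, which is why it stays fast on very large tables.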
- CDC Patterns (concepts: paCdc)
What They Want to Hear: 'I use WAL-based CDC because it has near-zero impact on the source database. Debezium reads the Postgres WAL or MySQL binlog, streams change events to Kafka, and my pipeline consumes from Kafka to apply inserts, updates, and deletes to the target. I avoid trigger-based CDC because triggers add latency to every write on the source and are fragile at scale.' This is the answer that shows you have run CDC in production and understand the operational tradeoffs.
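The consume-and-apply step can be sketched like this. The simplified event shape with `op`, `key`, and `after` fields is an assumption loosely modeled on a Debezium-style envelope, not the exact wire format:

```python
def apply_change_event(target: dict, event: dict) -> None:
    """Apply one change event to a keyed target table.
    Assumed event shape: {"op": "c"|"u"|"d", "key": ..., "after": row_or_None}."""
    op, key = event["op"], event["key"]
    if op in ("c", "u"):        # create or update: upsert the 'after' image
        target[key] = event["after"]
    elif op == "d":             # delete: remove the row if present
        target.pop(key, None)

def consume(target: dict, events: list) -> dict:
    for ev in events:           # events must be applied in log order
        apply_change_event(target, ev)
    return target
```

Applying events in log order is what makes the target converge to the source's state; reordering an update past a delete would corrupt the replica.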
- SCD in Pipelines (concepts: paScdPipeline)
What They Want to Hear: 'I implement SCD Type 2 with a MERGE statement that does two things: when a matching row's attributes have changed, it closes the current row by setting end_date and is_current = false, and inserts a new row with the updated values. The surrogate key is a hash of the business key plus the start_date, which makes it deterministic and idempotent.' This is the answer that shows you have built SCD pipelines, not just drawn them on a whiteboard.
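A minimal in-memory sketch of that SCD Type 2 merge, assuming illustrative field names (`business_key`, `attrs`) and representing the dimension as a list of row dicts:

```python
import hashlib

def surrogate_key(business_key: str, start_date: str) -> str:
    # Deterministic hash of business key + start_date, so re-running the
    # same load produces the same key (idempotent).
    return hashlib.sha256(f"{business_key}|{start_date}".encode()).hexdigest()[:16]

def scd2_merge(dim: list, incoming: list, load_date: str) -> list:
    """Close changed current rows and insert new versions (SCD Type 2)."""
    current = {r["business_key"]: r for r in dim if r["is_current"]}
    for row in incoming:
        bk = row["business_key"]
        cur = current.get(bk)
        if cur is not None and cur["attrs"] == row["attrs"]:
            continue                      # unchanged: nothing to do
        if cur is not None:
            cur["end_date"] = load_date   # close the current row
            cur["is_current"] = False
        dim.append({                      # insert the new version
            "sk": surrogate_key(bk, load_date),
            "business_key": bk,
            "attrs": row["attrs"],
            "start_date": load_date,
            "end_date": None,
            "is_current": True,
        })
    return dim
```

Note the idempotency property the quote claims: re-running the merge with the same incoming data finds the new row already current with matching attributes and does nothing.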
- Schema Migration (concepts: paSchemaEvolution)
What They Want to Hear: 'I enforce backwards compatibility by default. New columns are added with a default value. Old columns are never removed in the same release as the new ones: I deprecate first, migrate consumers, then remove. For breaking changes, I version the schema and run both versions in parallel during the migration window.' This is the answer that shows you think about consumers, not just your own pipeline.
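The add-a-column-with-a-default pattern can be sketched as a backwards-compatible reader. The `loyalty_tier` column and its default are hypothetical, chosen only to make the example concrete:

```python
# Columns added in schema v2, each with a default for rows written under v1.
SCHEMA_V2_DEFAULTS = {"loyalty_tier": "none"}  # hypothetical new column

def read_row_v2(raw: dict) -> dict:
    """Backwards-compatible reader: rows written under the old schema simply
    lack the new column, so fill in the default instead of failing."""
    row = dict(raw)
    for col, default in SCHEMA_V2_DEFAULTS.items():
        row.setdefault(col, default)
    return row
```

Because old rows are never rejected, v1 writers and v2 readers can coexist during the whole migration window, which is what lets you deprecate first and remove later.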
- Partition-Level Backfill (concepts: paBackfill)
What They Want to Hear: 'I backfill at the partition level. Each partition is an independent unit of work: I can re-run it without affecting other partitions. In Airflow, I use the catchup feature or a dedicated backfill DAG with a configurable date range. I run backfills with lower priority than production tasks and validate each partition before moving to the next.' This is the answer that shows you have done this operationally, not theoretically.
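The partition loop can be sketched as follows; the `run_partition` and `validate` callables are placeholders for your actual load task and data-quality check, not a real orchestrator API:

```python
from datetime import date, timedelta

def backfill(start: date, end: date, run_partition, validate) -> list:
    """Backfill a date range one partition at a time. Each partition is an
    independent unit of work: re-run it without touching its neighbours,
    and stop at the first partition that fails validation."""
    done = []
    d = start
    while d <= end:
        ds = d.isoformat()
        run_partition(ds)            # must be idempotent: safe to re-run
        if not validate(ds):         # check before moving to the next day
            raise RuntimeError(f"validation failed for partition {ds}")
        done.append(ds)
        d += timedelta(days=1)
    return done
```

Failing fast on the first bad partition is the operational point: you fix one day's data and resume from there, rather than discovering at the end that weeks of partitions are wrong.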