Keeping Data Fresh: Beginner
You designed the pipeline. The interviewer nods. Then: 'How do you keep it up to date?' This is the most common follow-up in pipeline interviews. They want to hear full refresh vs incremental, change data capture, and what happens when you need to reprocess history. Here is exactly how to answer.
Full vs Incremental Loading
Explain full vs incremental loading and pick the right one
When you hear these in an interview, this is the concept being tested
- ▸"Do you reload everything or just the changes?"
- ▸"How do you handle incremental updates?"
- ▸"Full refresh vs incremental: when do you use each?"
What They Want to Hear
- Simplest to implement: SELECT * and overwrite
- Guaranteed correct: no missed updates
- Scales poorly: 1 billion rows takes hours
- Safe default for small tables under 10M rows
- Faster: only process new or changed rows
- Cheaper: less compute, less I/O
- More complex: needs a change marker (timestamp or log)
- Can miss hard deletes without CDC
After your initial answer, expect these probes
- ▸"What if the source has no updated_at column?" Full refresh is the only safe option. Or push the source team to add CDC. Some teams hash each row and compare, but that costs almost as much as a full reload.
- ▸"What about deleted rows?" Incremental with timestamps cannot detect hard deletes. You need CDC or soft deletes with a deleted_at flag.
- ▸"When would full refresh actually be better?" Small tables, tables with no reliable change marker, or after schema changes that invalidate old data.
Change Data Capture
Explain CDC and why it beats timestamp-based loading
When you hear these in an interview, this is the concept being tested
- ▸"How do you capture changes from the source database?"
- ▸"What is CDC?"
- ▸"How do you detect deleted rows?"
What They Want to Hear
| Method | Catches Inserts | Catches Updates | Catches Deletes | Source Impact |
|---|---|---|---|---|
| Timestamp filter | Yes | Yes | No | Query load on source |
| CDC (log-based) | Yes | Yes | Yes | Reads the WAL; minimal impact |
| Full refresh | Yes | Yes | Yes | Heaviest: reads entire table |
The Tools to Name-Drop
| Tool | Type | One-Liner for Interviews |
|---|---|---|
| Debezium | Open source | Reads database WAL and streams changes to Kafka |
| Fivetran / Airbyte | Managed | Handles CDC automatically; you configure the connection |
| AWS DMS | Cloud-native | Database Migration Service with ongoing replication mode |
After your initial answer, expect these probes
- ▸"What is a WAL?" Write-Ahead Log. The database writes every change to a log before applying it. CDC tools read this log to capture changes without querying the table directly.
- ▸"Does CDC add load to the source database?" Log-based CDC adds almost none. It reads a file the database already writes. Trigger-based CDC adds significant overhead because it runs code on every row change.
- ▸"When would you NOT use CDC?" When the source is a flat file, an API, or a system that does not expose a change log. CDC is specifically for databases.
Slowly Changing Dimensions
Answer SCD questions and explain the Type 2 join pattern
When you hear these in an interview, this is the concept being tested
- ▸"How do you track historical changes?"
- ▸"A customer changes their address. How does your pipeline handle that?"
- ▸"What is SCD Type 2?"
What They Want to Hear
| SCD Type | What Happens on Change | Use When |
|---|---|---|
| Type 1 | Overwrite the old value | History does not matter (e.g., fixing a typo) |
| Type 2 | Keep old row, insert new row with dates | History matters (e.g., customer address, pricing) |
| Type 3 | Add a column for the previous value | Only need one prior value (rare in practice) |
After your initial answer, expect these probes
- ▸"How do you join a fact table to an SCD Type 2 dimension?" Join on the business key AND filter where the fact's event_date falls between the dimension's start_date and end_date. This gives you the dimension values that were true at the time of the event.
- ▸"Type 2 tables grow forever. How do you manage that?" In practice, dimension tables grow slowly. A customer table with 10M customers and 3 changes per customer averages 30M rows. That is manageable. Partition by is_current for fast lookups.
- ▸"When would you use Type 1 instead?" When history does not matter. Fixing a misspelling, correcting a data entry error, or updating a non-analytical attribute.
Schema Evolution
Classify schema changes as safe or breaking and handle each
When you hear these in an interview, this is the concept being tested
- ▸"The source team adds a new column. What happens to your pipeline?"
- ▸"How do you handle schema changes?"
- ▸"What breaks when the source schema changes?"
What They Want to Hear
- New column added to source
- New enum value in a category field
- New table added to source schema
- Action: auto-add, notify downstream
- Column renamed or removed
- Data type changed (string to int)
- Primary key definition changed
- Action: halt pipeline, alert, plan migration
After your initial answer, expect these probes
- ▸"How do you detect schema drift automatically?" Compare the source schema to the last known schema before processing. Tools like Great Expectations or custom checks can validate column names and types on each run.
- ▸"What if you need to backfill after a schema change?" That depends on whether historical data needs the new column. If yes, you backfill with a default value or re-extract from source. If no, you apply the change going forward only.
- ▸"How do file formats handle schema evolution?" Parquet and Avro both support adding columns without breaking existing readers. But removing or renaming columns requires rewriting files.
Backfilling
Explain how to backfill safely and prioritize correctly
When you hear these in an interview, this is the concept being tested
- ▸"You found a bug in your transformation. How do you fix historical data?"
- ▸"How do you backfill a table?"
- ▸"A pipeline was down for 3 days. What now?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"How do you backfill without impacting production queries?" Process during off-peak hours. Write to a staging table first, then swap it in. Or use partition-level REPLACE so only one partition is unavailable at a time.
- ▸"What if the source data is no longer available for the backfill period?" This is why data retention policies matter. If the raw data in your landing zone is deleted after 30 days, you cannot backfill beyond that window.
- ▸"How do you prioritize: backfill old data or keep processing new data?" New data first, always. Run backfills in a separate pipeline or lower-priority queue so current processing is not blocked.
Answer the incremental loading question that follows every pipeline design
- Category
- Pipeline Architecture
- Difficulty
- beginner
- Duration
- 20 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Full vs Incremental Loading, Change Data Capture, Slowly Changing Dimensions, Schema Evolution, Backfilling
Lesson Sections
- Full vs Incremental Loading (concepts: paFullVsIncremental)
What They Want to Hear 'Full refresh drops the entire table and reloads from scratch. Incremental only processes rows that changed since the last run. I default to incremental because it is faster and cheaper, but I run a full refresh weekly as a safety net.' That is the answer. Two strategies, a default choice, and the safety valve.
- Change Data Capture (concepts: paCdc)
What They Want to Hear 'CDC reads the database's own change log to capture every insert, update, and delete. Unlike timestamp-based incremental loading, CDC catches deletes and does not miss rows that changed between runs.' That is the answer. CDC solves the two biggest weaknesses of basic incremental loading: missed deletes and missed in-between changes. The Tools to Name-Drop
- Slowly Changing Dimensions (concepts: paScdPipeline)
What They Want to Hear 'SCD Type 2 keeps a full history of changes. When a value changes, I close the current row by setting an end date, and insert a new row with the updated value and a new start date. This means I can always answer: what was the customer's address when they placed that order last March?' That is the answer. SCD is about preserving history so you can join facts to the dimension values that were true at the time.
- Schema Evolution (concepts: paSchemaEvolution)
What They Want to Hear 'I classify schema changes as additive or breaking. Adding a new column is additive and should be handled automatically. Renaming or removing a column is breaking and requires a migration plan. My pipeline detects schema drift on each run and either auto-adapts for additive changes or alerts the team for breaking ones.' That is the framework. Additive vs breaking. Auto-handle vs alert.
- Backfilling (concepts: paBackfill)
What They Want to Hear 'Backfilling means reprocessing historical data, usually because a bug corrupted it or a pipeline was down. I backfill by re-running the pipeline for a specific date range. The pipeline must be idempotent so that re-running produces the same result as running once. I process one partition at a time to avoid overloading the system.' That is the answer. Backfill = re-run, idempotency = safety, partition-by-partition = control.