Keeping Data Fresh: Beginner

You designed the pipeline. The interviewer nods. Then: 'How do you keep it up to date?' This is the most common follow-up in pipeline interviews. They want to hear full refresh vs incremental, change data capture, and what happens when you need to reprocess history. Here is exactly how to answer.

Full vs Incremental Loading

Daily Life
Interviews

Explain full vs incremental loading and pick the right one

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Do you reload everything or just the changes?"
  • "How do you handle incremental updates?"
  • "Full refresh vs incremental: when do you use each?"

What They Want to Hear

'Full refresh drops the entire table and reloads from scratch. Incremental only processes rows that changed since the last run. I default to incremental because it is faster and cheaper, but I run a full refresh weekly as a safety net.' That is the answer. Two strategies, a default choice, and the safety valve.
Full Refresh
  • Simplest to implement: SELECT * and overwrite
  • Guaranteed correct: no missed updates
  • Scales poorly: 1 billion rows takes hours
  • Safe default for small tables under 10M rows
Incremental
  • Faster: only process new or changed rows
  • Cheaper: less compute, less I/O
  • More complex: needs a change marker (timestamp or log)
  • Can miss hard deletes without CDC
What to Whiteboard
incremental readdelta only
Source Table
500M rows total
Filter: WHERE updated_at > last_run
50K changed rows
MERGE into Target
Upsert new + changed
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if the source has no updated_at column?" Full refresh is the only safe option. Or push the source team to add CDC. Some teams hash each row and compare, but that costs almost as much as a full reload.
  • "What about deleted rows?" Incremental with timestamps cannot detect hard deletes. You need CDC or soft deletes with a deleted_at flag.
  • "When would full refresh actually be better?" Small tables, tables with no reliable change marker, or after schema changes that invalidate old data.
1SELECT *
2FROM source.orders
3WHERE updated_at > '2025-03-24 00:00:00' CREATE OR REPLACE TABLE target.orders AS
4
5
6SELECT *
7FROM source.orders
KEY TAKEAWAYS
Say: 'Incremental by default, full refresh as a safety net. Incremental needs a change marker like updated_at.'
The one weakness to always mention: incremental misses hard deletes without CDC
Small tables under 10M rows? Full refresh is simpler and fast enough

Change Data Capture

Daily Life
Interviews

Explain CDC and why it beats timestamp-based loading

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you capture changes from the source database?"
  • "What is CDC?"
  • "How do you detect deleted rows?"

What They Want to Hear

'CDC reads the database's own change log to capture every insert, update, and delete. Unlike timestamp-based incremental loading, CDC catches deletes and does not miss rows that changed between runs.' That is the answer. CDC solves the two biggest weaknesses of basic incremental loading: missed deletes and missed in-between changes.
MethodCatches InsertsCatches UpdatesCatches DeletesSource Impact
Timestamp filterYesYesNoQuery load on source
CDC (log-based)YesYesYesReads the WAL; minimal impact
Full refreshYesYesYesHeaviest: reads entire table

The Tools to Name-Drop

ToolTypeOne-Liner for Interviews
DebeziumOpen sourceReads database WAL and streams changes to Kafka
Fivetran / AirbyteManagedHandles CDC automatically; you configure the connection
AWS DMSCloud-nativeDatabase Migration Service with ongoing replication mode
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What is a WAL?" Write-Ahead Log. The database writes every change to a log before applying it. CDC tools read this log to capture changes without querying the table directly.
  • "Does CDC add load to the source database?" Log-based CDC adds almost none. It reads a file the database already writes. Trigger-based CDC adds significant overhead because it runs code on every row change.
  • "When would you NOT use CDC?" When the source is a flat file, an API, or a system that does not expose a change log. CDC is specifically for databases.
KEY TAKEAWAYS
Say: 'CDC reads the database change log to capture inserts, updates, and deletes with minimal source impact.'
CDC solves the hard-delete problem that timestamp-based incremental loading cannot
Name Debezium for open source, Fivetran for managed. Know what a WAL is.

Slowly Changing Dimensions

Daily Life
Interviews

Answer SCD questions and explain the Type 2 join pattern

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you track historical changes?"
  • "A customer changes their address. How does your pipeline handle that?"
  • "What is SCD Type 2?"

What They Want to Hear

'SCD Type 2 keeps a full history of changes. When a value changes, I close the current row by setting an end date, and insert a new row with the updated value and a new start date. This means I can always answer: what was the customer's address when they placed that order last March?' That is the answer. SCD is about preserving history so you can join facts to the dimension values that were true at the time.
SCD TypeWhat Happens on ChangeUse When
Type 1Overwrite the old valueHistory does not matter (e.g., fixing a typo)
Type 2Keep old row, insert new row with datesHistory matters (e.g., customer address, pricing)
Type 3Add a column for the previous valueOnly need one prior value (rare in practice)
What to Whiteboard
expire old recordactivate new record
Customer moves to NYC
Address changed
Close current row
Set end_date = today
Insert new row
start_date = today, is_current = true
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you join a fact table to an SCD Type 2 dimension?" Join on the business key AND filter where the fact's event_date falls between the dimension's start_date and end_date. This gives you the dimension values that were true at the time of the event.
  • "Type 2 tables grow forever. How do you manage that?" In practice, dimension tables grow slowly. A customer table with 10M customers and 3 changes per customer averages 30M rows. That is manageable. Partition by is_current for fast lookups.
  • "When would you use Type 1 instead?" When history does not matter. Fixing a misspelling, correcting a data entry error, or updating a non-analytical attribute.
KEY TAKEAWAYS
Say: 'SCD Type 2 preserves history by closing the old row and inserting a new one with date ranges.'
The join pattern: fact.event_date BETWEEN dim.start_date AND dim.end_date
Type 1 overwrites, Type 2 keeps history. Know when to use each.

Schema Evolution

Daily Life
Interviews

Classify schema changes as safe or breaking and handle each

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "The source team adds a new column. What happens to your pipeline?"
  • "How do you handle schema changes?"
  • "What breaks when the source schema changes?"

What They Want to Hear

'I classify schema changes as additive or breaking. Adding a new column is additive and should be handled automatically. Renaming or removing a column is breaking and requires a migration plan. My pipeline detects schema drift on each run and either auto-adapts for additive changes or alerts the team for breaking ones.' That is the framework. Additive vs breaking. Auto-handle vs alert.
Additive (Safe)
  • New column added to source
  • New enum value in a category field
  • New table added to source schema
  • Action: auto-add, notify downstream
Breaking (Dangerous)
  • Column renamed or removed
  • Data type changed (string to int)
  • Primary key definition changed
  • Action: halt pipeline, alert, plan migration
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you detect schema drift automatically?" Compare the source schema to the last known schema before processing. Tools like Great Expectations or custom checks can validate column names and types on each run.
  • "What if you need to backfill after a schema change?" That depends on whether historical data needs the new column. If yes, you backfill with a default value or re-extract from source. If no, you apply the change going forward only.
  • "How do file formats handle schema evolution?" Parquet and Avro both support adding columns without breaking existing readers. But removing or renaming columns requires rewriting files.
KEY TAKEAWAYS
Say: 'Additive changes auto-adapt. Breaking changes halt and alert. Detect drift on every run.'
Adding a column is safe. Removing or renaming a column breaks consumers.
Parquet supports additive schema evolution natively. That is why it is the standard.

Backfilling

Daily Life
Interviews

Explain how to backfill safely and prioritize correctly

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "You found a bug in your transformation. How do you fix historical data?"
  • "How do you backfill a table?"
  • "A pipeline was down for 3 days. What now?"

What They Want to Hear

'Backfilling means reprocessing historical data, usually because a bug corrupted it or a pipeline was down. I backfill by re-running the pipeline for a specific date range. The pipeline must be idempotent so that re-running produces the same result as running once. I process one partition at a time to avoid overloading the system.' That is the answer. Backfill = re-run, idempotency = safety, partition-by-partition = control.
What to Whiteboard
partition by partitionvalidate each batch
Detect Bad Data
Bug found or pipeline was down
Scope the Backfill
Which date range is affected?
Re-run Pipeline
One partition at a time, idempotent
Verify Output
Row counts, checksums, spot checks
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you backfill without impacting production queries?" Process during off-peak hours. Write to a staging table first, then swap it in. Or use partition-level REPLACE so only one partition is unavailable at a time.
  • "What if the source data is no longer available for the backfill period?" This is why data retention policies matter. If the raw data in your landing zone is deleted after 30 days, you cannot backfill beyond that window.
  • "How do you prioritize: backfill old data or keep processing new data?" New data first, always. Run backfills in a separate pipeline or lower-priority queue so current processing is not blocked.
KEY TAKEAWAYS
Say: 'Backfill means re-running the pipeline for a historical date range. It requires idempotent writes.'
Always scope the backfill first: which partitions are affected?
New data takes priority over backfills. Never block production processing.

Answer the incremental loading question that follows every pipeline design

Category
Pipeline Architecture
Difficulty
beginner
Duration
20 minutes
Challenges
0 hands-on challenges

Topics covered: Full vs Incremental Loading, Change Data Capture, Slowly Changing Dimensions, Schema Evolution, Backfilling

Lesson Sections

  1. Full vs Incremental Loading (concepts: paFullVsIncremental)

    What They Want to Hear 'Full refresh drops the entire table and reloads from scratch. Incremental only processes rows that changed since the last run. I default to incremental because it is faster and cheaper, but I run a full refresh weekly as a safety net.' That is the answer. Two strategies, a default choice, and the safety valve.

  2. Change Data Capture (concepts: paCdc)

    What They Want to Hear 'CDC reads the database's own change log to capture every insert, update, and delete. Unlike timestamp-based incremental loading, CDC catches deletes and does not miss rows that changed between runs.' That is the answer. CDC solves the two biggest weaknesses of basic incremental loading: missed deletes and missed in-between changes. The Tools to Name-Drop

  3. Slowly Changing Dimensions (concepts: paScdPipeline)

    What They Want to Hear 'SCD Type 2 keeps a full history of changes. When a value changes, I close the current row by setting an end date, and insert a new row with the updated value and a new start date. This means I can always answer: what was the customer's address when they placed that order last March?' That is the answer. SCD is about preserving history so you can join facts to the dimension values that were true at the time.

  4. Schema Evolution (concepts: paSchemaEvolution)

    What They Want to Hear 'I classify schema changes as additive or breaking. Adding a new column is additive and should be handled automatically. Renaming or removing a column is breaking and requires a migration plan. My pipeline detects schema drift on each run and either auto-adapts for additive changes or alerts the team for breaking ones.' That is the framework. Additive vs breaking. Auto-handle vs alert.

  5. Backfilling (concepts: paBackfill)

    What They Want to Hear 'Backfilling means reprocessing historical data, usually because a bug corrupted it or a pipeline was down. I backfill by re-running the pipeline for a specific date range. The pipeline must be idempotent so that re-running produces the same result as running once. I process one partition at a time to avoid overloading the system.' That is the answer. Backfill = re-run, idempotency = safety, partition-by-partition = control.