Keeping Data Fresh: Beginner

You designed the pipeline. The interviewer nods. Then: 'How do you keep it up to date?' This is the most common follow-up in pipeline interviews. They want to hear full refresh vs incremental, change data capture, and what happens when you need to reprocess history. Here is exactly how to answer.

Full vs Incremental Loading

Daily Life

Interviews

Explain full vs incremental loading and pick the right one

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Do you reload everything or just the changes?"
▸"How do you handle incremental updates?"
▸"Full refresh vs incremental: when do you use each?"

What They Want to Hear

'Full refresh drops the entire table and reloads from scratch. Incremental only processes rows that changed since the last run. I default to incremental because it is faster and cheaper, but I run a full refresh weekly as a safety net.' That is the answer. Two strategies, a default choice, and the safety valve.

•Full Refresh

Simplest to implement: SELECT * and overwrite
Guaranteed correct: no missed updates
Scales poorly: 1 billion rows takes hours
Safe default for small tables under 10M rows

•Incremental

Faster: only process new or changed rows
Cheaper: less compute, less I/O
More complex: needs a change marker (timestamp or log)
Can miss hard deletes without CDC

What to Whiteboard

Source Table

500M rows total

Filter: WHERE updated_at > last_run

50K changed rows

MERGE into Target

Upsert new + changed

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if the source has no updated_at column?" Full refresh is the only safe option. Or push the source team to add CDC. Some teams hash each row and compare, but that costs almost as much as a full reload.
▸"What about deleted rows?" Incremental with timestamps cannot detect hard deletes. You need CDC or soft deletes with a deleted_at flag.
▸"When would full refresh actually be better?" Small tables, tables with no reliable change marker, or after schema changes that invalidate old data.

1	SELECT *
2	FROM source.orders
3	WHERE updated_at > '2025-03-24 00:00:00' CREATE OR REPLACE TABLE target.orders AS
4
5
6	SELECT *
7	FROM source.orders

KEY TAKEAWAYS

Say: 'Incremental by default, full refresh as a safety net. Incremental needs a change marker like updated_at.'

The one weakness to always mention: incremental misses hard deletes without CDC

Small tables under 10M rows? Full refresh is simpler and fast enough

Change Data Capture

Daily Life

Interviews

Explain CDC and why it beats timestamp-based loading

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you capture changes from the source database?"
▸"What is CDC?"
▸"How do you detect deleted rows?"

What They Want to Hear

'CDC reads the database's own change log to capture every insert, update, and delete. Unlike timestamp-based incremental loading, CDC catches deletes and does not miss rows that changed between runs.' That is the answer. CDC solves the two biggest weaknesses of basic incremental loading: missed deletes and missed in-between changes.

Method	Catches Inserts	Catches Updates	Catches Deletes	Source Impact
Timestamp filter	Yes	Yes	No	Query load on source
CDC (log-based)	Yes	Yes	Yes	Reads the WAL; minimal impact
Full refresh	Yes	Yes	Yes	Heaviest: reads entire table

The Tools to Name-Drop

Tool	Type	One-Liner for Interviews
Debezium	Open source	Reads database WAL and streams changes to Kafka
Fivetran / Airbyte	Managed	Handles CDC automatically; you configure the connection
AWS DMS	Cloud-native	Database Migration Service with ongoing replication mode

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What is a WAL?" Write-Ahead Log. The database writes every change to a log before applying it. CDC tools read this log to capture changes without querying the table directly.
▸"Does CDC add load to the source database?" Log-based CDC adds almost none. It reads a file the database already writes. Trigger-based CDC adds significant overhead because it runs code on every row change.
▸"When would you NOT use CDC?" When the source is a flat file, an API, or a system that does not expose a change log. CDC is specifically for databases.

KEY TAKEAWAYS

Say: 'CDC reads the database change log to capture inserts, updates, and deletes with minimal source impact.'

CDC solves the hard-delete problem that timestamp-based incremental loading cannot

Name Debezium for open source, Fivetran for managed. Know what a WAL is.

Slowly Changing Dimensions

Daily Life

Interviews

Answer SCD questions and explain the Type 2 join pattern

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you track historical changes?"
▸"A customer changes their address. How does your pipeline handle that?"
▸"What is SCD Type 2?"

What They Want to Hear

'SCD Type 2 keeps a full history of changes. When a value changes, I close the current row by setting an end date, and insert a new row with the updated value and a new start date. This means I can always answer: what was the customer's address when they placed that order last March?' That is the answer. SCD is about preserving history so you can join facts to the dimension values that were true at the time.

SCD Type	What Happens on Change	Use When
Type 1	Overwrite the old value	History does not matter (e.g., fixing a typo)
Type 2	Keep old row, insert new row with dates	History matters (e.g., customer address, pricing)
Type 3	Add a column for the previous value	Only need one prior value (rare in practice)

What to Whiteboard

Customer moves to NYC

Address changed

Close current row

Set end_date = today

Insert new row

start_date = today, is_current = true

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you join a fact table to an SCD Type 2 dimension?" Join on the business key AND filter where the fact's event_date falls between the dimension's start_date and end_date. This gives you the dimension values that were true at the time of the event.
▸"Type 2 tables grow forever. How do you manage that?" In practice, dimension tables grow slowly. A customer table with 10M customers and 3 changes per customer averages 30M rows. That is manageable. Partition by is_current for fast lookups.
▸"When would you use Type 1 instead?" When history does not matter. Fixing a misspelling, correcting a data entry error, or updating a non-analytical attribute.

KEY TAKEAWAYS

Say: 'SCD Type 2 preserves history by closing the old row and inserting a new one with date ranges.'

The join pattern: fact.event_date BETWEEN dim.start_date AND dim.end_date

Type 1 overwrites, Type 2 keeps history. Know when to use each.

Schema Evolution

Daily Life

Interviews

Classify schema changes as safe or breaking and handle each

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"The source team adds a new column. What happens to your pipeline?"
▸"How do you handle schema changes?"
▸"What breaks when the source schema changes?"

What They Want to Hear

'I classify schema changes as additive or breaking. Adding a new column is additive and should be handled automatically. Renaming or removing a column is breaking and requires a migration plan. My pipeline detects schema drift on each run and either auto-adapts for additive changes or alerts the team for breaking ones.' That is the framework. Additive vs breaking. Auto-handle vs alert.

•Additive (Safe)

New column added to source
New enum value in a category field
New table added to source schema
Action: auto-add, notify downstream

•Breaking (Dangerous)

Column renamed or removed
Data type changed (string to int)
Primary key definition changed
Action: halt pipeline, alert, plan migration

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you detect schema drift automatically?" Compare the source schema to the last known schema before processing. Tools like Great Expectations or custom checks can validate column names and types on each run.
▸"What if you need to backfill after a schema change?" That depends on whether historical data needs the new column. If yes, you backfill with a default value or re-extract from source. If no, you apply the change going forward only.
▸"How do file formats handle schema evolution?" Parquet and Avro both support adding columns without breaking existing readers. But removing or renaming columns requires rewriting files.

KEY TAKEAWAYS

Say: 'Additive changes auto-adapt. Breaking changes halt and alert. Detect drift on every run.'

Adding a column is safe. Removing or renaming a column breaks consumers.

Parquet supports additive schema evolution natively. That is why it is the standard.

Backfilling

Daily Life

Interviews

Explain how to backfill safely and prioritize correctly

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"You found a bug in your transformation. How do you fix historical data?"
▸"How do you backfill a table?"
▸"A pipeline was down for 3 days. What now?"

What They Want to Hear

'Backfilling means reprocessing historical data, usually because a bug corrupted it or a pipeline was down. I backfill by re-running the pipeline for a specific date range. The pipeline must be idempotent so that re-running produces the same result as running once. I process one partition at a time to avoid overloading the system.' That is the answer. Backfill = re-run, idempotency = safety, partition-by-partition = control.

What to Whiteboard

Detect Bad Data

Bug found or pipeline was down

Scope the Backfill

Which date range is affected?

Re-run Pipeline

One partition at a time, idempotent

Verify Output

Row counts, checksums, spot checks

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you backfill without impacting production queries?" Process during off-peak hours. Write to a staging table first, then swap it in. Or use partition-level REPLACE so only one partition is unavailable at a time.
▸"What if the source data is no longer available for the backfill period?" This is why data retention policies matter. If the raw data in your landing zone is deleted after 30 days, you cannot backfill beyond that window.
▸"How do you prioritize: backfill old data or keep processing new data?" New data first, always. Run backfills in a separate pipeline or lower-priority queue so current processing is not blocked.

KEY TAKEAWAYS

Say: 'Backfill means re-running the pipeline for a historical date range. It requires idempotent writes.'

Always scope the backfill first: which partitions are affected?

New data takes priority over backfills. Never block production processing.

Answer the incremental loading question that follows every pipeline design

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Full vs Incremental Loading, Change Data Capture, Slowly Changing Dimensions, Schema Evolution, Backfilling

Lesson Sections

Full vs Incremental Loading (concepts: paFullVsIncremental)
What They Want to Hear 'Full refresh drops the entire table and reloads from scratch. Incremental only processes rows that changed since the last run. I default to incremental because it is faster and cheaper, but I run a full refresh weekly as a safety net.' That is the answer. Two strategies, a default choice, and the safety valve.
Change Data Capture (concepts: paCdc)
What They Want to Hear 'CDC reads the database's own change log to capture every insert, update, and delete. Unlike timestamp-based incremental loading, CDC catches deletes and does not miss rows that changed between runs.' That is the answer. CDC solves the two biggest weaknesses of basic incremental loading: missed deletes and missed in-between changes. The Tools to Name-Drop
Slowly Changing Dimensions (concepts: paScdPipeline)
What They Want to Hear 'SCD Type 2 keeps a full history of changes. When a value changes, I close the current row by setting an end date, and insert a new row with the updated value and a new start date. This means I can always answer: what was the customer's address when they placed that order last March?' That is the answer. SCD is about preserving history so you can join facts to the dimension values that were true at the time.
Schema Evolution (concepts: paSchemaEvolution)
What They Want to Hear 'I classify schema changes as additive or breaking. Adding a new column is additive and should be handled automatically. Renaming or removing a column is breaking and requires a migration plan. My pipeline detects schema drift on each run and either auto-adapts for additive changes or alerts the team for breaking ones.' That is the framework. Additive vs breaking. Auto-handle vs alert.
Backfilling (concepts: paBackfill)
What They Want to Hear 'Backfilling means reprocessing historical data, usually because a bug corrupted it or a pipeline was down. I backfill by re-running the pipeline for a specific date range. The pipeline must be idempotent so that re-running produces the same result as running once. I process one partition at a time to avoid overloading the system.' That is the answer. Backfill = re-run, idempotency = safety, partition-by-partition = control.