Keeping Data Fresh: Intermediate

You said 'incremental loading.' The interviewer leans in: 'Walk me through your merge strategy.' Or: 'How do you handle a schema change mid-stream?' Or: 'Your CDC pipeline is 6 hours behind. What do you do?' These questions test whether you have operated incremental pipelines in production, not just described them.

What you will be able to do

Choose between MERGE, partition REPLACE, and append-then-deduplicate with tradeoff reasoning

Explain WAL-based vs trigger-based CDC and when each fits

Design a backfill that runs alongside production without disruption

Merge Strategies

Daily Life

Interviews

Pick the right merge strategy and defend it under pressure

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Walk me through your merge strategy."
▸"How do you handle upserts at scale?"
▸"MERGE is slow on your billion-row table. What do you do?"

What They Want to Hear

'I pick the merge strategy based on table size and access pattern. For tables under 100M rows, MERGE/UPSERT on the primary key is straightforward and correct. For larger tables, I use partition REPLACE: delete the entire partition for the date range, then insert fresh data. This avoids the row-level matching that makes MERGE slow at scale.' This is the answer that shows you have hit the performance wall and solved it.

Strategy	How It Works	Best For	Watch Out For
MERGE/UPSERT	Match on key, insert new, update existing	Tables < 100M rows, random updates	Slow at scale; full table scan to match
Partition REPLACE	Delete partition, insert fresh data	Date-partitioned tables, daily batch	Must reprocess entire partition
Append + deduplicate	Append all, deduplicate on read or in a later step	Immutable event stores, streaming sinks	Storage grows; read-time dedup costs

	DELETE
	FROM analytics.orders
	WHERE order_date = '2025-03-15' ; INSERT INTO analytics.orders

	SELECT *
	FROM staging.orders_delta
	WHERE order_date = '2025-03-15' ;

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if updates span multiple partitions?" Stage the delta into a temp table, then MERGE only against the affected partitions. This scopes the match to a fraction of the table.
▸"How do you guarantee consistency during a partition REPLACE?" Use a transaction. Delete and insert in the same transaction so readers never see a partial state. Delta Lake and Iceberg handle this natively with ACID.
▸"What about late-arriving data from yesterday landing in today's run?" Route it to the correct partition by the event timestamp, not the processing timestamp. Then the next backfill or partition REPLACE corrects it.

KEY TAKEAWAYS

Say: 'MERGE for small tables, partition REPLACE for large. I pick based on table size and update pattern.'

Partition REPLACE avoids the row-level scan that makes MERGE slow at billions of rows

Always wrap delete + insert in a transaction for consistency

CDC Patterns

Daily Life

Interviews

Explain CDC end-to-end and diagnose lag

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How does your CDC pipeline work end to end?"
▸"WAL-based or trigger-based: which do you prefer and why?"
▸"Your CDC replication is 6 hours behind. Diagnose it."

What They Want to Hear

'I use WAL-based CDC because it has near-zero impact on the source database. Debezium reads the Postgres WAL or MySQL binlog, streams change events to Kafka, and my pipeline consumes from Kafka to apply inserts, updates, and deletes to the target. I avoid trigger-based CDC because triggers add latency to every write on the source and are fragile at scale.' This is the answer that shows you have run CDC in production and understand the operational tradeoffs.

•WAL-Based CDC

Reads the database's existing change log
Near-zero impact on source performance
Captures all operations including deletes
Requires WAL retention config on the source

•Trigger-Based CDC

Fires custom code on every row change
Adds latency to every source write
Fragile: triggers can be disabled or forgotten
Works on databases without WAL access

Storage

Source Database

Transform

Debezium

Queue

Kafka Topic

Consumer

Pipeline Consumer

What to Whiteboard

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Your CDC pipeline is lagging 6 hours. What do you check?" Three things: (1) WAL retention: is the source database purging WAL segments before Debezium reads them? (2) Kafka consumer lag: is the consumer falling behind? (3) Target write throughput: is the MERGE or apply step the bottleneck?
▸"What happens when the source database does a DDL change like ALTER TABLE?" WAL-based CDC captures DDL events. Debezium emits a schema change event. Your consumer must handle it: auto-add new columns or alert for breaking changes.
▸"How do you handle the initial snapshot?" Debezium takes a snapshot of the full table first, then switches to streaming WAL changes. The snapshot is the baseline; the stream is the delta going forward.

KEY TAKEAWAYS

Say: 'WAL-based CDC with Debezium. Near-zero source impact, captures all operations including deletes.'

The three lag checkpoints: WAL retention, Kafka consumer lag, target write speed

Initial snapshot first, then WAL streaming. The snapshot is the baseline.

SCD in Pipelines

Daily Life

Interviews

Implement SCD Type 2 with MERGE and defend surrogate key design

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you implement SCD Type 2 in your pipeline?"
▸"Walk me through the MERGE logic for slowly changing dimensions."
▸"How do you generate surrogate keys in a distributed system?"

What They Want to Hear

'I implement SCD Type 2 with a MERGE statement that does two things: when a matching row's attributes have changed, it closes the current row by setting end_date and is_current = false, and inserts a new row with the updated values. The surrogate key is a hash of the business key plus the start_date, which makes it deterministic and idempotent.' This is the answer that shows you have built SCD pipelines, not just drawn them on a whiteboard.

	MERGE INTO dim_customer AS target
	USING staging_customer AS source
	ON target.customer_id = source.customer_id AND target.is_current = TRUE WHEN MATCHED AND(target.address != source.address OR target.tier != source.tier) THEN UPDATE SET end_date = CURRENT_DATE, is_current = FALSE ; INSERT INTO dim_customer




	SELECT
	MD5(customer_id \|\| CURRENT_DATE) AS surrogate_key,
	customer_id,
	name,
	address,
	tier,
	CURRENT_DATE AS start_date,
	'9999-12-31' AS end_date,
	TRUE AS is_current
	FROM staging_customer s
	WHERE NOT EXISTS(SELECT 1 FROM dim_customer d WHERE d.customer_id = s.customer_id AND d.is_current = TRUE AND d.address = s.address AND d.tier = s.tier) ;

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Why hash for surrogate keys instead of auto-increment?" Hashing is deterministic. Re-running the pipeline produces the same surrogate key, which makes the pipeline idempotent. Auto-increment generates different keys on re-run.
▸"What if you get multiple changes for the same customer in one batch?" Sort by change timestamp, apply in order. Only the final state becomes the current row. Intermediate states get compressed into closed historical rows.
▸"How do you test SCD logic?" Run the pipeline twice with the same input and verify the dimension table is identical. Then run with a known change and verify exactly one row closes and one new row appears.

KEY TAKEAWAYS

Say: 'MERGE to close changed rows, insert new versions. Hash surrogate keys for idempotency.'

Hash-based surrogate keys are deterministic. Auto-increment is not. This matters for re-runs.

Test SCD by running twice on the same input: the table should be identical

Schema Migration

Daily Life

Interviews

Design a schema migration strategy with backwards compatibility

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"A source team renames a column. How do you handle it?"
▸"How do you roll out a schema change across 50 downstream consumers?"
▸"What is backwards compatibility in the context of data pipelines?"

What They Want to Hear

'I enforce backwards compatibility by default. New columns are added with a default value. Old columns are never removed in the same release as the new ones: I deprecate first, migrate consumers, then remove. For breaking changes, I version the schema and run both versions in parallel during the migration window.' This is the answer that shows you think about consumers, not just your own pipeline.

Change Type	Strategy	Consumer Impact
Add column	Add with default value, notify consumers	None if default is set
Rename column	Add new name, copy data, deprecate old name, remove after migration	Requires consumer update
Remove column	Deprecation warning, migration window, then drop	Breaks consumers reading that column
Change data type	New column with new type, backfill, migrate consumers, drop old	Requires coordinated release

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you prevent a source team from making breaking changes without telling you?" Schema registry. The source publishes its schema. Your pipeline validates incoming data against the registered schema. Breaking changes fail validation before corrupting downstream tables.
▸"What about schema changes in event streams?" Avro with a schema registry (Confluent Schema Registry) enforces compatibility rules: BACKWARD, FORWARD, or FULL. The registry rejects schemas that break the compatibility mode.
▸"How long should a deprecation window be?" Long enough for every consumer to migrate. In practice, 2-4 weeks with active communication. Track who still reads the deprecated column and follow up.

KEY TAKEAWAYS

Say: 'Backwards compatible by default. Add, never remove in the same release. Schema registry for enforcement.'

Deprecate, migrate consumers, then remove. Never skip the middle step.

Schema registry prevents breaking changes from reaching your pipeline

Partition-Level Backfill

Daily Life

Interviews

Design a production-safe backfill strategy with Airflow

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"You need to backfill 90 days of data. Walk me through it."
▸"How do you backfill without impacting production?"
▸"How do you coordinate backfills with Airflow?"

What They Want to Hear

'I backfill at the partition level. Each partition is an independent unit of work: I can re-run it without affecting other partitions. In Airflow, I use the catchup feature or a dedicated backfill DAG with a configurable date range. I run backfills with lower priority than production tasks and validate each partition before moving to the next.' This is the answer that shows you have done this operationally, not theoretically.

Source

Configure Range

Transform

Process Partition

Quality

Validate + Next

Queue

Queue Partitions

What to Whiteboard

	# Airflow backfill command
	airflow dags backfill \
	--start-date 2025-01-01 \
	--end-date 2025-03-31 \
	--reset-dagruns \
	daily_orders_pipeline

	# Or: dedicated backfill DAG with date_range param
	# Runs at lower pool priority than production

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What if the backfill takes longer than the time until the next daily run?" Run backfills in a separate worker pool. Production DAGs have higher priority. If they conflict on the same partition, production wins and backfill retries.
▸"How do you handle downstream dependencies during a backfill?" Downstream tables are usually partitioned the same way. Backfill upstream first, then trigger downstream for the same date range. Do not let downstream run on stale upstream data.
▸"Can you do a rolling backfill that does not require downtime?" Yes. Partition REPLACE means each partition is atomic. Consumers querying other partitions are unaffected. Only the partition being replaced is briefly inconsistent.

KEY TAKEAWAYS

Say: 'Partition-level backfill. Each partition is independent, idempotent, and validated before moving on.'

Lower priority than production. Separate worker pool or queue.

Backfill upstream first, then downstream. Order matters.

Master the incremental loading patterns that interviewers probe hardest

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Merge Strategies, CDC Patterns, SCD in Pipelines, Schema Migration, Partition-Level Backfill

Lesson Sections

Merge Strategies (concepts: paFullVsIncremental)
What They Want to Hear 'I pick the merge strategy based on table size and access pattern. For tables under 100M rows, MERGE/UPSERT on the primary key is straightforward and correct. For larger tables, I use partition REPLACE: delete the entire partition for the date range, then insert fresh data. This avoids the row-level matching that makes MERGE slow at scale.' This is the answer that shows you have hit the performance wall and solved it.
CDC Patterns (concepts: paCdc)
What They Want to Hear 'I use WAL-based CDC because it has near-zero impact on the source database. Debezium reads the Postgres WAL or MySQL binlog, streams change events to Kafka, and my pipeline consumes from Kafka to apply inserts, updates, and deletes to the target. I avoid trigger-based CDC because triggers add latency to every write on the source and are fragile at scale.' This is the answer that shows you have run CDC in production and understand the operational tradeoffs.
SCD in Pipelines (concepts: paScdPipeline)
What They Want to Hear 'I implement SCD Type 2 with a MERGE statement that does two things: when a matching row's attributes have changed, it closes the current row by setting end_date and is_current = false, and inserts a new row with the updated values. The surrogate key is a hash of the business key plus the start_date, which makes it deterministic and idempotent.' This is the answer that shows you have built SCD pipelines, not just drawn them on a whiteboard.
Schema Migration (concepts: paSchemaEvolution)
What They Want to Hear 'I enforce backwards compatibility by default. New columns are added with a default value. Old columns are never removed in the same release as the new ones: I deprecate first, migrate consumers, then remove. For breaking changes, I version the schema and run both versions in parallel during the migration window.' This is the answer that shows you think about consumers, not just your own pipeline.
Partition-Level Backfill (concepts: paBackfill)
What They Want to Hear 'I backfill at the partition level. Each partition is an independent unit of work: I can re-run it without affecting other partitions. In Airflow, I use the catchup feature or a dedicated backfill DAG with a configurable date range. I run backfills with lower priority than production tasks and validate each partition before moving to the next.' This is the answer that shows you have done this operationally, not theoretically.