Keeping Data Fresh: Intermediate

You said 'incremental loading.' The interviewer leans in: 'Walk me through your merge strategy.' Or: 'How do you handle a schema change mid-stream?' Or: 'Your CDC pipeline is 6 hours behind. What do you do?' These questions test whether you have operated incremental pipelines in production, not just described them.

Merge Strategies

Daily Life
Interviews

Pick the right merge strategy and defend it under pressure

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Walk me through your merge strategy."
  • "How do you handle upserts at scale?"
  • "MERGE is slow on your billion-row table. What do you do?"

What They Want to Hear

'I pick the merge strategy based on table size and access pattern. For tables under 100M rows, MERGE/UPSERT on the primary key is straightforward and correct. For larger tables, I use partition REPLACE: delete the entire partition for the date range, then insert fresh data. This avoids the row-level matching that makes MERGE slow at scale.' This is the answer that shows you have hit the performance wall and solved it.
StrategyHow It WorksBest ForWatch Out For
MERGE/UPSERTMatch on key, insert new, update existingTables < 100M rows, random updatesSlow at scale; full table scan to match
Partition REPLACEDelete partition, insert fresh dataDate-partitioned tables, daily batchMust reprocess entire partition
Append + deduplicateAppend all, deduplicate on read or in a later stepImmutable event stores, streaming sinksStorage grows; read-time dedup costs
1DELETE
2FROM analytics.orders
3WHERE order_date = '2025-03-15' ; INSERT INTO analytics.orders
4
5SELECT *
6FROM staging.orders_delta
7WHERE order_date = '2025-03-15' ;
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if updates span multiple partitions?" Stage the delta into a temp table, then MERGE only against the affected partitions. This scopes the match to a fraction of the table.
  • "How do you guarantee consistency during a partition REPLACE?" Use a transaction. Delete and insert in the same transaction so readers never see a partial state. Delta Lake and Iceberg handle this natively with ACID.
  • "What about late-arriving data from yesterday landing in today's run?" Route it to the correct partition by the event timestamp, not the processing timestamp. Then the next backfill or partition REPLACE corrects it.
KEY TAKEAWAYS
Say: 'MERGE for small tables, partition REPLACE for large. I pick based on table size and update pattern.'
Partition REPLACE avoids the row-level scan that makes MERGE slow at billions of rows
Always wrap delete + insert in a transaction for consistency

CDC Patterns

Daily Life
Interviews

Explain CDC end-to-end and diagnose lag

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How does your CDC pipeline work end to end?"
  • "WAL-based or trigger-based: which do you prefer and why?"
  • "Your CDC replication is 6 hours behind. Diagnose it."

What They Want to Hear

'I use WAL-based CDC because it has near-zero impact on the source database. Debezium reads the Postgres WAL or MySQL binlog, streams change events to Kafka, and my pipeline consumes from Kafka to apply inserts, updates, and deletes to the target. I avoid trigger-based CDC because triggers add latency to every write on the source and are fragile at scale.' This is the answer that shows you have run CDC in production and understand the operational tradeoffs.
WAL-Based CDC
  • Reads the database's existing change log
  • Near-zero impact on source performance
  • Captures all operations including deletes
  • Requires WAL retention config on the source
Trigger-Based CDC
  • Fires custom code on every row change
  • Adds latency to every source write
  • Fragile: triggers can be disabled or forgotten
  • Works on databases without WAL access
What to Whiteboard
WAL streamchange eventsconsume + apply
Source Database
Postgres / MySQL
Debezium
Reads WAL / binlog
Kafka Topic
Change events (I/U/D)
Pipeline Consumer
Apply changes to target
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Your CDC pipeline is lagging 6 hours. What do you check?" Three things: (1) WAL retention: is the source database purging WAL segments before Debezium reads them? (2) Kafka consumer lag: is the consumer falling behind? (3) Target write throughput: is the MERGE or apply step the bottleneck?
  • "What happens when the source database does a DDL change like ALTER TABLE?" WAL-based CDC captures DDL events. Debezium emits a schema change event. Your consumer must handle it: auto-add new columns or alert for breaking changes.
  • "How do you handle the initial snapshot?" Debezium takes a snapshot of the full table first, then switches to streaming WAL changes. The snapshot is the baseline; the stream is the delta going forward.
KEY TAKEAWAYS
Say: 'WAL-based CDC with Debezium. Near-zero source impact, captures all operations including deletes.'
The three lag checkpoints: WAL retention, Kafka consumer lag, target write speed
Initial snapshot first, then WAL streaming. The snapshot is the baseline.

SCD in Pipelines

Daily Life
Interviews

Implement SCD Type 2 with MERGE and defend surrogate key design

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you implement SCD Type 2 in your pipeline?"
  • "Walk me through the MERGE logic for slowly changing dimensions."
  • "How do you generate surrogate keys in a distributed system?"

What They Want to Hear

'I implement SCD Type 2 with a MERGE statement that does two things: when a matching row's attributes have changed, it closes the current row by setting end_date and is_current = false, and inserts a new row with the updated values. The surrogate key is a hash of the business key plus the start_date, which makes it deterministic and idempotent.' This is the answer that shows you have built SCD pipelines, not just drawn them on a whiteboard.
1MERGE INTO dim_customer AS target
2 USING staging_customer AS source
3 ON target.customer_id = source.customer_id AND target.is_current = TRUE WHEN MATCHED AND(target.address != source.address OR target.tier != source.tier) THEN UPDATE SET end_date = CURRENT_DATE, is_current = FALSE ; INSERT INTO dim_customer
4
5
6
7
8SELECT
9 MD5(customer_id || CURRENT_DATE) AS surrogate_key,
10 customer_id,
11 name,
12 address,
13 tier,
14 CURRENT_DATE AS start_date,
15 '9999-12-31' AS end_date,
16 TRUE AS is_current
17FROM staging_customer s
18WHERE NOT EXISTS(SELECT 1 FROM dim_customer d WHERE d.customer_id = s.customer_id AND d.is_current = TRUE AND d.address = s.address AND d.tier = s.tier) ;
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Why hash for surrogate keys instead of auto-increment?" Hashing is deterministic. Re-running the pipeline produces the same surrogate key, which makes the pipeline idempotent. Auto-increment generates different keys on re-run.
  • "What if you get multiple changes for the same customer in one batch?" Sort by change timestamp, apply in order. Only the final state becomes the current row. Intermediate states get compressed into closed historical rows.
  • "How do you test SCD logic?" Run the pipeline twice with the same input and verify the dimension table is identical. Then run with a known change and verify exactly one row closes and one new row appears.
KEY TAKEAWAYS
Say: 'MERGE to close changed rows, insert new versions. Hash surrogate keys for idempotency.'
Hash-based surrogate keys are deterministic. Auto-increment is not. This matters for re-runs.
Test SCD by running twice on the same input: the table should be identical

Schema Migration

Daily Life
Interviews

Design a schema migration strategy with backwards compatibility

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "A source team renames a column. How do you handle it?"
  • "How do you roll out a schema change across 50 downstream consumers?"
  • "What is backwards compatibility in the context of data pipelines?"

What They Want to Hear

'I enforce backwards compatibility by default. New columns are added with a default value. Old columns are never removed in the same release as the new ones: I deprecate first, migrate consumers, then remove. For breaking changes, I version the schema and run both versions in parallel during the migration window.' This is the answer that shows you think about consumers, not just your own pipeline.
Change TypeStrategyConsumer Impact
Add columnAdd with default value, notify consumersNone if default is set
Rename columnAdd new name, copy data, deprecate old name, remove after migrationRequires consumer update
Remove columnDeprecation warning, migration window, then dropBreaks consumers reading that column
Change data typeNew column with new type, backfill, migrate consumers, drop oldRequires coordinated release
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you prevent a source team from making breaking changes without telling you?" Schema registry. The source publishes its schema. Your pipeline validates incoming data against the registered schema. Breaking changes fail validation before corrupting downstream tables.
  • "What about schema changes in event streams?" Avro with a schema registry (Confluent Schema Registry) enforces compatibility rules: BACKWARD, FORWARD, or FULL. The registry rejects schemas that break the compatibility mode.
  • "How long should a deprecation window be?" Long enough for every consumer to migrate. In practice, 2-4 weeks with active communication. Track who still reads the deprecated column and follow up.
KEY TAKEAWAYS
Say: 'Backwards compatible by default. Add, never remove in the same release. Schema registry for enforcement.'
Deprecate, migrate consumers, then remove. Never skip the middle step.
Schema registry prevents breaking changes from reaching your pipeline

Partition-Level Backfill

Daily Life
Interviews

Design a production-safe backfill strategy with Airflow

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "You need to backfill 90 days of data. Walk me through it."
  • "How do you backfill without impacting production?"
  • "How do you coordinate backfills with Airflow?"

What They Want to Hear

'I backfill at the partition level. Each partition is an independent unit of work: I can re-run it without affecting other partitions. In Airflow, I use the catchup feature or a dedicated backfill DAG with a configurable date range. I run backfills with lower priority than production tasks and validate each partition before moving to the next.' This is the answer that shows you have done this operationally, not theoretically.
What to Whiteboard
one at a timenext partition
Configure Range
start_date, end_date, concurrency
Queue Partitions
90 partitions, 3 at a time
Process Partition
Same logic as daily run
Validate + Next
Row count, checksum
1# Airflow backfill command
2airflow dags backfill \
3 --start-date 2025-01-01 \
4 --end-date 2025-03-31 \
5 --reset-dagruns \
6 daily_orders_pipeline
7
8# Or: dedicated backfill DAG with date_range param
9# Runs at lower pool priority than production
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What if the backfill takes longer than the time until the next daily run?" Run backfills in a separate worker pool. Production DAGs have higher priority. If they conflict on the same partition, production wins and backfill retries.
  • "How do you handle downstream dependencies during a backfill?" Downstream tables are usually partitioned the same way. Backfill upstream first, then trigger downstream for the same date range. Do not let downstream run on stale upstream data.
  • "Can you do a rolling backfill that does not require downtime?" Yes. Partition REPLACE means each partition is atomic. Consumers querying other partitions are unaffected. Only the partition being replaced is briefly inconsistent.
KEY TAKEAWAYS
Say: 'Partition-level backfill. Each partition is independent, idempotent, and validated before moving on.'
Lower priority than production. Separate worker pool or queue.
Backfill upstream first, then downstream. Order matters.

Master the incremental loading patterns that interviewers probe hardest

Category
Pipeline Architecture
Difficulty
intermediate
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Merge Strategies, CDC Patterns, SCD in Pipelines, Schema Migration, Partition-Level Backfill

Lesson Sections

  1. Merge Strategies (concepts: paFullVsIncremental)

    What They Want to Hear 'I pick the merge strategy based on table size and access pattern. For tables under 100M rows, MERGE/UPSERT on the primary key is straightforward and correct. For larger tables, I use partition REPLACE: delete the entire partition for the date range, then insert fresh data. This avoids the row-level matching that makes MERGE slow at scale.' This is the answer that shows you have hit the performance wall and solved it.

  2. CDC Patterns (concepts: paCdc)

    What They Want to Hear 'I use WAL-based CDC because it has near-zero impact on the source database. Debezium reads the Postgres WAL or MySQL binlog, streams change events to Kafka, and my pipeline consumes from Kafka to apply inserts, updates, and deletes to the target. I avoid trigger-based CDC because triggers add latency to every write on the source and are fragile at scale.' This is the answer that shows you have run CDC in production and understand the operational tradeoffs.

  3. SCD in Pipelines (concepts: paScdPipeline)

    What They Want to Hear 'I implement SCD Type 2 with a MERGE statement that does two things: when a matching row's attributes have changed, it closes the current row by setting end_date and is_current = false, and inserts a new row with the updated values. The surrogate key is a hash of the business key plus the start_date, which makes it deterministic and idempotent.' This is the answer that shows you have built SCD pipelines, not just drawn them on a whiteboard.

  4. Schema Migration (concepts: paSchemaEvolution)

    What They Want to Hear 'I enforce backwards compatibility by default. New columns are added with a default value. Old columns are never removed in the same release as the new ones: I deprecate first, migrate consumers, then remove. For breaking changes, I version the schema and run both versions in parallel during the migration window.' This is the answer that shows you think about consumers, not just your own pipeline.

  5. Partition-Level Backfill (concepts: paBackfill)

    What They Want to Hear 'I backfill at the partition level. Each partition is an independent unit of work: I can re-run it without affecting other partitions. In Airflow, I use the catchup feature or a dedicated backfill DAG with a configurable date range. I run backfills with lower priority than production tasks and validate each partition before moving to the next.' This is the answer that shows you have done this operationally, not theoretically.