Keeping Data Fresh: Intermediate
You said 'incremental loading.' The interviewer leans in: 'Walk me through your merge strategy.' Or: 'How do you handle a schema change mid-stream?' Or: 'Your CDC pipeline is 6 hours behind. What do you do?' These questions test whether you have operated incremental pipelines in production, not just described them.
Merge Strategies
Pick the right merge strategy and defend it under pressure
When you hear these in an interview, this is the concept being tested
- ▸"Walk me through your merge strategy."
- ▸"How do you handle upserts at scale?"
- ▸"MERGE is slow on your billion-row table. What do you do?"
What They Want to Hear
| Strategy | How It Works | Best For | Watch Out For |
|---|---|---|---|
| MERGE/UPSERT | Match on key, insert new, update existing | Tables < 100M rows, random updates | Slow at scale; full table scan to match |
| Partition REPLACE | Delete partition, insert fresh data | Date-partitioned tables, daily batch | Must reprocess entire partition |
| Append + deduplicate | Append all, deduplicate on read or in a later step | Immutable event stores, streaming sinks | Storage grows; read-time dedup costs |
After your initial answer, expect these probes
- ▸"What if updates span multiple partitions?" Stage the delta into a temp table, then MERGE only against the affected partitions. This scopes the match to a fraction of the table.
- ▸"How do you guarantee consistency during a partition REPLACE?" Use a transaction. Delete and insert in the same transaction so readers never see a partial state. Delta Lake and Iceberg handle this natively with ACID.
- ▸"What about late-arriving data from yesterday landing in today's run?" Route it to the correct partition by the event timestamp, not the processing timestamp. Then the next backfill or partition REPLACE corrects it.
CDC Patterns
Explain CDC end-to-end and diagnose lag
When you hear these in an interview, this is the concept being tested
- ▸"How does your CDC pipeline work end to end?"
- ▸"WAL-based or trigger-based: which do you prefer and why?"
- ▸"Your CDC replication is 6 hours behind. Diagnose it."
What They Want to Hear
- Reads the database's existing change log
- Near-zero impact on source performance
- Captures all operations including deletes
- Requires WAL retention config on the source
- Fires custom code on every row change
- Adds latency to every source write
- Fragile: triggers can be disabled or forgotten
- Works on databases without WAL access
After your initial answer, expect these probes
- ▸"Your CDC pipeline is lagging 6 hours. What do you check?" Three things: (1) WAL retention: is the source database purging WAL segments before Debezium reads them? (2) Kafka consumer lag: is the consumer falling behind? (3) Target write throughput: is the MERGE or apply step the bottleneck?
- ▸"What happens when the source database does a DDL change like ALTER TABLE?" WAL-based CDC captures DDL events. Debezium emits a schema change event. Your consumer must handle it: auto-add new columns or alert for breaking changes.
- ▸"How do you handle the initial snapshot?" Debezium takes a snapshot of the full table first, then switches to streaming WAL changes. The snapshot is the baseline; the stream is the delta going forward.
SCD in Pipelines
Implement SCD Type 2 with MERGE and defend surrogate key design
When you hear these in an interview, this is the concept being tested
- ▸"How do you implement SCD Type 2 in your pipeline?"
- ▸"Walk me through the MERGE logic for slowly changing dimensions."
- ▸"How do you generate surrogate keys in a distributed system?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"Why hash for surrogate keys instead of auto-increment?" Hashing is deterministic. Re-running the pipeline produces the same surrogate key, which makes the pipeline idempotent. Auto-increment generates different keys on re-run.
- ▸"What if you get multiple changes for the same customer in one batch?" Sort by change timestamp, apply in order. Only the final state becomes the current row. Intermediate states get compressed into closed historical rows.
- ▸"How do you test SCD logic?" Run the pipeline twice with the same input and verify the dimension table is identical. Then run with a known change and verify exactly one row closes and one new row appears.
Schema Migration
Design a schema migration strategy with backwards compatibility
When you hear these in an interview, this is the concept being tested
- ▸"A source team renames a column. How do you handle it?"
- ▸"How do you roll out a schema change across 50 downstream consumers?"
- ▸"What is backwards compatibility in the context of data pipelines?"
What They Want to Hear
| Change Type | Strategy | Consumer Impact |
|---|---|---|
| Add column | Add with default value, notify consumers | None if default is set |
| Rename column | Add new name, copy data, deprecate old name, remove after migration | Requires consumer update |
| Remove column | Deprecation warning, migration window, then drop | Breaks consumers reading that column |
| Change data type | New column with new type, backfill, migrate consumers, drop old | Requires coordinated release |
After your initial answer, expect these probes
- ▸"How do you prevent a source team from making breaking changes without telling you?" Schema registry. The source publishes its schema. Your pipeline validates incoming data against the registered schema. Breaking changes fail validation before corrupting downstream tables.
- ▸"What about schema changes in event streams?" Avro with a schema registry (Confluent Schema Registry) enforces compatibility rules: BACKWARD, FORWARD, or FULL. The registry rejects schemas that break the compatibility mode.
- ▸"How long should a deprecation window be?" Long enough for every consumer to migrate. In practice, 2-4 weeks with active communication. Track who still reads the deprecated column and follow up.
Partition-Level Backfill
Design a production-safe backfill strategy with Airflow
When you hear these in an interview, this is the concept being tested
- ▸"You need to backfill 90 days of data. Walk me through it."
- ▸"How do you backfill without impacting production?"
- ▸"How do you coordinate backfills with Airflow?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"What if the backfill takes longer than the time until the next daily run?" Run backfills in a separate worker pool. Production DAGs have higher priority. If they conflict on the same partition, production wins and backfill retries.
- ▸"How do you handle downstream dependencies during a backfill?" Downstream tables are usually partitioned the same way. Backfill upstream first, then trigger downstream for the same date range. Do not let downstream run on stale upstream data.
- ▸"Can you do a rolling backfill that does not require downtime?" Yes. Partition REPLACE means each partition is atomic. Consumers querying other partitions are unaffected. Only the partition being replaced is briefly inconsistent.
Master the incremental loading patterns that interviewers probe hardest
- Category
- Pipeline Architecture
- Difficulty
- intermediate
- Duration
- 25 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Merge Strategies, CDC Patterns, SCD in Pipelines, Schema Migration, Partition-Level Backfill
Lesson Sections
- Merge Strategies (concepts: paFullVsIncremental)
What They Want to Hear 'I pick the merge strategy based on table size and access pattern. For tables under 100M rows, MERGE/UPSERT on the primary key is straightforward and correct. For larger tables, I use partition REPLACE: delete the entire partition for the date range, then insert fresh data. This avoids the row-level matching that makes MERGE slow at scale.' This is the answer that shows you have hit the performance wall and solved it.
- CDC Patterns (concepts: paCdc)
What They Want to Hear 'I use WAL-based CDC because it has near-zero impact on the source database. Debezium reads the Postgres WAL or MySQL binlog, streams change events to Kafka, and my pipeline consumes from Kafka to apply inserts, updates, and deletes to the target. I avoid trigger-based CDC because triggers add latency to every write on the source and are fragile at scale.' This is the answer that shows you have run CDC in production and understand the operational tradeoffs.
- SCD in Pipelines (concepts: paScdPipeline)
What They Want to Hear 'I implement SCD Type 2 with a MERGE statement that does two things: when a matching row's attributes have changed, it closes the current row by setting end_date and is_current = false, and inserts a new row with the updated values. The surrogate key is a hash of the business key plus the start_date, which makes it deterministic and idempotent.' This is the answer that shows you have built SCD pipelines, not just drawn them on a whiteboard.
- Schema Migration (concepts: paSchemaEvolution)
What They Want to Hear 'I enforce backwards compatibility by default. New columns are added with a default value. Old columns are never removed in the same release as the new ones: I deprecate first, migrate consumers, then remove. For breaking changes, I version the schema and run both versions in parallel during the migration window.' This is the answer that shows you think about consumers, not just your own pipeline.
- Partition-Level Backfill (concepts: paBackfill)
What They Want to Hear 'I backfill at the partition level. Each partition is an independent unit of work: I can re-run it without affecting other partitions. In Airflow, I use the catchup feature or a dedicated backfill DAG with a configurable date range. I run backfills with lower priority than production tasks and validate each partition before moving to the next.' This is the answer that shows you have done this operationally, not theoretically.