Change data capture (CDC) pipeline interview questions for data engineer roles. Debezium watching Postgres WAL or MySQL binlog. Kafka Connect S3 Sink for raw immutable change events. Spark daily ETL with composite-key dedup. Snowflake MERGE INTO on (pk, run_id) for idempotent upsert. The canonical pattern for syncing transactional data to a warehouse.

Change data capture (CDC) pipeline design is the canonical pattern for syncing transactional source databases to an analytical warehouse without batching SELECT * every night. CDC reads the database's transaction log (Postgres write-ahead log, MySQL binlog, MongoDB oplog, SQL Server CDC) and emits change events (insert, update, delete) downstream. Modern CDC stacks in 2026 use Debezium as the connector framework, Kafka as the event bus, Kafka Connect S3 Sink for raw landing, Spark for transformation, and Snowflake MERGE INTO for warehouse upsert.

The canonical CDC architecture for a data engineer interview round. Step 1: Debezium connector watches the source database transaction log. For Postgres, this requires logical replication slots (configured with wal_level = logical, max_replication_slots). Debezium emits change events to Kafka with one topic per source table. Each event includes the before-state, after-state, and operation type. Step 2: Kafka topic per table with retention 7 days for replay. Step 3: Kafka Connect S3 Sink writes raw immutable change events to S3 partitioned by source table and date. Step 4: Spark daily ETL reads the S3 partitions, applies dedup on the composite natural key (table_pk, op_ts), and writes to a staging schema in Snowflake. Step 5: Snowflake MERGE INTO upserts from the staging schema to the production gold tables, with run_id baked into the output for backfill safety.

Five recurring failure modes in CDC pipeline design rounds. Debezium falls behind: the source database commits faster than Debezium can read the WAL; replication lag grows. Mitigation: monitor replication lag, scale Debezium connector instances, restart from snapshot if lag becomes unrecoverable. Schema change at source: a column is added or renamed in the source table; the Debezium event schema diverges from the destination. Mitigation: schema registry (Confluent Schema Registry or AWS Glue Schema Registry) enforces schema evolution rules, alerts on incompatible changes. Out-of-order events: rare with single-partition topics keyed by primary key, common with multi-partition topics. Mitigation: key the topic by primary key so all events for one row land on the same partition (guarantees in-order per row). Dedup needed: at-least-once delivery means the same change event can arrive twice; dedup on (pk, op_ts) with ROW_NUMBER, keep first. Snowflake MERGE deadlock: concurrent writer holds the table lock. Mitigation: serialize via Snowflake task or queue, or use Snowflake's auto-commit pattern with smaller transactions.

The MERGE INTO pattern for idempotent CDC upsert. MERGE INTO prod_table p USING staging_table s ON p.pk = s.pk WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE WHEN MATCHED AND s.op_type IN ('UPDATE', 'INSERT') THEN UPDATE SET col1 = s.col1, col2 = s.col2, updated_at = s.op_ts WHEN NOT MATCHED AND s.op_type IN ('UPDATE', 'INSERT') THEN INSERT (pk, col1, col2, updated_at) VALUES (s.pk, s.col1, s.col2, s.op_ts). The op_ts is the source database's commit timestamp; using it for updated_at preserves source ordering. Run_id can be baked into the staging table for backfill: re-running with the same run_id produces the same upsert result.

CDC Pipeline Interview Questions

Change data capture pipeline design for data engineer interview prep.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).

Pipeline Architecture (123)

Common questions

What is the canonical CDC pipeline architecture for a data engineer interview?
Debezium connector watches the source database transaction log (Postgres WAL, MySQL binlog) and emits change events to Kafka. Kafka Connect S3 Sink writes raw immutable events to S3 partitioned by source table and date. Spark daily ETL reads S3, dedups on (pk, op_ts), writes to Snowflake staging. Snowflake MERGE INTO on the staging schema upserts to production gold with run_id baked in.
Why use Debezium instead of polling SELECT MAX(updated_at)?
Three reasons. Debezium captures every change including soft deletes and intermediate updates between polls. Debezium does not load the source database with repeated SELECT queries. Debezium provides at-least-once delivery with snapshot recovery for new tables. Polling misses deletes (the row is gone), misses intermediate updates (only the latest state is visible), and adds load to the source production database.
How does a data engineer handle schema changes in a CDC pipeline?
Schema registry (Confluent Schema Registry, AWS Glue Schema Registry) enforces schema evolution rules at the producer side. Additive changes (new nullable column) propagate automatically. Breaking changes (renamed column, narrowed type) refuse to publish. Downstream bronze layer keeps raw payload as JSON for replay after writing a new mapping. Producer-side contract tests in CI are worth more than any amount of pipeline-side validation.
How does a data engineer ensure CDC events are processed in order?
Key the Kafka topic by primary key so all events for one row land on the same partition. Within a partition, Kafka preserves order. Multi-partition topics keyed by something other than primary key (or unkeyed) can produce out-of-order events for the same row, which breaks the upsert semantics. Use op_ts in the MERGE clause to handle the rare out-of-order case.
What is the MERGE INTO pattern for idempotent CDC upsert?
MERGE INTO prod_table p USING staging_table s ON p.pk = s.pk WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE WHEN MATCHED AND s.op_type IN ('UPDATE', 'INSERT') THEN UPDATE SET col = s.col, updated_at = s.op_ts WHEN NOT MATCHED AND s.op_type IN ('UPDATE', 'INSERT') THEN INSERT VALUES (s.pk, s.col, s.op_ts). The op_ts is the source commit timestamp; using it preserves source ordering. Re-running with the same staging data produces the same result.
How does a data engineer handle Debezium falling behind on replication?
Monitor replication lag (Debezium exposes metrics; CloudWatch or Prometheus alerts on lag thresholds). Scale connector instances if lag grows. If lag becomes unrecoverable (Debezium has fallen so far behind that the WAL has rotated past its position), restart from snapshot: Debezium re-snapshots the current state of every table, then resumes streaming from the current WAL position. The bronze layer's raw events allow replay over the snapshot gap.
What is the difference between CDC and ELT?
CDC is a specific mechanism for capturing changes from a source database. ELT is a pattern for loading raw data into a warehouse and transforming inside the warehouse. CDC is often the source of an ELT pipeline: Debezium captures changes, lands raw events in S3 bronze, dbt or Spark transforms inside the warehouse to silver and gold. ELT can also use non-CDC sources (event streams, API pulls, batch dumps); CDC is one input mechanism.
Can a CDC pipeline achieve exactly-once semantics?
Exactly-once effect, yes. Pure message-level exactly-once is impossible across producer-broker-consumer boundaries. The data engineer achieves exactly-once effect with at-least-once delivery (Debezium with snapshot recovery, Kafka with replication) plus idempotent processing (dedup on pk+op_ts, MERGE INTO with run_id, downstream tables FK to source pk). Re-running the pipeline produces the same warehouse state.