Capturing changes from source databases without breaking them is the core data engineering skill
Topics covered: "How Do You Get Data Out of the Source Database?", Log-Based CDC: Debezium and the WAL, Query-Based CDC and Its Limitations, The Dual-Write Problem, CDC Pipeline End-to-End
Subsections: What They're Really Testing, The Three Extraction Methods, The Unlock, Why Companies Care
At Uber, query-based CDC on the ride table missed soft deletes (canceled rides) because the application set a status flag without updating the updated_at column. At Netflix, full-load extraction of the content metadata table took 4 hours and blocked the source database's connection pool during peak streaming hours. At Stripe, log-based CDC (Debezium) captures every payment state change with sub-second laten
Debezium is the de facto standard for log-based CDC. It reads committed changes from the database's transaction log, through a logical replication slot on PostgreSQL or the binlog on MySQL, and publishes them as structured events to Kafka topics, one topic per table. Each event contains the before and after state of the row, plus metadata.
Subsections: Debezium Change Event Structure, Why Log-Based CDC Is Non-Invasive
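The before/after event shape described above can be illustrated with a minimal sketch. It assumes the event has already been unwrapped to the Debezium envelope fields `before`, `after`, `op`, `source`, and `ts_ms`. The `payments` rows, the `id` key, and the `apply_change_event` helper are hypothetical names chosen for illustration, and the in-memory dict stands in for a downstream replica.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]


def apply_change_event(replica: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory replica keyed by primary key."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    before: Optional[Row] = event.get("before")
    after: Optional[Row] = event.get("after")

    if op in ("c", "r", "u"):
        if after is None:
            raise ValueError(f"op {op!r} requires an 'after' image")
        replica[after[key]] = after  # upsert the new row state
    elif op == "d":
        if before is None:
            raise ValueError("delete requires a 'before' image")
        replica.pop(before[key], None)  # the delete arrives as an explicit event
    else:
        # Other op codes are out of scope for this sketch.
        raise ValueError(f"unsupported op {op!r}")


# Hypothetical events for a `payments` table (values are made up for illustration).
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "pending"},
     "source": {"table": "payments"}, "ts_ms": 1700000000000},
    {"op": "u", "before": {"id": 1, "status": "pending"}, "after": {"id": 1, "status": "captured"},
     "source": {"table": "payments"}, "ts_ms": 1700000000500},
    {"op": "c", "before": None, "after": {"id": 2, "status": "pending"},
     "source": {"table": "payments"}, "ts_ms": 1700000001000},
    {"op": "d", "before": {"id": 2, "status": "pending"}, "after": None,
     "source": {"table": "payments"}, "ts_ms": 1700000001500},
]

replica: Dict[Any, Row] = {}
for event in events:
    apply_change_event(replica, event)
```

Because every state transition, including the delete, arrives as its own event, the consumer can rebuild the table without ever querying the source.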
Query-based CDC is the naive approach that most candidates propose first. It works by querying for rows where updated_at > last_run_time. It is simple to implement but has fundamental limitations that the interviewer will probe.
Subsections: The Three Failures of Query-Based CDC, When Query-Based CDC Is Acceptable
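The polling query above can be tried against an in-memory SQLite database. This is a sketch under assumptions: the `rides` table, its columns, and the integer timestamps are hypothetical. The two blind spots it demonstrates (a write that skips updated_at, and a hard delete) are common illustrations and not necessarily the lesson's own list of failures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [(1, "requested", 100), (2, "requested", 100), (3, "requested", 100)],
)


def poll_changes(conn: sqlite3.Connection, last_run_time: int) -> list:
    """Query-based CDC: return rows whose updated_at is later than the previous watermark."""
    return conn.execute(
        "SELECT id, status, updated_at FROM rides WHERE updated_at > ? ORDER BY id",
        (last_run_time,),
    ).fetchall()


first_poll = poll_changes(conn, 0)  # the initial poll sees all three rows
watermark = 100                     # highest updated_at seen so far

# Three changes happen between polls:
conn.execute("UPDATE rides SET status = 'completed', updated_at = 200 WHERE id = 1")  # bumps updated_at
conn.execute("UPDATE rides SET status = 'canceled' WHERE id = 2")                     # forgets updated_at
conn.execute("DELETE FROM rides WHERE id = 3")                                        # hard delete

second_poll = poll_changes(conn, watermark)
print(second_poll)  # only ride 1 is reported; the cancel and the delete are invisible
```

Only one of the three changes is captured. The poll cannot see a row that no longer exists, and it trusts every writer to maintain updated_at.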
A dual write is when the application writes to two systems: the database and a message queue (or search index, or cache). It sounds reasonable, but it creates a consistency problem that has no simple fix. This is the trap the interviewer is waiting for.
Subsections: The Dual-Write Failure Modes, The Fix: Outbox Pattern, When Dual Writes Are Acceptable
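One way to picture the outbox pattern named above is a minimal sketch in which the business row and the outgoing message are written in the same database transaction, and a separate relay publishes from the outbox table afterwards. The `orders` and `outbox` tables, the `place_order` and `drain_outbox` helpers, and the `publish` callback are all hypothetical, and SQLite stands in for the application database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT NOT NULL, payload TEXT NOT NULL)"
)
conn.commit()


def place_order(conn: sqlite3.Connection, order_id: int, fail_after_insert: bool = False) -> None:
    """Write the order and its event atomically: both rows commit, or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        if fail_after_insert:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "status": "placed"})),
        )


def drain_outbox(conn: sqlite3.Connection, publish) -> int:
    """Relay: publish pending outbox rows in order, deleting each one after it is published."""
    rows = conn.execute("SELECT seq, topic, payload FROM outbox ORDER BY seq").fetchall()
    for seq, topic, payload in rows:
        publish(topic, json.loads(payload))  # a crash after this line causes a re-publish on retry
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)


place_order(conn, 1)
try:
    place_order(conn, 2, fail_after_insert=True)
except RuntimeError:
    pass  # the order row for id 2 was rolled back together with its missing event

published = []
drained = drain_outbox(conn, lambda topic, message: published.append((topic, message)))
```

Publishing before deleting gives at-least-once delivery, so consumers in this sketch would need to tolerate duplicates. A log-based CDC connector reading the outbox table can take the place of the polling relay.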
The interview closer: design a complete CDC pipeline from source database to warehouse, addressing every component. This is the 'narrate the data journey' answer that interviewers describe as the strongest possible signal.
Subsections: The Architecture, The Follow-Up Traps, Vocabulary That Signals Seniority