The staff-level incremental loading questions that separate a hire from a strong hire
What They Want to Hear: 'I run a cost crossover analysis. Incremental is cheaper when the delta is small relative to the full table. But when the delta exceeds roughly 30-40% of the table, full refresh is actually cheaper because it avoids the matching overhead. My default: incremental daily with a scheduled full refresh weekly. For tables with high churn, I adjust the crossover threshold based on observed merge duration vs full reload duration.' This is the answer that shows you think about incremental loading as a cost decision, not a default.
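The crossover logic above can be sketched as a small decision helper. The per-million-row costs and the 35% churn threshold here are illustrative assumptions, not benchmarks; in practice they come from observed merge and reload durations:

```python
def choose_load_strategy(delta_rows: int, table_rows: int,
                         merge_secs_per_mrow: float = 90.0,
                         reload_secs_per_mrow: float = 25.0,
                         churn_threshold: float = 0.35) -> str:
    """Return 'incremental' or 'full_refresh' for this run.

    Rates are hypothetical: merging pays matching overhead per delta row,
    while a full reload pays a lower per-row rate over the whole table.
    """
    if table_rows == 0:
        return "full_refresh"
    churn = delta_rows / table_rows
    merge_cost = (delta_rows / 1e6) * merge_secs_per_mrow
    reload_cost = (table_rows / 1e6) * reload_secs_per_mrow
    # Full refresh wins when churn crosses the threshold or the merge
    # is estimated to take longer than rewriting the table outright.
    if churn > churn_threshold or merge_cost > reload_cost:
        return "full_refresh"
    return "incremental"

print(choose_load_strategy(2_000_000, 100_000_000))   # 2% churn -> incremental
print(choose_load_strategy(45_000_000, 100_000_000))  # 45% churn -> full_refresh
```

Tuning the threshold per table, as the answer suggests, amounts to re-fitting the two rate parameters from recent run durations.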
What They Want to Hear: 'Each source gets its own CDC connector, its own Kafka topic, and its own consumer. Failure isolation is the design principle: one source lagging does not block the others. I monitor three metrics per source: replication lag, event throughput, and error rate. When one falls behind, I diagnose independently: is it the WAL, the connector, or the consumer? Then I scale that one connector without touching the others.' This is the answer that shows you have operated CDC as a platform, not a one-off pipeline.
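The per-source diagnosis can be sketched as below. The metric names, SLA values, and source names are hypothetical; the point is that each source is evaluated in isolation, so one lagging source produces findings without affecting the others:

```python
from dataclasses import dataclass

@dataclass
class SourceMetrics:
    replication_lag_secs: float
    events_per_sec: float
    error_rate: float  # fraction of events that failed processing

def diagnose(name: str, m: SourceMetrics,
             lag_sla: float = 300.0, min_throughput: float = 10.0,
             max_error_rate: float = 0.01) -> list:
    """Return findings for one source; other sources are never consulted."""
    findings = []
    if m.replication_lag_secs > lag_sla:
        findings.append(f"{name}: lag {m.replication_lag_secs:.0f}s exceeds SLA")
    if m.events_per_sec < min_throughput:
        findings.append(f"{name}: throughput below floor, check connector")
    if m.error_rate > max_error_rate:
        findings.append(f"{name}: error rate {m.error_rate:.1%}, check consumer")
    return findings

# Hypothetical fleet: one lagging source, one healthy one.
fleet = {
    "orders_db": SourceMetrics(1200.0, 850.0, 0.001),
    "users_db": SourceMetrics(4.0, 300.0, 0.0),
}
for name, metrics in fleet.items():
    for finding in diagnose(name, metrics):
        print(finding)
```

Because each check reads only that source's metrics, scaling the remediation to one connector follows naturally from the findings.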
What They Want to Hear: 'In streaming, SCD Type 2 becomes a stateful operation. Each change event is compared against the current state in a key-value store. If the tracked attributes differ, the consumer emits a close event for the old version and an open event for the new version. The challenge is ordering: out-of-order events can close a row that was already updated by a later event. I handle this with event-time ordering and a grace period before finalizing row closures.' This is the answer that shows you understand streaming SCD Type 2 as a state and ordering problem, not just a modeling pattern.
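The compare-and-emit step might look like the sketch below. The in-memory dict stands in for the key-value state store, and the grace-period buffering is elided; here a late event is simply dropped by the event-time check:

```python
def apply_change(state: dict, key: str, event_time: int, attrs: dict) -> list:
    """Compare one change event against current state for this key.

    Emits 'close'/'open' row events for SCD Type 2. A production consumer
    would buffer events for a grace period before finalizing closures;
    this sketch only guards against out-of-order events via event time.
    """
    out = []
    current = state.get(key)
    # Drop events at or before the version we already hold (late arrivals).
    if current and event_time <= current["valid_from"]:
        return out
    if current and current["attrs"] != attrs:
        out.append({"op": "close", "key": key,
                    "valid_from": current["valid_from"], "valid_to": event_time})
    if current is None or current["attrs"] != attrs:
        out.append({"op": "open", "key": key, "valid_from": event_time})
        state[key] = {"valid_from": event_time, "attrs": attrs}
    return out

state = {}
print(apply_change(state, "c1", 100, {"tier": "gold"}))      # opens v1
print(apply_change(state, "c1", 200, {"tier": "platinum"}))  # closes v1, opens v2
print(apply_change(state, "c1", 150, {"tier": "silver"}))    # late event, dropped
```

The grace period mentioned in the answer would replace the hard drop: late events within the window get merged into the ordering buffer before closures are finalized.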
What They Want to Hear: 'I treat schema evolution as a platform service, not a per-pipeline concern. Producers publish a schema contract that defines the fields, types, and compatibility guarantees. Consumers register their dependencies. The platform enforces compatibility rules at publish time: if a proposed change would break a registered consumer, the publish is rejected. This shifts schema validation from runtime failures to build-time rejections.' This is the answer that shows you think about schema evolution as a platform concern, not a per-pipeline fix.
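Publish-time enforcement can be sketched as a compatibility check against registered consumer dependencies. The dict-based schemas and the consumer registry here are stand-ins for a real schema registry; field names are made up:

```python
class IncompatibleSchemaError(Exception):
    pass

def check_backward_compatible(current: dict, proposed: dict,
                              consumer_deps: dict) -> None:
    """Reject a proposed schema that drops or retypes a field a consumer reads.

    Schemas are field-name -> type dicts; consumer_deps maps each registered
    consumer to the set of fields it depends on.
    """
    for consumer, fields in consumer_deps.items():
        for field in fields:
            if field not in proposed:
                raise IncompatibleSchemaError(
                    f"'{field}' removed but required by consumer '{consumer}'")
            if field in current and current[field] != proposed[field]:
                raise IncompatibleSchemaError(
                    f"'{field}' retyped {current[field]} -> {proposed[field]}, "
                    f"breaks consumer '{consumer}'")

current = {"user_id": "long", "email": "string", "signup_ts": "timestamp"}
deps = {"marketing_etl": {"user_id", "email"}}

# Dropping an unused field is accepted; dropping a depended-on field is rejected.
check_backward_compatible(current, {"user_id": "long", "email": "string"}, deps)
print("change accepted")
try:
    check_backward_compatible(current, {"user_id": "long"}, deps)
except IncompatibleSchemaError as e:
    print("rejected:", e)
```

Running this in CI against the registry is what turns runtime consumer failures into build-time rejections, as the answer describes.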
What They Want to Hear: 'At petabyte scale, backfill is a project, not a task. I start with a cost estimate: compute hours, storage reads, and expected duration. Then I design progressive backfill: process the most recent data first so consumers get value immediately, then work backwards in priority order. I set a daily cost cap and adjust concurrency to stay within budget. Each partition writes to a shadow table first; only after validation does it swap into production.' This is the answer that shows you treat large-scale backfill as a budgeted project, not a one-shot job.
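The progressive, budget-capped scheduling can be sketched as below. The partition names, per-partition cost, and daily budget are illustrative assumptions; the shadow-table write and validation swap happen inside each day's batch and are not shown:

```python
def plan_backfill_days(partitions: list, cost_per_partition: float,
                       daily_budget: float) -> list:
    """Group partitions into daily batches, most recent first.

    Concurrency per day is derived from the cost cap, matching the
    'set a daily cost cap and adjust concurrency' approach.
    """
    per_day = max(1, int(daily_budget // cost_per_partition))
    ordered = sorted(partitions, reverse=True)  # newest data delivers value first
    return [ordered[i:i + per_day] for i in range(0, len(ordered), per_day)]

# Hypothetical monthly partitions for one year, $400/partition, $2000/day cap.
partitions = [f"2024-{m:02d}" for m in range(1, 13)]
plan = plan_backfill_days(partitions, cost_per_partition=400.0,
                          daily_budget=2000.0)
print(len(plan), "days")  # 12 partitions at 5/day -> 3 days
print(plan[0])            # the five most recent months run first
```

Each batch would write to a shadow table, validate, and swap, so a failed day never touches production data.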