The staff-level incremental loading questions that separate a hire from a strong hire
What They Want to Hear: 'I run a cost crossover analysis. Incremental is cheaper when the delta is small relative to the full table. But when the delta exceeds roughly 30-40% of the table, full refresh is actually cheaper because it avoids the matching overhead. My default: incremental daily with a scheduled full refresh weekly. For tables with high churn, I adjust the crossover threshold based on observed merge duration vs full reload duration.' This is the answer that shows you think about incremental loading as a cost decision, not a default.
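The crossover logic above can be sketched as a small decision helper. The per-million-row costs and the 35% churn threshold here are illustrative assumptions, not benchmarks; in practice they come from observed merge and reload durations:

```python
def choose_load_strategy(delta_rows: int, table_rows: int,
                         merge_secs_per_mrow: float = 90.0,
                         reload_secs_per_mrow: float = 25.0,
                         churn_threshold: float = 0.35) -> str:
    """Return 'incremental' or 'full_refresh' for this run.

    Rates are hypothetical: merging pays matching overhead per delta row,
    while a full reload pays a lower per-row rate over the whole table.
    """
    if table_rows == 0:
        return "full_refresh"
    churn = delta_rows / table_rows
    merge_cost = (delta_rows / 1e6) * merge_secs_per_mrow
    reload_cost = (table_rows / 1e6) * reload_secs_per_mrow
    # Full refresh wins when churn crosses the threshold or the merge
    # is estimated to take longer than rewriting the table outright.
    if churn > churn_threshold or merge_cost > reload_cost:
        return "full_refresh"
    return "incremental"

print(choose_load_strategy(2_000_000, 100_000_000))   # 2% churn -> incremental
print(choose_load_strategy(45_000_000, 100_000_000))  # 45% churn -> full_refresh
```

Tuning the threshold per table, as the answer suggests, amounts to re-fitting the two rate parameters from recent run durations.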
What They Want to Hear: 'Each source gets its own CDC connector, its own Kafka topic, and its own consumer. Failure isolation is the design principle: one source lagging does not block the others. I monitor three metrics per source: replication lag, event throughput, and error rate. When one falls behind, I diagnose independently: is it the WAL, the connector, or the consumer? Then I scale that one connector without touching the others.' This is the answer that shows you have operated CDC as a platform, not a one-off pipeline.
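The per-source diagnosis can be sketched as below. The metric names, SLA values, and source names are hypothetical; the point is that each source is evaluated in isolation, so one lagging source produces findings without affecting the others:

```python
from dataclasses import dataclass

@dataclass
class SourceMetrics:
    replication_lag_secs: float
    events_per_sec: float
    error_rate: float  # fraction of events that failed processing

def diagnose(name: str, m: SourceMetrics,
             lag_sla: float = 300.0, min_throughput: float = 10.0,
             max_error_rate: float = 0.01) -> list:
    """Return findings for one source; other sources are never consulted."""
    findings = []
    if m.replication_lag_secs > lag_sla:
        findings.append(f"{name}: lag {m.replication_lag_secs:.0f}s exceeds SLA")
    if m.events_per_sec < min_throughput:
        findings.append(f"{name}: throughput below floor, check connector")
    if m.error_rate > max_error_rate:
        findings.append(f"{name}: error rate {m.error_rate:.1%}, check consumer")
    return findings

# Hypothetical fleet: one lagging source, one healthy one.
fleet = {
    "orders_db": SourceMetrics(1200.0, 850.0, 0.001),
    "users_db": SourceMetrics(4.0, 300.0, 0.0),
}
for name, metrics in fleet.items():
    for finding in diagnose(name, metrics):
        print(finding)
```

Because each check reads only that source's metrics, scaling the remediation to one connector follows naturally from the findings.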
What They Want to Hear: 'In streaming, SCD Type 2 becomes a stateful operation. Each change event is compared against the current state in a key-value store. If the tracked attributes differ, the consumer emits a close event for the old version and an open event for the new version. The challenge is ordering: out-of-order events can close a row that was already updated by a later event. I handle this with event-time ordering and a grace period before finalizing row closures.' This is the answer that shows you understand streaming SCD Type 2 as a state and ordering problem, not just a modeling pattern.
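The compare-and-emit step might look like the sketch below. The in-memory dict stands in for the key-value state store, and the grace-period buffering is elided; here a late event is simply dropped by the event-time check:

```python
def apply_change(state: dict, key: str, event_time: int, attrs: dict) -> list:
    """Compare one change event against current state for this key.

    Emits 'close'/'open' row events for SCD Type 2. A production consumer
    would buffer events for a grace period before finalizing closures;
    this sketch only guards against out-of-order events via event time.
    """
    out = []
    current = state.get(key)
    # Drop events at or before the version we already hold (late arrivals).
    if current and event_time <= current["valid_from"]:
        return out
    if current and current["attrs"] != attrs:
        out.append({"op": "close", "key": key,
                    "valid_from": current["valid_from"], "valid_to": event_time})
    if current is None or current["attrs"] != attrs:
        out.append({"op": "open", "key": key, "valid_from": event_time})
        state[key] = {"valid_from": event_time, "attrs": attrs}
    return out

state = {}
print(apply_change(state, "c1", 100, {"tier": "gold"}))      # opens v1
print(apply_change(state, "c1", 200, {"tier": "platinum"}))  # closes v1, opens v2
print(apply_change(state, "c1", 150, {"tier": "silver"}))    # late event, dropped
```

The grace period mentioned in the answer would replace the hard drop: late events within the window get merged into the ordering buffer before closures are finalized.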
What They Want to Hear: 'I treat schema evolution as a platform service, not a per-pipeline concern. Producers publish a schema contract that defines the fields, types, and compatibility guarantees. Consumers register their dependencies. The platform enforces compatibility rules at publish time: if a proposed change would break a registered consumer, the publish is rejected. This shifts schema validation from runtime failures to build-time rejections.' This is the answer that shows you think about schema evolution as a platform concern, not a per-pipeline fix.
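Publish-time enforcement can be sketched as a compatibility check against registered consumer dependencies. The dict-based schemas and the consumer registry here are stand-ins for a real schema registry; field names are made up:

```python
class IncompatibleSchemaError(Exception):
    pass

def check_backward_compatible(current: dict, proposed: dict,
                              consumer_deps: dict) -> None:
    """Reject a proposed schema that drops or retypes a field a consumer reads.

    Schemas are field-name -> type dicts; consumer_deps maps each registered
    consumer to the set of fields it depends on.
    """
    for consumer, fields in consumer_deps.items():
        for field in fields:
            if field not in proposed:
                raise IncompatibleSchemaError(
                    f"'{field}' removed but required by consumer '{consumer}'")
            if field in current and current[field] != proposed[field]:
                raise IncompatibleSchemaError(
                    f"'{field}' retyped {current[field]} -> {proposed[field]}, "
                    f"breaks consumer '{consumer}'")

current = {"user_id": "long", "email": "string", "signup_ts": "timestamp"}
deps = {"marketing_etl": {"user_id", "email"}}

# Dropping an unused field is accepted; dropping a depended-on field is rejected.
check_backward_compatible(current, {"user_id": "long", "email": "string"}, deps)
print("change accepted")
try:
    check_backward_compatible(current, {"user_id": "long"}, deps)
except IncompatibleSchemaError as e:
    print("rejected:", e)
```

Running this in CI against the registry is what turns runtime consumer failures into build-time rejections, as the answer describes.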
What They Want to Hear: 'At petabyte scale, backfill is a project, not a task. I start with a cost estimate: compute hours, storage reads, and expected duration. Then I design progressive backfill: process the most recent data first so consumers get value immediately, then work backwards in priority order. I set a daily cost cap and adjust concurrency to stay within budget. Each partition writes to a shadow table first; only after validation does it swap into production.' This is the answer that shows you treat large-scale backfill as a budgeted project, not a one-shot job.
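The progressive, budget-capped scheduling can be sketched as below. The partition names, per-partition cost, and daily budget are illustrative assumptions; the shadow-table write and validation swap happen inside each day's batch and are not shown:

```python
def plan_backfill_days(partitions: list, cost_per_partition: float,
                       daily_budget: float) -> list:
    """Group partitions into daily batches, most recent first.

    Concurrency per day is derived from the cost cap, matching the
    'set a daily cost cap and adjust concurrency' approach.
    """
    per_day = max(1, int(daily_budget // cost_per_partition))
    ordered = sorted(partitions, reverse=True)  # newest data delivers value first
    return [ordered[i:i + per_day] for i in range(0, len(ordered), per_day)]

# Hypothetical monthly partitions for one year, $400/partition, $2000/day cap.
partitions = [f"2024-{m:02d}" for m in range(1, 13)]
plan = plan_backfill_days(partitions, cost_per_partition=400.0,
                          daily_budget=2000.0)
print(len(plan), "days")  # 12 partitions at 5/day -> 3 days
print(plan[0])            # the five most recent months run first
```

Each batch would write to a shadow table, validate, and swap, so a failed day never touches production data.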