DataDriven

© 2026 DataDriven



Keeping Data Fresh

The staff-level incremental loading questions that separate hire from strong hire


Lesson Sections

  1. Hybrid Loading Strategies (concepts: paFullVsIncremental)

    What They Want to Hear: 'I run a cost crossover analysis. Incremental is cheaper when the delta is small relative to the full table. But when the delta exceeds roughly 30-40% of the table, a full refresh is actually cheaper because it avoids the matching overhead. My default: incremental daily with a scheduled full refresh weekly. For tables with high churn, I adjust the crossover threshold based on observed merge duration vs. full reload duration.' This is the answer that shows you think about incremental loading as a cost tradeoff, not a default.
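The crossover rule above can be sketched as a small decision function. This is an illustrative sketch, not code from the lesson: the function names, the 35% default, and the per-row cost model are all assumptions.

```python
def choose_refresh_strategy(delta_rows, table_rows, crossover=0.35):
    """Pick full vs. incremental refresh based on delta size.

    crossover: fraction of the table above which a full reload is
    assumed cheaper than a merge (the 30-40% heuristic from the text).
    """
    if table_rows == 0:
        return "full"  # empty target: nothing to merge against
    return "full" if delta_rows / table_rows >= crossover else "incremental"


def observed_crossover(merge_secs_per_row, full_secs_per_row):
    """Derive the crossover fraction from measured durations.

    Full cost ~ N * full_secs_per_row; incremental cost ~ D * merge_secs_per_row.
    Full refresh wins once D/N exceeds full_secs_per_row / merge_secs_per_row.
    """
    return full_secs_per_row / merge_secs_per_row
```

The second helper is how the 'adjust the threshold from observed durations' remark could be made concrete: if a merged row costs 4x a reloaded row, the crossover sits at 25% of the table.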

  2. CDC at Scale (concepts: paCdc)

    What They Want to Hear: 'Each source gets its own CDC connector, its own Kafka topic, and its own consumer. Failure isolation is the design principle: one source lagging does not block the others. I monitor three metrics per source: replication lag, event throughput, and error rate. When one falls behind, I diagnose independently: is it the WAL, the connector, or the consumer? Then I scale that one connector without touching the others.' This is the answer that shows you have operated CDC as a platform, not as a one-off pipeline.
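The per-source monitoring loop can be sketched as below. This is a minimal illustration, assuming hypothetical metric names and thresholds; it only shows the isolation property, where each source is evaluated independently.

```python
from dataclasses import dataclass


@dataclass
class SourceHealth:
    source: str
    replication_lag_s: float   # how far behind the source WAL/binlog we are
    events_per_s: float        # event throughput for this connector
    error_rate: float          # fraction of events that failed processing


def flag_lagging_sources(metrics, lag_threshold_s=60.0, error_threshold=0.01):
    """Return the sources that need independent diagnosis.

    Failure isolation: each source is judged on its own metrics, so one
    lagging connector never marks the others unhealthy.
    """
    return [
        m.source
        for m in metrics
        if m.replication_lag_s > lag_threshold_s or m.error_rate > error_threshold
    ]
```

A real deployment would pull these numbers from the connector's metrics endpoint; the point of the sketch is that the alerting and scaling decision is scoped to one source at a time.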

  3. SCD in Streaming (concepts: paScdPipeline)

    What They Want to Hear: 'In streaming, SCD Type 2 becomes a stateful operation. Each change event is compared against the current state in a key-value store. If the tracked attributes differ, the consumer emits a close event for the old version and an open event for the new version. The challenge is ordering: out-of-order events can close a row that was already updated by a later event. I handle this with event-time ordering and a grace period before finalizing row closures.' This is the answer that shows you understand streaming SCD as a stateful, ordering-sensitive problem.
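The compare-against-state step can be sketched with an in-memory dict standing in for the key-value store. This is an assumption-laden sketch: the event shape, the tracked attributes, and dropping late events (in place of a real grace-period buffer) are all illustrative.

```python
def apply_change(state, event, tracked=("tier", "region")):
    """Process one change event against per-key state; emit SCD2 open/close events.

    state: dict key -> {"attrs": {...}, "event_time": t}  (the KV store)
    event: {"key": ..., "event_time": t, "attrs": {...}}
    Returns the list of emitted events. Events older than the stored
    version are dropped, a simplification of the grace-period logic.
    """
    out = []
    key, t = event["key"], event["event_time"]
    current = state.get(key)
    if current is not None and t <= current["event_time"]:
        return out  # out-of-order: already superseded by a newer version
    new_attrs = {k: event["attrs"].get(k) for k in tracked}
    if current is None:
        out.append({"type": "open", "key": key, "attrs": new_attrs, "valid_from": t})
        state[key] = {"attrs": new_attrs, "event_time": t}
    elif new_attrs != current["attrs"]:
        out.append({"type": "close", "key": key, "valid_to": t})
        out.append({"type": "open", "key": key, "attrs": new_attrs, "valid_from": t})
        state[key] = {"attrs": new_attrs, "event_time": t}
    return out
```

A production version would buffer events for the grace period and reorder by event time before applying them, rather than discarding late arrivals outright.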

  4. Schema Contracts (concepts: paSchemaEvolution)

    What They Want to Hear: 'I treat schema evolution as a platform service, not a per-pipeline concern. Producers publish a schema contract that defines the fields, types, and compatibility guarantees. Consumers register their dependencies. The platform enforces compatibility rules at publish time: if a proposed change would break a registered consumer, the publish is rejected. This shifts schema validation from runtime failures to build-time rejections.' This is the answer that shows you think about schema evolution as a platform guarantee rather than a per-pipeline patch.
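The publish-time check can be sketched as a function over the current schema, the proposed schema, and the registered consumer dependencies. The schema representation (field name to type name) and the breaking-change rules here are simplifying assumptions, not the lesson's definitions.

```python
def breaking_changes(current_schema, proposed_schema, consumer_deps):
    """Return the violations that would justify rejecting a publish.

    current_schema / proposed_schema: dict field name -> type name
    consumer_deps: dict consumer name -> set of field names it reads
    A change breaks a consumer if a field it depends on is removed
    or its type changes.
    """
    violations = []
    for consumer, fields in consumer_deps.items():
        for field in fields:
            if field not in proposed_schema:
                violations.append((consumer, field, "removed"))
            elif proposed_schema[field] != current_schema.get(field):
                violations.append((consumer, field, "type changed"))
    return violations
```

An empty result means the publish is compatible with every registered consumer; anything else is rejected at publish time, which is exactly the shift from runtime failure to build-time rejection the answer describes.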

  5. Petabyte Backfill (concepts: paBackfill)

    What They Want to Hear: 'At petabyte scale, backfill is a project, not a task. I start with a cost estimate: compute hours, storage reads, and expected duration. Then I design a progressive backfill: process the most recent data first so consumers get value immediately, then work backwards in priority order. I set a daily cost cap and adjust concurrency to stay within budget. Each partition writes to a shadow table first; only after validation does it swap into production.' This is the answer that shows you run a backfill as a budgeted, validated project.
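The most-recent-first ordering under a daily cost cap can be sketched as a planner. This is an illustrative sketch under assumptions: uniform per-partition cost and date-string partition names; a real plan would estimate cost per partition and include the shadow-table validation step.

```python
def plan_backfill(partitions, cost_per_partition, daily_cap):
    """Order partitions newest-first and chunk them into days under a cost cap.

    partitions: list of partition date strings
    cost_per_partition: estimated compute cost for one partition
    daily_cap: maximum spend per day
    Returns a list of daily batches, most recent data scheduled first.
    """
    days, batch, spent = [], [], 0.0
    for p in sorted(partitions, reverse=True):  # newest first: consumers get value immediately
        if spent + cost_per_partition > daily_cap and batch:
            days.append(batch)          # day is full, start the next one
            batch, spent = [], 0.0
        batch.append(p)
        spent += cost_per_partition
    if batch:
        days.append(batch)
    return days
```

The batch size per day is effectively the concurrency knob: lowering the cap stretches the schedule but keeps each day inside budget, which is the tradeoff the answer describes.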
