

Architecture at Scale

Defend your architecture decisions with cost modeling and production patterns


Lesson Sections

  1. Lambda in Practice (concepts: paLambdaArch)

    What They Want to Hear: 'In practice, Lambda means running Spark batch alongside Spark Structured Streaming, both writing to the same serving layer. The speed layer writes to a real-time view (e.g., a Kafka-backed materialized view or a hot table). The batch layer overwrites the same data with corrected results daily. I use view unioning: the serving query reads from the batch table first, then overlays any newer data from the speed table. Consistency is eventual: the batch layer corrects any drift.'
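The view unioning described above can be sketched with in-memory dicts standing in for the batch and speed tables (table shapes and keys are illustrative; in production these would be Spark or warehouse tables, and the speed view would hold only rows newer than the batch watermark):

```python
def serve(batch_view: dict, speed_view: dict) -> dict:
    """Serving-layer union: start from the corrected batch results,
    then overlay the newer speed-layer rows on matching keys."""
    merged = dict(batch_view)   # batch layer: corrected, rewritten daily
    merged.update(speed_view)   # speed layer: newer data wins on overlap
    return merged

# Hypothetical order totals keyed by order id
batch_view = {"order-1": {"total": 100}, "order-2": {"total": 250}}
speed_view = {"order-2": {"total": 260}, "order-3": {"total": 40}}

print(serve(batch_view, speed_view))
# order-2 comes from the speed layer; order-1 from batch; order-3 is new
```

When the next batch run lands, it overwrites the batch view with corrected results and the speed view is truncated, which is what makes the consistency eventual.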

  2. When Kappa Works (concepts: paKappaArch)

    What They Want to Hear: 'Kappa works when three conditions are met: the event log retains enough history for full reprocessing, the streaming logic can handle both real-time and replay workloads, and the team has the operational maturity to run a streaming platform 24/7. For reprocessing, I deploy a second instance of the streaming job, point it at the beginning of the log, and write to a new output table. When the replay catches up to real-time, I swap the consumer to the new table. The old table is then dropped.'
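The replay-and-swap flow above can be sketched with a list standing in for the retained event log and dicts standing in for the output tables (all names and the processing logic are illustrative):

```python
def reprocess(log: list, process, live_offset: int) -> dict:
    """Second job instance: replay the log from offset 0 into a new
    output table, stopping once the replay reaches the live offset."""
    new_table = {}
    for offset, event in enumerate(log):
        key, value = process(event)
        new_table[key] = value          # later events overwrite earlier ones
        if offset + 1 >= live_offset:   # replay has caught up to real time
            break
    return new_table

# Retained event log: (user_id, value) events, oldest first
log = [("u1", 1), ("u2", 2), ("u1", 3)]

new_table = reprocess(log, lambda e: (e[0], e[1]), live_offset=len(log))
serving_table = new_table   # swap the consumer; the old table is dropped
print(serving_table)        # {'u1': 3, 'u2': 2}
```

The point of the pattern is that the "fixed" logic only ever exists in one codebase: the replay job is the same streaming job, started from the beginning of the log.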

  3. Pushdown Across Formats (concepts: paPredicatePushdown)

    What They Want to Hear: 'Pushdown behavior depends on the storage format. Parquet supports column pruning and row-group statistics filtering natively. Delta Lake adds data skipping with file-level min/max statistics and Z-ordering for multi-column predicates. Iceberg goes further with partition-level statistics and hidden partitioning that decouples the physical partition scheme from the query predicate. Dynamic partition pruning optimizes joins: when one side of a join filters partitions, the engine uses that filter at runtime to prune the scan of the other side.'
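The min/max data skipping mentioned above can be sketched as follows (file paths and statistics are hypothetical; a real engine reads them from Parquet row-group footers or the Delta/Iceberg metadata layer):

```python
files = [
    {"path": "part-0", "min_date": "2024-01-01", "max_date": "2024-01-31"},
    {"path": "part-1", "min_date": "2024-02-01", "max_date": "2024-02-29"},
    {"path": "part-2", "min_date": "2024-03-01", "max_date": "2024-03-31"},
]

def prune(files: list, lo: str, hi: str) -> list:
    """Keep only files whose [min, max] range can overlap the
    predicate `lo <= date <= hi`; everything else is skipped unread."""
    return [f["path"] for f in files
            if f["max_date"] >= lo and f["min_date"] <= hi]

print(prune(files, "2024-02-10", "2024-02-20"))   # ['part-1']
```

The same overlap test is what makes Z-ordering effective: clustering correlated columns tightens each file's min/max ranges, so more files fail the overlap check and get skipped.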

  4. Right-Sizing Clusters (concepts: paCostOptimization)

    What They Want to Hear: 'I right-size by measuring utilization first. If CPU utilization averages 30% on an $80K/month cluster, I am paying for 70% idle capacity. Three strategies: (1) Auto-scaling: scale executors based on workload, not peak capacity. (2) Spot instances for batch: 60-70% discount, with checkpointing for fault tolerance. (3) Reserved instances for baseline: commit to the minimum always-needed capacity at a 30-50% discount, and scale above that with on-demand or spot.'
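A back-of-envelope cost model makes the strategy mix concrete. All figures below are illustrative assumptions (mid-points of the discount ranges quoted above, and an assumed 40/60 baseline/burst split), not vendor pricing:

```python
on_demand_monthly = 80_000    # current all-on-demand cluster spend
baseline_share    = 0.4       # assumed always-needed capacity -> reserved
burst_share       = 0.6       # assumed scaled capacity -> spot
reserved_discount = 0.35      # mid-point of the 30-50% range
spot_discount     = 0.65      # mid-point of the 60-70% range

reserved_cost = on_demand_monthly * baseline_share * (1 - reserved_discount)
spot_cost     = on_demand_monthly * burst_share   * (1 - spot_discount)
total = reserved_cost + spot_cost

print(f"${total:,.0f}/month vs ${on_demand_monthly:,}/month on-demand")
# $37,600/month vs $80,000/month on-demand
```

Being able to walk through arithmetic like this is the difference between naming the strategies and defending them.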

  5. MERGE and UPSERT Patterns (concepts: paIdempotency)

    What They Want to Hear: 'MERGE matches source rows against target rows on a key. When matched, it updates. When not matched, it inserts. For idempotency, the key must be the business key (e.g., order_id), not a surrogate. To optimize a slow MERGE: (1) partition the target table and scope the MERGE to affected partitions only, (2) stage the delta into a temp table with dedup applied before merging, (3) if the delta exceeds 30% of the partition, switch to partition REPLACE instead.'
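The matched/not-matched semantics and the stage-plus-dedup step can be sketched with dicts standing in for the target and staging tables (row shapes are illustrative; in production this would be a MERGE INTO against Delta, Iceberg, or a warehouse):

```python
def merge(target: dict, delta: list) -> dict:
    """Idempotent upsert on the business key order_id."""
    # Stage + dedup: keep only the latest row per business key
    staged = {}
    for row in delta:
        staged[row["order_id"]] = row
    # MERGE semantics: update when matched, insert when not matched
    for key, row in staged.items():
        target[key] = row
    return target

target = {1: {"order_id": 1, "status": "new"}}
delta = [{"order_id": 1, "status": "shipped"},
         {"order_id": 2, "status": "new"},
         {"order_id": 1, "status": "delivered"}]   # duplicate key; latest wins

merge(target, delta)
merge(target, delta)   # replaying the same delta changes nothing: idempotent
print(target[1]["status"])   # delivered
```

Because the key is the business key, replaying a failed batch converges to the same target state instead of duplicating rows, which is the property the question is probing for.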
