Defend your architecture decisions with cost modeling and production patterns
What They Want to Hear: 'In practice, Lambda means running Spark batch alongside Spark Structured Streaming, both writing to the same serving layer. The speed layer writes to a real-time view (e.g., a Kafka-backed materialized view or a hot table). The batch layer overwrites the same data with corrected results daily. I use view unioning: the serving query reads from the batch table first, then overlays any newer data from the speed table. Consistency is eventual: the batch layer corrects any drift on its next run.'
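The view-unioning pattern above can be sketched in plain Python (the key `order_id` and the row shapes are hypothetical; in practice this would be a Spark SQL view over the two tables):

```python
def serve(batch_rows, speed_rows):
    """Serving-layer union: start from the batch view, then overlay the
    speed view. Speed-layer rows win for any key present in both, since
    they are newer than the last batch run."""
    merged = {row["order_id"]: row for row in batch_rows}        # batch layer first
    merged.update({row["order_id"]: row for row in speed_rows})  # speed overlay
    return list(merged.values())

batch = [{"order_id": 1, "total": 100}, {"order_id": 2, "total": 50}]
speed = [{"order_id": 2, "total": 55}, {"order_id": 3, "total": 20}]
result = {r["order_id"]: r["total"] for r in serve(batch, speed)}
# order 2 is served from the speed layer until the next batch run overwrites it
```

The eventual-consistency guarantee falls out of the update order: once the nightly batch job rewrites the batch table, the overlay for corrected keys becomes a no-op.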
What They Want to Hear: 'Kappa works when three conditions are met: the event log retains enough history for full reprocessing, the streaming logic can handle both real-time and replay workloads, and the team has the operational maturity to run a streaming platform 24/7. For reprocessing, I deploy a second instance of the streaming job, point it at the beginning of the log, and write to a new output table. When the replay catches up to real-time, I swap the consumer to the new table. The old table and job can then be retired.'
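A minimal sketch of the replay-and-swap step, assuming a hypothetical keyed event log and two versions of the aggregation logic (the second deployed to fix a bug in the first):

```python
def replay(log, aggregate):
    """Reprocess the full retained event log into a fresh output table."""
    table = {}
    for key, value in log:                       # read from the start of the log
        table[key] = aggregate(table.get(key, 0), value)
    return table

log = [("user1", 3), ("user2", 5), ("user1", 4)]

tables = {"v1": replay(log, lambda acc, v: acc + 1)}  # old logic: count of events
serving = "v1"

# Fixed logic needs a full history replay -- deploy a second job instance,
# write to a new table, then swap the consumer once the replay catches up.
tables["v2"] = replay(log, lambda acc, v: acc + v)    # new logic: sum of values
serving = "v2"                                        # swap; v1 can now be retired
```

The swap is atomic from the consumer's point of view: readers only ever see a fully reprocessed table, never a half-replayed one.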
What They Want to Hear: 'Pushdown behavior depends on the storage format. Parquet supports column pruning and row-group statistics filtering natively. Delta Lake adds data skipping with file-level min/max statistics and Z-ordering for multi-column predicates. Iceberg goes further with partition-level statistics and hidden partitioning that decouples the physical partition scheme from the query predicate. Dynamic partition pruning optimizes joins: when one side of a join filters partitions, the engine prunes the matching partitions on the other side at runtime.'
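File-level min/max data skipping reduces to a simple interval check. A toy illustration (file names and the date column are hypothetical; Delta and Iceberg store these statistics in their metadata layers):

```python
# Per-file min/max statistics for a date column, as a table's metadata might record
files = [
    {"path": "part-0", "min": "2024-01-01", "max": "2024-01-31"},
    {"path": "part-1", "min": "2024-02-01", "max": "2024-02-29"},
    {"path": "part-2", "min": "2024-03-01", "max": "2024-03-31"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the predicate range;
    everything else is skipped without being opened."""
    return [f["path"] for f in files if f["max"] >= lo and f["min"] <= hi]

matched = prune(files, "2024-02-10", "2024-02-20")  # only part-1 survives
```

Skipping is conservative: a surviving file may still contain zero matching rows, but a skipped file provably contains none.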
What They Want to Hear: 'I right-size by measuring utilization first. If CPU utilization averages 30% on an $80K/month cluster, I am paying for 70% idle capacity. Three strategies: (1) Auto-scaling: scale executors based on workload, not peak capacity. (2) Spot instances for batch: 60-70% discount, with checkpointing for fault tolerance. (3) Reserved instances for baseline: commit to the minimum always-needed capacity at a 30-50% discount, and scale above that with on-demand or spot.'
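The arithmetic behind the blended-pricing argument can be made concrete. The split below is hypothetical (40% of spend moved to reserved at a 40% discount, 30% to spot at 65% off, 10% kept on-demand, 20% idle capacity eliminated outright); the discount rates fall inside the ranges quoted above:

```python
def blended_cost(total, reserved_frac, reserved_disc, spot_frac, spot_disc, ondemand_frac):
    """Monthly cost after right-sizing: each slice of the original spend is
    repriced at its tier's discount; any fraction not listed is eliminated idle."""
    return (total * reserved_frac * (1 - reserved_disc)   # reserved baseline
            + total * spot_frac * (1 - spot_disc)         # spot for batch
            + total * ondemand_frac)                      # on-demand burst

current = 80_000  # USD/month, ~30% average CPU utilization
optimized = blended_cost(current, 0.40, 0.40, 0.30, 0.65, 0.10)
# 19,200 reserved + 8,400 spot + 8,000 on-demand = 35,600 -- a ~55% reduction
```

The point of showing the model in an interview is that each slice maps to a named strategy, so the savings claim is auditable rather than hand-waved.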
What They Want to Hear: 'MERGE matches source rows against target rows on a key. When matched, it updates. When not matched, it inserts. For idempotency, the key must be the business key (e.g., order_id), not a surrogate. To optimize a slow MERGE: (1) partition the target table and scope the MERGE to affected partitions only, (2) stage the delta into a temp table with dedup applied before merging, (3) if the delta exceeds 30% of the partition, switch to partition REPLACE instead.'
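The idempotency claim can be demonstrated with a minimal upsert sketch keyed on the business key (`order_id` and the row shapes are illustrative; in Delta Lake this would be a `MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED` statement):

```python
def merge(target, delta):
    """MERGE semantics on a business key: when matched, update;
    when not matched, insert. Replaying the same delta is a no-op."""
    for row in delta:
        target[row["order_id"]] = row   # upsert keyed on the business key
    return target

target = {1: {"order_id": 1, "status": "placed"}}
delta = [
    {"order_id": 1, "status": "shipped"},  # matched -> update
    {"order_id": 2, "status": "placed"},   # not matched -> insert
]
merge(target, delta)
merge(target, delta)  # second run changes nothing: the pipeline is retry-safe
```

With a surrogate key instead, each replay would insert fresh rows, which is exactly why retried loads duplicate data when the match key is wrong.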