Auto-Optimize at Scale

Concepts: Small Files

What They Want to Hear

"At platform scale, I cannot manually tune compaction for every table. I build a metadata-driven compaction service: it monitors file count and average file size per partition, scores partitions by read frequency times file count, and prioritizes compaction where the read improvement justifies the compute cost. Tables that are never queried do not get compacted; tables queried 1,000 times a day get compacted immediately. This is cost optimization as an engineering system, not a manual chore."

This is the answer that shows you think about optimization as a platform problem.

When Spark Is the Wrong Tool

Knowing when NOT to use Spark is the strongest staff-level signal. Spark's overhead (JVM startup, DAG planning, shuffle infrastructure) is wasteful for small datasets.
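The scoring logic behind such a compaction service can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `PartitionStats` shape, the thresholds, and the helper names are all hypothetical, and a real service would pull these numbers from table metadata and query logs.

```python
from dataclasses import dataclass

@dataclass
class PartitionStats:
    """Metadata for one partition (hypothetical shape; a real service
    would populate this from the catalog and query audit logs)."""
    table: str
    partition: str
    file_count: int
    avg_file_size_mb: float
    daily_reads: int

# Assumed thresholds, tuned per platform.
TARGET_FILE_SIZE_MB = 512   # files at or above this size are left alone
MIN_SCORE = 1_000           # below this, compaction compute isn't justified

def compaction_score(p: PartitionStats) -> float:
    """Read frequency times file count; zero when there is nothing to gain."""
    if p.file_count <= 1 or p.avg_file_size_mb >= TARGET_FILE_SIZE_MB:
        return 0.0
    return float(p.daily_reads * p.file_count)

def plan_compaction(stats: list[PartitionStats]) -> list[PartitionStats]:
    """Partitions worth compacting, highest expected read benefit first."""
    candidates = [p for p in stats if compaction_score(p) >= MIN_SCORE]
    return sorted(candidates, key=compaction_score, reverse=True)
```

Note how the score encodes the policy from the answer above: a never-queried table scores zero regardless of file count, while a heavily read, badly fragmented partition rises to the top of the queue.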