The staff-level Spark questions: AQE, GC tuning, and when Spark is the wrong tool
What They Want to Hear: 'AQE re-optimizes the query plan at runtime using actual data statistics from completed stages. Three key optimizations: it coalesces small post-shuffle partitions into larger ones, it switches join strategies when runtime statistics show one side is smaller than the optimizer predicted, and it handles skew by splitting large partitions into sub-partitions. The traditional optimizer uses table-level statistics that can be stale; AQE uses partition-level statistics that are measured at runtime and therefore always current.'
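The first of those optimizations, coalescing small post-shuffle partitions, can be sketched in plain Python. This is a toy model of the idea behind `spark.sql.adaptive.coalescePartitions.enabled`, not Spark's actual implementation; the 64MB target mirrors Spark's default advisory partition size, and the greedy-merge logic is an assumption for illustration.

```python
def coalesce_partitions(sizes_bytes, target_bytes=64 * 1024 * 1024):
    """Greedily merge adjacent post-shuffle partitions until each group
    reaches roughly target_bytes, the way AQE turns many tiny shuffle
    partitions into fewer, larger ones. Returns (start, end_exclusive)
    index ranges over the input partition list."""
    groups, start, acc = [], 0, 0
    for i, size in enumerate(sizes_bytes):
        acc += size
        if acc >= target_bytes:
            groups.append((start, i + 1))
            start, acc = i + 1, 0
    if start < len(sizes_bytes):  # leftover tail becomes its own group
        groups.append((start, len(sizes_bytes)))
    return groups
```

With a 30-byte target, partitions of sizes [10, 10, 10, 50, 5] collapse into three groups instead of five tasks: the three small ones merge, the large one stands alone, and the tail remains.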
What They Want to Hear: 'The cost-based optimizer uses table and column statistics to choose between join strategies, predicate ordering, and aggregation methods. When statistics are stale, the optimizer makes wrong decisions: it might sort-merge join a 50MB table instead of broadcasting it. I run ANALYZE TABLE periodically for critical tables, and I rely on AQE as a runtime fallback. For UDFs, the optimizer is blind: it cannot push predicates through a UDF or estimate its output cardinality.'
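The stale-statistics failure mode can be made concrete with a minimal sketch of the size-based join decision. The 10MB default matches Spark's `spark.sql.autoBroadcastJoinThreshold`; the function itself is a simplification for illustration, since the real optimizer weighs more than size.

```python
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default: 10MB

def choose_join_strategy(left_est_bytes, right_est_bytes,
                         threshold=AUTO_BROADCAST_THRESHOLD):
    """Pick a join strategy from *estimated* side sizes. When the
    estimate comes from stale statistics, a genuinely small table can
    look large and fall through to the slower sort-merge path."""
    if min(left_est_bytes, right_est_bytes) <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"
```

A 5MB dimension table gets broadcast, but if stale stats report it as 50MB, the same table is sort-merge joined, which is exactly the wrong decision the answer describes.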
What They Want to Hear: 'By default, shuffle data is stored on the executor that produced it. If that executor dies, all downstream stages that need its shuffle data must re-compute. The external shuffle service decouples shuffle storage from executors: shuffle files live on the node's local disk and survive executor restarts. This is critical for dynamic allocation, where executors are added and removed frequently. Without it, removing an idle executor destroys its shuffle data and causes expensive recomputation.'
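The setup this answer describes corresponds to a small set of Spark properties. The property names below are real; the numeric values are illustrative, not recommendations.

```python
# Properties typically set together for dynamic allocation backed by
# the external shuffle service (values shown are illustrative).
shuffle_service_conf = {
    "spark.shuffle.service.enabled": "true",          # shuffle files served from node-local disk
    "spark.dynamicAllocation.enabled": "true",        # executors scale with load
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "100",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",  # idle executors are released
}
```

With `spark.shuffle.service.enabled` set, releasing an idle executor no longer destroys its shuffle output, which is what makes aggressive scale-down safe.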
What They Want to Hear: 'High GC time means the JVM is spending too long cleaning up objects. In Spark, this usually means the executor is holding too many objects in memory: large hash maps from joins, cached datasets, or broadcast variables. My debugging process: (1) check GC logs for the ratio of young-gen vs full GC, (2) check the Spark UI for spill metrics on that executor, (3) check if a skewed partition is forcing one executor to hold disproportionate data. The fix is usually repartitioning.'
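Step (3) of that process, spotting the one executor doing disproportionate GC work, amounts to a ratio check over per-executor metrics. A minimal sketch, assuming metrics shaped as `{executor_id: (task_time_ms, gc_time_ms)}`; the 10% threshold is a common rule of thumb, not a Spark-defined constant.

```python
def flag_gc_heavy_executors(metrics, ratio_threshold=0.10):
    """Return executor IDs whose GC time exceeds ratio_threshold of
    their task time, i.e. the executors worth inspecting first when
    a stage is slow and skew or memory pressure is suspected."""
    return [
        executor_id
        for executor_id, (task_ms, gc_ms) in metrics.items()
        if task_ms > 0 and gc_ms / task_ms > ratio_threshold
    ]
```

If executor 2 reports 200ms of GC against 1000ms of task time while its peers sit near 5%, it is the one holding the skewed partition or the oversized hash map.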
What They Want to Hear: 'At platform scale, I cannot manually tune compaction for every table. I build a metadata-driven compaction service: it monitors file count and average file size per partition, scores partitions by read frequency times file count, and prioritizes compaction where the read improvement justifies the compute cost. Tables that are never queried do not get compacted. Tables queried 1000 times per day get compacted immediately. This is cost optimization as an engineering system.'
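The scoring loop described above can be sketched directly. The field names (`path`, `reads_per_day`, `file_count`, `avg_file_mb`) and the 64MB small-file cutoff are assumptions for illustration; the core logic is the answer's own: score = read frequency × file count, never-read partitions excluded.

```python
SMALL_FILE_MB = 64  # below this, files are "small" (illustrative cutoff)

def prioritize_compaction(partitions):
    """Rank partitions for compaction by reads_per_day * file_count,
    highest first. Partitions that are never read, or whose files are
    already large, are skipped entirely: compaction compute is only
    spent where reads will actually benefit."""
    scored = [
        (p["reads_per_day"] * p["file_count"], p["path"])
        for p in partitions
        if p["avg_file_mb"] < SMALL_FILE_MB and p["reads_per_day"] > 0
    ]
    return [path for _score, path in sorted(scored, reverse=True)]
```

A hot partition with 500 tiny files outranks everything; a partition nobody queries is never touched, no matter how fragmented, which is the cost argument the answer is making.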