Debug slow Spark jobs and defend your cluster sizing decisions
What They Want to Hear 'Every shuffle creates a stage boundary. Within a stage, all transformations run as a pipeline on each partition without data movement. Between stages, data must be redistributed. To optimize, I look at the Spark UI for the stage with the most shuffle read/write or the longest duration. That is where the bottleneck is. If one stage takes 90% of the time, that is the only stage worth optimizing.' This is the answer that shows you debug from metrics, not from guessing.
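The "one stage takes 90% of the time" claim is just Amdahl's law, and it is worth being able to do the arithmetic on the spot. A minimal sketch (the stage durations are made-up illustration numbers, not from a real job):

```python
# Overall job speedup from optimizing a single stage (Amdahl's law).
# Stage durations in seconds are hypothetical, for illustration only.
def overall_speedup(stage_seconds, target_stage, stage_speedup):
    total = sum(stage_seconds.values())
    optimized = (total - stage_seconds[target_stage]
                 + stage_seconds[target_stage] / stage_speedup)
    return total / optimized

stages = {"scan": 30, "shuffle_join": 540, "write": 30}  # join = 90% of 600s

print(overall_speedup(stages, "shuffle_join", 10))  # ~5.26x: worth the effort
print(overall_speedup(stages, "scan", 10))          # ~1.05x: barely moves the job
```

This is why the Spark UI drives the decision: a 10x win on a stage that is 5% of the runtime is invisible, while even a 2x win on the dominant stage is a large overall improvement.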
What They Want to Hear 'Catalyst is Spark's query optimizer. It takes my logical plan and rewrites it for efficiency. Three key optimizations: predicate pushdown moves filters as close to the data source as possible so fewer rows are read. Column pruning drops columns I never use so less data moves through the pipeline. Join reordering puts the smaller table on the build side of the join. I do not need to hand-optimize most of this because Catalyst does it, but I need to understand what it does.' This is the answer that shows you know what the optimizer already handles for you.
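A toy model makes the payoff of pushdown and pruning concrete. This is not Catalyst itself, just counting how many cells cross the source boundary; the table shape, selectivity, and column count are invented for illustration:

```python
# Toy model: data volume crossing the source boundary with and without
# predicate pushdown + column pruning. All numbers are hypothetical.
rows, cols = 1_000_000, 50
selectivity = 0.01      # the filter keeps 1% of rows
needed_cols = 3         # the query only touches 3 of the 50 columns

naive_cells = rows * cols                                # scan everything, filter later
optimized_cells = int(rows * selectivity) * needed_cols  # filter + prune at the source

print(naive_cells // optimized_cells)  # ~1666x less data read and moved
```

The ratio is why columnar formats like Parquet matter: pushdown and pruning only pay off fully when the source can actually skip rows and columns instead of reading them anyway.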
What They Want to Hear 'I choose the join strategy based on table sizes. If one side fits in memory (under 10MB by default, but I raise it to 100-500MB for medium tables), I broadcast it to avoid a shuffle entirely. For two large tables, sort-merge join is the default: both sides are sorted by the join key, then merged. If one key is heavily skewed, I salt it: append a random integer to the hot key, join on the salted key, then aggregate to remove the salt.' This is the answer that shows you have a strategy for each combination of table size and skew.
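The salting step can be sketched without a cluster. This toy simulation (made-up skew profile, plain Python instead of DataFrames) shows how appending a random salt spreads one hot key across multiple partitions:

```python
import random
from collections import Counter

random.seed(0)
SALT_BUCKETS = 8

# 90% of rows share one hot key -- a classic skew profile (invented data).
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Without salting, every "hot" row hashes to the same partition: 9000 rows, one task.
plain = Counter(keys)

# With salting, the hot key becomes hot#0 .. hot#7, spread across 8 tasks.
salted = Counter(
    f"{k}#{random.randrange(SALT_BUCKETS)}" if k == "hot" else k for k in keys
)
hot_buckets = [v for k, v in salted.items() if k.startswith("hot#")]

print(plain["hot"])      # 9000 rows on one task
print(max(hot_buckets))  # roughly 9000/8 per task
```

One detail worth saying out loud in the interview: the other side of the join must be exploded with every salt value (each "hot" row there becomes 8 rows, hot#0 through hot#7), otherwise salted keys have nothing to match.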
What They Want to Hear 'I use the 5-core rule as a starting point. Each executor gets 5 cores, which balances parallelism against HDFS throughput and JVM overhead. For 2TB of data with 128MB target partitions, I need about 16,000 partitions. With 5 cores per executor and 20 executors, I process 100 tasks concurrently. The job finishes in ceiling(16,000 / 100) = 160 waves. If each wave takes 30 seconds, that is about 80 minutes.' This is the answer that shows you can do the back-of-envelope math.
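The arithmetic above is easy to check. Run with decimal units (2 TB = 2,000,000 MB), the exact figures are 15,625 partitions and 157 waves, which the answer rounds to 16,000 and 160:

```python
import math

# Back-of-envelope cluster sizing for the scenario in the answer above.
data_mb = 2_000_000        # 2 TB in decimal units
partition_mb = 128         # target partition size
cores_per_executor = 5     # the 5-core rule
executors = 20

partitions = math.ceil(data_mb / partition_mb)     # 15,625 (~16,000)
concurrent_tasks = cores_per_executor * executors  # 100 tasks per wave
waves = math.ceil(partitions / concurrent_tasks)   # 157 (~160)
minutes = waves * 30 / 60                          # 78.5 (~80 minutes)

print(partitions, waves, minutes)
```

Rounding to 16,000 / 160 / 80 is exactly the level of precision a back-of-envelope estimate should carry; quoting 78.5 minutes would imply accuracy the inputs do not have.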
What They Want to Hear 'I run compaction as a scheduled maintenance job. For Delta Lake tables, OPTIMIZE rewrites small files into target-sized files without rewriting the entire table. For Iceberg, the rewrite_data_files action does the same. I schedule compaction after the pipeline writes and before downstream reads, so readers always see optimized files. I also set auto-compaction for streaming tables that produce many small files per micro-batch.' This is the answer that shows you treat compaction as routine operations work, not a one-off fix.
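What OPTIMIZE and rewrite_data_files do under the hood is essentially bin packing: many small files are combined into files near the target size. A toy sketch of that idea (first-fit packing, invented file sizes; the real implementations are more sophisticated):

```python
# Toy sketch of compaction: greedily pack small files into ~128MB outputs.
# First-fit bin packing; file sizes are invented for illustration.
TARGET_MB = 128

def compact(file_sizes_mb, target=TARGET_MB):
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:                      # reuse an output file with room left
            if sum(b) + size <= target:
                b.append(size)
                break
        else:                               # otherwise start a new output file
            bins.append([size])
    return bins

small_files = [4] * 200  # 200 x 4MB files, e.g. from streaming micro-batches
compacted = compact(small_files)
print(len(small_files), "->", len(compacted), "files")  # 200 -> 7 files
```

Fewer, larger files means fewer tasks, less file-open overhead, and less metadata for the table format to track, which is why compaction is scheduled maintenance rather than a one-off cleanup.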