Debug slow Spark jobs and defend your cluster sizing decisions
What They Want to Hear 'Every shuffle creates a stage boundary. Within a stage, all transformations run as a pipeline on each partition without data movement. Between stages, data must be redistributed. To optimize, I look at the Spark UI for the stage with the most shuffle read/write or the longest duration. That is where the bottleneck is. If one stage takes 90% of the time, that is the only stage worth optimizing.' This is the answer that shows you debug from metrics, not from guessing.
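The "one stage takes 90% of the time" claim is just Amdahl's law, and it is worth being able to do the arithmetic on the spot. A minimal sketch (the stage durations are made-up illustration numbers, not from a real job):

```python
# Overall job speedup from optimizing a single stage (Amdahl's law).
# Stage durations in seconds are hypothetical, for illustration only.
def overall_speedup(stage_seconds, target_stage, stage_speedup):
    total = sum(stage_seconds.values())
    optimized = (total - stage_seconds[target_stage]
                 + stage_seconds[target_stage] / stage_speedup)
    return total / optimized

stages = {"scan": 30, "shuffle_join": 540, "write": 30}  # join = 90% of 600s

print(overall_speedup(stages, "shuffle_join", 10))  # ~5.26x: worth the effort
print(overall_speedup(stages, "scan", 10))          # ~1.05x: barely moves the job
```

This is why the Spark UI drives the decision: a 10x win on a stage that is 5% of the runtime is invisible, while even a 2x win on the dominant stage is a large overall improvement.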
What They Want to Hear 'Catalyst is Spark's query optimizer. It takes my logical plan and rewrites it for efficiency. Three key optimizations: predicate pushdown moves filters as close to the data source as possible so fewer rows are read. Column pruning drops columns I never use so less data moves through the pipeline. Join reordering puts the smaller table on the build side of the join. I do not need to hand-optimize most of this because Catalyst does it, but I need to understand what it does.' This is the answer that shows you know what the optimizer already handles for you.
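A toy model makes the payoff of pushdown and pruning concrete. This is not Catalyst itself, just counting how many cells cross the source boundary; the table shape, selectivity, and column count are invented for illustration:

```python
# Toy model: data volume crossing the source boundary with and without
# predicate pushdown + column pruning. All numbers are hypothetical.
rows, cols = 1_000_000, 50
selectivity = 0.01      # the filter keeps 1% of rows
needed_cols = 3         # the query only touches 3 of the 50 columns

naive_cells = rows * cols                                # scan everything, filter later
optimized_cells = int(rows * selectivity) * needed_cols  # filter + prune at the source

print(naive_cells // optimized_cells)  # ~1666x less data read and moved
```

The ratio is why columnar formats like Parquet matter: pushdown and pruning only pay off fully when the source can actually skip rows and columns instead of reading them anyway.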
What They Want to Hear 'I choose the join strategy based on table sizes. If one side fits in memory (under 10MB by default, but I raise it to 100-500MB for medium tables), I broadcast it to avoid a shuffle entirely. For two large tables, sort-merge join is the default: both sides are sorted by the join key, then merged. If one key is heavily skewed, I salt it: append a random integer to the hot key, join on the salted key, then aggregate to remove the salt.' This is the answer that shows you have a strategy for each combination of table size and skew.
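The salting step can be sketched without a cluster. This toy simulation (made-up skew profile, plain Python instead of DataFrames) shows how appending a random salt spreads one hot key across multiple partitions:

```python
import random
from collections import Counter

random.seed(0)
SALT_BUCKETS = 8

# 90% of rows share one hot key -- a classic skew profile (invented data).
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Without salting, every "hot" row hashes to the same partition: 9000 rows, one task.
plain = Counter(keys)

# With salting, the hot key becomes hot#0 .. hot#7, spread across 8 tasks.
salted = Counter(
    f"{k}#{random.randrange(SALT_BUCKETS)}" if k == "hot" else k for k in keys
)
hot_buckets = [v for k, v in salted.items() if k.startswith("hot#")]

print(plain["hot"])      # 9000 rows on one task
print(max(hot_buckets))  # roughly 9000/8 per task
```

One detail worth saying out loud in the interview: the other side of the join must be exploded with every salt value (each "hot" row there becomes 8 rows, hot#0 through hot#7), otherwise salted keys have nothing to match.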
What They Want to Hear 'I use the 5-core rule as a starting point. Each executor gets 5 cores, which balances parallelism against HDFS throughput and JVM overhead. For 2TB of data with 128MB target partitions, I need about 16,000 partitions. With 5 cores per executor and 20 executors, I process 100 tasks concurrently. The job finishes in ceiling(16,000 / 100) = 160 waves. If each wave takes 30 seconds, that is about 80 minutes.' This is the answer that shows you can do the back-of-envelope math.
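The arithmetic above is easy to check. Run with decimal units (2 TB = 2,000,000 MB), the exact figures are 15,625 partitions and 157 waves, which the answer rounds to 16,000 and 160:

```python
import math

# Back-of-envelope cluster sizing for the scenario in the answer above.
data_mb = 2_000_000        # 2 TB in decimal units
partition_mb = 128         # target partition size
cores_per_executor = 5     # the 5-core rule
executors = 20

partitions = math.ceil(data_mb / partition_mb)     # 15,625 (~16,000)
concurrent_tasks = cores_per_executor * executors  # 100 tasks per wave
waves = math.ceil(partitions / concurrent_tasks)   # 157 (~160)
minutes = waves * 30 / 60                          # 78.5 (~80 minutes)

print(partitions, waves, minutes)
```

Rounding to 16,000 / 160 / 80 is exactly the level of precision a back-of-envelope estimate should carry; quoting 78.5 minutes would imply accuracy the inputs do not have.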
What They Want to Hear 'I run compaction as a scheduled maintenance job. For Delta Lake tables, OPTIMIZE rewrites small files into target-sized files without rewriting the entire table. For Iceberg, the rewrite_data_files action does the same. I schedule compaction after the pipeline writes and before downstream reads, so readers always see optimized files. I also set auto-compaction for streaming tables that produce many small files per micro-batch.' This is the answer that shows you treat compaction as routine operations work, not a one-off fix.
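What OPTIMIZE and rewrite_data_files do under the hood is essentially bin packing: many small files are combined into files near the target size. A toy sketch of that idea (first-fit packing, invented file sizes; the real implementations are more sophisticated):

```python
# Toy sketch of compaction: greedily pack small files into ~128MB outputs.
# First-fit bin packing; file sizes are invented for illustration.
TARGET_MB = 128

def compact(file_sizes_mb, target=TARGET_MB):
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:                      # reuse an output file with room left
            if sum(b) + size <= target:
                b.append(size)
                break
        else:                               # otherwise start a new output file
            bins.append([size])
    return bins

small_files = [4] * 200  # 200 x 4MB files, e.g. from streaming micro-batches
compacted = compact(small_files)
print(len(small_files), "->", len(compacted), "files")  # 200 -> 7 files
```

Fewer, larger files means fewer tasks, less file-open overhead, and less metadata for the table format to track, which is why compaction is scheduled maintenance rather than a one-off cleanup.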