What About Data Skew?
Concepts: paShuffleOptimization, paDistributedPrimitives
"Your join is fast for most keys but one key takes 10x longer. What is happening and how do you fix it?" Data skew is the single most common root cause of slow Spark jobs in production. Every interviewer expects you to handle it. Detecting Skew In the Spark UI, skew shows up as one task running dramatically longer than others in the same stage. The Tasks tab shows the min, median, and max task duration. If max is 50x the median, one partition holds massively more data than the others. Check the shuffle read size per task - the long task will show orders of magnitude more bytes. Salting: The Standard Fix Salting splits a hot key into N artificial keys, distributing its rows across N partitions. Add a random salt (0 to N-1) to the skewed side, replicate the other side N times with each sal