Join Optimization

Concepts: Shuffle Optimization

What They Want to Hear

'I choose the join strategy based on table sizes. If one side fits in memory (the broadcast threshold is 10MB by default, but I raise it to 100-500MB for medium tables), I broadcast it to avoid a shuffle entirely. For two large tables, sort-merge join is the default: both sides are shuffled and sorted by the join key, then merged. If one key is heavily skewed, I salt it: append a random integer to the hot key on the skewed side, replicate the matching rows on the other side once per salt value, join on the salted key, then drop the salt (or aggregate it away if the join feeds an aggregation).'

This is the answer that shows you have debugged slow joins in production.
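The salting steps in that answer can be sketched in plain Python, without a cluster. This is only an illustration under stated assumptions: `salted_join`, `plain_join`, and `NUM_SALTS` are hypothetical names, the tables are small in-memory lists of (key, value) pairs, and the count of rows per salted key stands in for how an engine would spread work across partitions. It is not how any particular engine implements the join.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical salt range; in practice tuned to the degree of skew


def salted_join(large, small, num_salts=NUM_SALTS, seed=0):
    """Inner-join two lists of (key, value) pairs on a salted key.

    large: the skewed side; each row gets one random salt.
    small: the other side; each row is replicated once per salt value.
    Returns (joined rows with the salt dropped, row count per salted key).
    """
    rng = random.Random(seed)

    # 1. Append a random salt to every key on the skewed side.
    salted_large = [((k, rng.randrange(num_salts)), v) for k, v in large]

    # 2. Replicate the other side across all salt values so that every
    #    salted key still finds its match.
    lookup = defaultdict(list)
    for k, v in small:
        for s in range(num_salts):
            lookup[(k, s)].append(v)

    # 3. Join on the salted key, then drop the salt from the output rows.
    joined = [
        (k, lv, rv)
        for (k, s), lv in salted_large
        for rv in lookup.get((k, s), [])
    ]

    # Rows per salted key: a stand-in for per-partition load.
    load = Counter(salted_key for salted_key, _ in salted_large)
    return joined, load


def plain_join(large, small):
    """Reference inner join on the unsalted key, used to check results."""
    lookup = defaultdict(list)
    for k, v in small:
        lookup[k].append(v)
    return [(k, lv, rv) for k, lv in large for rv in lookup.get(k, [])]
```

A quick check of the idea: with `large = [("hot", i) for i in range(1000)] + [("cold", 0)]` and `small = [("hot", "H"), ("cold", "C")]`, the salted join returns the same rows as `plain_join`, while the 1000 "hot" rows are split across up to `NUM_SALTS` salted keys instead of landing on one. The cost is that the other side grows by a factor of `NUM_SALTS`, which is why salting is usually applied only to the hot keys.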