Optimize This Join

Concepts: paShuffleOptimization, paDistributedPrimitives

The join question goes beyond 'use a broadcast join.' The interviewer wants you to reason about shuffle internals, Tungsten memory management, and when AQE's automatic strategy switching helps or hurts. Sort-Merge Join Internals Sort-merge join is Spark's default for two large tables. Both sides are shuffled by the join key, then each partition is sorted. The merge phase walks both sorted partitions with two pointers, producing matches in O(n + m). The cost: two full shuffles plus two sorts. The sort uses Tungsten's off-heap memory for binary comparison, avoiding Java object overhead. Hash shuffle was removed in Spark 2.0 in favor of sort-based shuffle (SortShuffleManager). Sort-based shuffle produces one file per map task (not one file per map-reduce pair), dramatically reducing file hand