
Optimize This Join

Concepts: paShuffleOptimization, paDistributedPrimitives

"You're joining a 500GB fact table with a 2GB dimension table and it's slow. How do you fix it?" The answer they want: broadcast the dimension table. But the follow-ups go deeper. Join Strategies in Spark The broadcast threshold is controlled by spark.sql.autoBroadcastJoinThreshold, default 10MB. If one side of the join is below this threshold, Spark ships the entire table to every executor, eliminating the shuffle entirely. For dimension tables up to ~1-2GB, you can force a broadcast even above the threshold. Bucketed Joins When both sides are large and you join them repeatedly on the same key, bucketing eliminates the shuffle at read time. You pre-shuffle the data at write time so that matching keys land in the same bucket file. Subsequent joins on the bucket key skip the shuffle entirel