Shuffle Operations

Concepts: paShuffleOptimization

What They Want to Hear

'A shuffle redistributes data across executors. It happens when Spark needs to group or join data by a key and the matching rows are spread across different partitions. Shuffles are expensive because every executor must write its data to disk and send it over the network, and every receiving executor must read and merge it. The number one way to avoid an unnecessary shuffle is a broadcast join: if one side of the join is small enough to fit in memory, broadcast it to every executor so no shuffle is needed.'

That is the answer. Shuffle = redistribute = expensive. Broadcast = avoid the shuffle.
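The two ideas above can be sketched in plain Python (this is an illustration of the mechanics, not Spark code; the function names `shuffle_partition` and `broadcast_join` are made up for this sketch). In actual PySpark you would hint the optimizer with `df.join(broadcast(small_df), "key")` using `pyspark.sql.functions.broadcast`:

```python
# Plain-Python sketch of the two concepts, under the assumption that a
# "partition" is just a list of dict rows. Not Spark APIs.

def shuffle_partition(rows, key, num_partitions):
    """What a shuffle does: redistribute rows so that all rows sharing
    a key land in the same partition. Each append stands in for a
    disk write plus a network transfer in real Spark."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def broadcast_join(big_rows, small_rows, key):
    """What a broadcast join does: copy the small side to every
    executor as an in-memory lookup table, so the big side is joined
    in place and never moves across the network."""
    lookup = {row[key]: row for row in small_rows}  # small side fits in memory
    return [
        {**big, **lookup[big[key]]}
        for big in big_rows
        if big[key] in lookup
    ]

orders = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 50}]
users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Bo"}]
joined = broadcast_join(orders, users, "user_id")
```

The key property of `shuffle_partition` is the guarantee that matching keys meet in one place; the key property of `broadcast_join` is that this guarantee is achieved by copying the small side instead of moving the big one.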