
Stage Boundaries

Concepts: SparkExecutionModel

What They Want to Hear

'Every shuffle creates a stage boundary. Within a stage, all transformations run as a pipeline on each partition without data movement. Between stages, data must be redistributed. To optimize, I look at the Spark UI for the stage with the most shuffle read/write or the longest duration. That is where the bottleneck is. If one stage takes 90% of the time, that is the only stage worth optimizing.'

This is the answer that shows you debug from metrics, not from guessing.
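The pipelining claim can be illustrated with a toy model in plain Python (no Spark involved; all function names here are invented for the illustration): narrow transformations fuse into a single pass over each partition, while a shuffle materializes every record and redistributes it by key, which is exactly where Spark draws a stage boundary.

```python
# Toy model of Spark's stage execution (illustration only, not the Spark API).
# Stage 1: narrow ops (filter + map) pipeline within each partition.
def stage1(partition):
    # One fused pass per partition -- no data movement between partitions.
    return [(x % 3, x * 2) for x in partition if x > 1]

# Stage boundary: a shuffle hashes every record to a target partition.
def shuffle(partitions, num_out):
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out].append((key, value))
    return out

# Stage 2: per-key aggregation, again pipelined within each new partition.
def stage2(partition):
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals

input_partitions = [[1, 2, 3], [4, 5, 6]]
shuffled = shuffle([stage1(p) for p in input_partitions], num_out=2)
result = {}
for part in shuffled:
    result.update(stage2(part))  # keys are disjoint across partitions
print(result)  # → {2: 14, 0: 18, 1: 8}
```

In real Spark, `stage1` and `stage2` correspond to the task sets you see as separate stages in the UI, and the `shuffle` step is what shows up as shuffle read/write bytes on the stage boundary between them.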