The Catalyst Optimizer

Concepts: paDistributedPrimitives

What They Want to Hear

'Catalyst is Spark's query optimizer. It takes my logical plan and rewrites it for efficiency. Three key optimizations: Predicate pushdown moves filters as close to the data source as possible, so fewer rows are read. Column pruning drops columns I never use, so less data moves through the pipeline. Join reordering uses table statistics, when they are available, to pick a cheaper join order, and the planner puts the smaller table on the build side of the join. I do not need to hand-optimize most of this because Catalyst does it, but I need to understand what it does so I know when it cannot help.'

This is the answer that shows you trust the optimizer but know its limits.
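To make predicate pushdown and column pruning concrete, here is a toy sketch in plain Python. It is not Catalyst's code and does not use Spark. The node classes (`Scan`, `Filter`, `Project`), the functions `push_down_filter`, `prune_columns`, and `optimize`, and the `users` table are all hypothetical names invented for illustration. Join reordering is left out because it depends on table statistics, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Toy logical plan nodes (hypothetical; not Catalyst classes).
@dataclass(frozen=True)
class Scan:
    table: str
    columns: Tuple[str, ...]             # columns read from the source
    pushed_filter: Optional[str] = None  # column the source filters on, if any


@dataclass(frozen=True)
class Filter:
    column: str    # the single column this filter's predicate references
    child: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


def push_down_filter(plan):
    """Move each Filter toward the Scan and fold it into the Scan when possible."""
    if isinstance(plan, Filter):
        child = push_down_filter(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            # The projection passes the filtered column through unchanged,
            # so the filter can run below it.
            pushed = push_down_filter(Filter(plan.column, child.child))
            return Project(child.columns, pushed)
        if isinstance(child, Scan) and child.pushed_filter is None:
            # Hand the filter to the data source.
            return Scan(child.table, child.columns, pushed_filter=plan.column)
        return Filter(plan.column, child)
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filter(plan.child))
    return plan


def prune_columns(plan, required=None):
    """Drop columns from each Scan that nothing above it needs."""
    if isinstance(plan, Project):
        return Project(plan.columns, prune_columns(plan.child, set(plan.columns)))
    if isinstance(plan, Filter):
        needed = None if required is None else required | {plan.column}
        return Filter(plan.column, prune_columns(plan.child, needed))
    if isinstance(plan, Scan):
        if required is None:
            return plan  # no projection above: every column is needed
        needed = set(required)
        if plan.pushed_filter is not None:
            needed.add(plan.pushed_filter)  # still read the filtered column
        kept = tuple(c for c in plan.columns if c in needed)
        return Scan(plan.table, kept, plan.pushed_filter)
    return plan


def optimize(plan):
    return prune_columns(push_down_filter(plan))


# Roughly "select name, age from users", then "filter on age".
plan = Filter(
    "age",
    Project(("name", "age"), Scan("users", ("id", "name", "age", "email"))),
)
optimized = optimize(plan)
print(optimized)
```

In the optimized plan the filter has moved from the top of the tree into the scan, and the scan reads two columns instead of four. In real Spark you can look for the same effects by calling `DataFrame.explain(True)`, which prints the parsed, analyzed, and optimized logical plans along with the physical plan.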