Inside Catalyst: The Four Phases

A Spark performance specialist who opens a slow query reads the four stages it passes through on its way to running, because the trouble almost always lives in one specific stage. You already know Catalyst rewrites your query. This tier walks the assembly line itself: how a query is analyzed against the catalog, rewritten by rules, turned into a physical plan with real join strategies, and finally compiled. Optimization is not one step but four, and every one of them is a place you can read and reason about.

Phase One, Analysis: Resolving What You Named

Daily Life

Interviews

Catalyst's first move is unglamorous and unavoidable: it resolves your query against the catalog. When you wrote select category from products, the optimizer does not yet know that products is a real table, that category is a real column, or what type that column is. Analysis binds every name you used to a concrete thing in the catalog, the metadata store that knows which tables and columns exist and their types.

This is where the errors you actually see come from. A misspelled column name, a table that does not exist, a type mismatch in a comparison: these are caught in analysis, because analysis is the moment Spark checks your named references against reality. The output of this phase is the resolved logical plan, a version of your query where every column and table is bound to its real definition and every type is known. Nothing has been optimized yet; the query has only been made concrete and checked for sense.

It is useful to know that the plan starts even rougher than resolved. When you first write the query, Spark holds an unresolved logical plan in which names like products and category are mere placeholders with no binding yet. Analysis is the pass that turns that unresolved tree into the resolved one by looking every placeholder up in the catalog. This is why the same query text can fail in one session and resolve in another: if the table exists in the catalog the session points at, the placeholder binds; if it does not, analysis cannot complete and you get the cannot-be-resolved error. The text alone never settles the query; the catalog it resolves against settles the rest.

Bind names to the catalog

Tables and columns you referenced are matched to real definitions and types.

Catch reference errors here

Unknown columns, missing tables, and type mismatches fail in analysis, before any optimization.

Produce the resolved logical plan

A concrete, type-checked version of your query, ready for the optimizer to rewrite.

Understanding analysis explains a behaviour that confuses people coming from the lazy-evaluation lesson. Because transformations are lazy, you might expect a wrong column name to do nothing until an action. But DataFrame operations often resolve eagerly enough that a bad column name fails when you write it, because Spark analyzed that step against the catalog immediately. Analysis is the line between names you typed and columns that exist, and it runs early so you find out early.

Resolution does more quiet work than catching typos. It is also where implicit type coercions are inserted and where ambiguity is detected. When you join order_items to products and both carry a product_id, analysis is what decides whether a later reference to product_id is unambiguous or whether it must reject the query and demand you qualify it. When you compare an integer column to a decimal literal, analysis inserts the cast that makes the comparison well-typed. The AnalysisException you have almost certainly seen, the one that says a column is ambiguous or cannot be resolved, is this phase refusing to hand a query it does not fully understand to the optimizer. The payoff is that the optimizer never has to cope with an unresolved name.

The catalog itself is worth a mention because it returns in the cost-based optimization section. It does not only know names and types; it can also hold statistics about each table, how many rows, how large, the distribution of values in a column. Analysis binds the structure; later phases lean on those statistics to make cost decisions. A catalog with rich, fresh statistics gives the optimizer good information; a catalog with none forces it to guess, which is the seed of the problems the advanced tier returns to.

Two kinds of failure both feel like errors but live in different phases, and separating them saves you debugging time. A name that does not resolve, a missing table, a type that cannot be coerced: those are analysis failures, and they happen before any data is touched, often the instant you build the DataFrame. A row that violates a runtime expectation, a division by zero or a value that overflows a cast, is not an analysis failure at all; it happens during execution, long after the plan is fixed. Knowing which phase owns a given error tells you where to look: an unresolved-column message is a spelling or schema problem you can fix without running anything, while a runtime arithmetic error means the plan was sound and the data surprised it.

Phase Two, Logical Optimization: The Rule Rewrites

Daily Life

Interviews

With a resolved plan in hand, Catalyst enters the phase most people mean when they say optimization: logical optimization, a battery of rule-based rewrites that transform the plan into an equivalent but cheaper one. These are the optimizations from the beginner tier, now placed in their proper home. Each rule is a small, provably-correct transformation, and Catalyst applies them repeatedly until the plan stops changing.

The headline rules are the ones that move and shrink work. Predicate pushdown moves filters down toward the data source, so fewer rows flow through the expensive operations above. Projection pushdown, or column pruning, drops columns the query never uses, so only needed data is carried. Constant folding evaluates expressions that do not depend on the data once, at planning time. There are dozens more, but these three carry most of the everyday wins, and they share a principle: do less work, sooner, without changing the answer.

A subtle but high-value rule is null propagation through joins, which lets the optimizer convert an outer join to an inner join when a downstream filter makes the outer rows impossible. If you left-join customers to orders and then filter on orders.total being positive, the null rows that the outer join would have produced for customers with no orders can never pass that filter, so Catalyst rewrites the outer join as an inner one, which is cheaper to execute. You did not ask for an inner join, and the result is identical, but the optimizer proved the outer rows are dead and stopped paying for them. Rewrites like this are why a senior engineer trusts the optimizer with join shape and spends their attention on the statistics that feed the cost decisions instead.

Rule	What it does	The win
Predicate pushdown	Move filters toward the scan	Fewer rows enter every later step
Projection pushdown	Drop unused columns early	Less data read and carried
Constant folding	Evaluate constants once at plan time	No per-row recomputation
Boolean simplification	Simplify redundant conditions	Cheaper, clearer filters

Two more rules earn their keep often enough to name. Filter combination merges adjacent filters into one, so a query that filters on in_stock in one step and on price in another evaluates a single combined predicate rather than two passes. Limit pushdown moves a limit toward the source where it is legal, so a query that asks for the first ten rows after a projection can stop reading early instead of materialising everything and then trimming. Neither is dramatic on its own, but together with pushdown and pruning they explain why a deliberately roundabout query and a clean one so often land on byte-identical physical plans: the rules grind both toward the same canonical form.

What makes this phase trustworthy is that every rule provably preserves the result. The rewrites are mathematically equivalent transformations, so the rewritten plan always returns what you asked for. This is why you can write your query in whatever order is clearest and rely on logical optimization to fix the ordering: the rules will push your late filter down and prune your unused columns regardless of how you wrote them, with a guarantee of correctness.

It is worth distinguishing this rule-based phase from the cost-based decisions that come later, because the difference matters for predictability. Rule-based optimization is deterministic and always applies: a filter always pushes down where it legally can, full stop. Cost-based optimization, coming in the physical planning phase, makes choices that depend on estimates and can therefore be wrong. Logical optimization is the part of Catalyst you can almost always trust blindly; the cost-based part is the one to watch.

The mechanism behind all of this rewards a glance, because it explains why Catalyst is extensible rather than a fixed engine. The plan is a tree, and each rule is a function that pattern-matches a shape in that tree and rewrites it: match a Filter sitting above a Project, swap their order, and you have pushed a projection. Catalyst runs batches of these rules to a fixed point, meaning it keeps applying them until a full pass changes nothing, which is how a single pushdown can cascade. That uniform tree-and-rule design is also why a library can register its own rules: a data source can add a rule that recognises when a filter can be handed to its native engine, which is how source-specific pushdown gets wired in.

Phase Three, Physical Planning: Choosing How to Execute

Daily Life

Interviews

Logical optimization produces an optimized logical plan that says what to compute but not how. Physical planning is where Catalyst decides the how: it generates one or more physical plans, concrete strategies for actually running each operation, and chooses among them. This is the phase where a groupBy becomes a specific aggregation strategy and, most importantly, where a join becomes a specific join algorithm.

Focus on the join strategy choice, where physical planning has its largest effect on performance. The same logical join can be executed as a broadcast hash join, where a small side is shipped to every executor and no big-side shuffle happens, or as a sort-merge join, where both sides are shuffled and sorted by key, or as a shuffle hash join. Catalyst picks based on its estimate of the sizes involved: if it believes one side is small enough, it chooses broadcast and avoids a shuffle; if both look large, it falls back to sort-merge.

•Logical plan (the WHAT)

Says: join these two on this key
Says: aggregate by this column
No execution strategy chosen yet
Produced by rule-based optimization

•Physical plan (the HOW)

Says: broadcast hash join, or sort-merge
Says: hash aggregate, in this many partitions
Concrete, runnable strategies
Chosen using size estimates (cost-based)

A concrete threshold sits behind the broadcast decision, and knowing it demystifies the choice. Spark broadcasts a side when its estimated size is under a configurable limit, ten megabytes by default. So the products dimension in the seed tables, small by any real measure, is a natural broadcast candidate, while the orders fact table is not. The word estimated carries the weight here: Spark broadcasts based on what it thinks the side weighs, not what it actually weighs, and the whole apparatus of statistics exists to keep that estimate trustworthy. A broadcast join collects the small side to the driver and ships it whole to every executor, so each executor probes it locally with no shuffle of the large side at all.

Because the join choice depends on estimated sizes, this is the first place the optimizer can get it wrong. If Catalyst underestimates a side it broadcasts, it can trigger a broadcast that is too large and overwhelms memory; if it overestimates a side that is actually small, it shuffles when it could have broadcast and skipped the shuffle. The physical plan is where good or bad statistics turn into a fast or slow execution, which is why the next section covers cost-based optimization and the one after covers reading the plan it produced.

It helps to connect this to the shuffle lesson. Every join strategy is really a decision about shuffles: broadcast avoids the big-side shuffle entirely, sort-merge accepts two shuffles plus two sorts, shuffle hash shuffles both sides but skips the sort. So physical planning is, in large part, the optimizer deciding how many shuffles your query will pay for. When you read a physical plan and count its Exchange nodes, you are reading the outcome of these strategy choices.

The join whose strategy the optimizer picks

The query below joins order_items to products and aggregates. In the physical plan, Catalyst would choose a join strategy here from the estimated sizes: a broadcast if products looks small, a sort-merge if both look large. Run it, and keep in mind the strategy is a planning decision Catalyst made, never something you wrote.

(order_items
   .join(products, "product_id")
   .groupBy("category")
   .agg(F.sum("quantity").alias("units"))
   .orderBy(F.col("units").desc()))

You wrote join, not broadcast hash join or sort-merge join; physical planning chose which one to use. On real data with fresh statistics, Catalyst would see that products is a small dimension and pick a broadcast, removing the big-side shuffle. With stale statistics it might pick sort-merge and shuffle both sides. The strategy is invisible in your code and decisive in your runtime, so reading the physical plan to see which one Catalyst chose is a core skill.

The aggregation in that same query hides a second physical choice worth naming. A groupBy can be run as a hash aggregate, which builds a hash map of group key to running total in memory, or as a sort aggregate, which sorts the rows by key and then sums adjacent runs. Hash aggregation is usually faster and is the default, but it needs the partial aggregation map to fit in memory; when the key space is huge or the input is already sorted, Spark may choose the sort variant instead. Crucially, Spark computes partial aggregates on each partition first and only shuffles the compact partials, so a sum over millions of order_items rows moves a tiny fraction of the data across the network. That partial-then-combine pattern is automatic, and it explains why a groupBy count costs so much less than its row count would suggest.

Cost-Based Optimization: When Statistics Drive the Plan

Daily Life

Interviews

The choices physical planning makes are only as good as the size estimates behind them, and cost-based optimization, CBO, is the part of Catalyst that tries to make those estimates accurate using real statistics about your tables. Without statistics, Catalyst falls back to crude heuristics, guessing sizes from file bytes and rule-of-thumb selectivity. With statistics, it can estimate the size of each intermediate result and choose join orders and strategies that minimise the total work.

Statistics come from analyzing a table: row counts, the number of distinct values in a column, min and max, null counts, sometimes histograms of the value distribution. With these, the optimizer can estimate, for example, that filtering on a particular value will keep roughly one percent of the rows, and therefore that the filtered result is small enough to broadcast. It can also choose the order to join several tables in, joining the most selective pair first so the intermediate results stay small. Crude heuristics botch these decisions; good statistics nail them.

With statistics (CBO on)	Without statistics	The consequence
Accurate size estimates per step	Crude guesses from file bytes	Right vs wrong join strategy
Selectivity from value distribution	Rule-of-thumb percentages	Right vs wrong broadcast decision
Cost-based join ordering	Join in the order written	Small vs large intermediate results

Join ordering is the part of CBO that pays off most as queries grow. Take a query that joins orders, order_items, products, and customers. Those four tables can be joined in many orders, and the orders are not equal: joining the two that produce the smallest intermediate result first keeps every later join small, while a bad order builds a massive intermediate that then has to be joined again. Without statistics, Catalyst largely joins in the order you wrote, which may be arbitrary. With row counts and selectivity, it estimates the size of each candidate intermediate and picks an order that keeps the running result small. On a star-schema query this one decision can turn minutes into seconds.

TIP

This is why a senior engineer runs ANALYZE TABLE to compute statistics on tables that feed important queries. Frame it in an interview like this: "the optimizer's join and broadcast choices come from size estimates, so stale or missing statistics make it guess; computing statistics gives it the information to choose correctly."

A selective filter the optimizer prices

Cost-based optimization estimates how many rows a filter keeps, and that estimate decides downstream choices. The fill-in below filters products to a high rating before aggregating; a good estimate of how selective that filter is lets Catalyst size the work correctly. Complete the filter and the aggregation.

> From products, keep only highly rated items (rating at least 4.5), then return the average price per category, highest average first. The optimizer estimates how many rows the rating filter keeps to plan the aggregation.

(products
   .___(F.col("rating") >= 4.5)
   .groupBy("category")
   .agg(F.___("price").alias("avg_price"))
   .orderBy(F.col("avg_price").desc(), F.col("category").asc()))

filter

avg

groupBy

sum

If the optimizer has fresh statistics, it estimates accurately how few products clear a 4.5 rating, and it can plan the aggregation for that small result. If its statistics are stale, it might assume far more rows survive and size the work wrong. The accuracy of that one selectivity estimate ripples through every choice the physical plan makes after it.

Histograms are the refinement that makes selectivity estimates honest on skewed data. Without one, the optimizer assumes values are spread evenly, so it estimates that a filter on rating keeps a fraction proportional to the width of the range you asked for. But ratings are not uniform; most products cluster at the high end. A histogram records how many rows fall in each band of values, so the optimizer can see that rating at least 4.5 actually keeps a large share, not the small slice a uniform assumption would predict. The cost of computing histograms is real, so you enable them for the columns that drive important filter and join decisions, not for every column blindly.

The catch, which the advanced tier dwells on, is that statistics go stale. A table analyzed last month and then loaded with ten times more data now lies to the optimizer: it reports the old, small size, and Catalyst makes decisions for a table that no longer exists. Stale statistics hurt more than missing ones, because the optimizer trusts them. Cost-based optimization is powerful when fed fresh numbers and a source of mysterious slow plans when the numbers rot, which is one of the main reasons adaptive query execution exists to re-decide at runtime.

Reading a Physical Plan

Daily Life

Interviews

All four phases end in a physical plan, and being able to read it is the practical payoff of understanding the phases. The plan is a tree of operators, printed by explain, and you read it from the bottom up because that is the order data flows: leaves are scans, and each operator consumes the output of the one below. A handful of operators and markers carry most of the meaning, and once you know them the plan reads as a diagnosis.

In the physical plan	What it means	What to do with it
Exchange	A shuffle (a wide operation)	Count them; each is a stage boundary and a cost
BroadcastExchange	A small side broadcast for a join	Confirms a broadcast hash join was chosen
*(n) marker	Whole-stage codegen fused n operators	Those operators run as one compiled loop
PushedFilters	A filter pushed to the scan	Confirms your filter reads less data
HashAggregate / SortAggregate	The aggregation strategy chosen	Hash is usual; sort implies a sorted input

The first thing to do with any plan is count the Exchange nodes, because each one is a shuffle and shuffles are where the cost lives. A plan with one Exchange shuffled once; three Exchanges means three reorganisations of the data. The second thing is to check the join operators: a BroadcastExchange feeding a BroadcastHashJoin tells you Catalyst chose to broadcast, which is usually what you want for a large-to-small join, while a SortMergeJoin with two Exchanges below it tells you both sides shuffled.

TIP

When you read a plan under adaptive query execution, expect to see AdaptiveSparkPlan at the top and some operators wrapped as query stages. AQE re-plans between stages using the real sizes it measured, so the plan you read after the run can differ from the one Catalyst first chose. If you see a join that started as a SortMergeJoin get rewritten to a BroadcastHashJoin in the final plan, that is AQE catching a size estimate the static plan got wrong, which is the failure mode the advanced tier covers.

The *(n) markers are Tungsten showing its work, and they are the bridge to the advanced tier. Each marker means that a run of operators was fused, by whole-stage code generation, into a single compiled function that runs as one tight loop. A plan dense with high *(n) markers is one where Spark generated efficient fused code; operators outside the markers are the ones that could not be fused, often because of an opaque UDF or an RDD step, so the markers show you where your declarative chain broke down.

Reading bottom up is a habit worth drilling once on a real shape. Take the join-and-aggregate query from the physical planning section. At the very bottom you would find two scans, one for order_items and one for products, each possibly carrying a PushedFilters note. Above the small side sits a BroadcastExchange, then a BroadcastHashJoin that consumes both. Above the join, a partial HashAggregate computes per-partition sums, then an Exchange shuffles those partials by category, then a final HashAggregate combines them, and a Sort and a TakeOrderedAndProject handle the orderBy at the top. Read in that order, the plan traces the data's journey instead of reading as a wall of operators.

Put together, reading a physical plan is a repeatable diagnostic. Count the Exchanges to find your shuffles. Check the join operators to see the strategies Catalyst chose, and whether they match what you expected from the sizes. Confirm your filters show up as PushedFilters. Note where the codegen markers stop, because that is where optimization stops. An engineer who can do this in thirty seconds on an explain output is reading the optimizer's decisions directly, and that is the skill this tier builds toward.

✓Do

Read a query as four phases: analysis, logical optimization, physical planning, codegen.
Trust rule-based rewrites (pushdown, pruning) as always-correct; watch cost-based join choices.
Run ANALYZE TABLE so CBO has fresh statistics to choose join order and strategy.
Read the physical plan bottom up: count Exchanges, check join operators, confirm PushedFilters.

✗Don't

Don't assume the join strategy is optimal; it came from estimates that can be wrong.
Don't let table statistics go stale; the optimizer trusts them and stale stats mislead it.
Don't ignore where the *(n) codegen markers stop; that is where optimization broke down.
Don't confuse the logical plan (what) with the physical plan (how); the costs live in the how.

❯❯❯PUTTING IT ALL TOGETHER

> A query that joins three tables has gotten slow since one of the tables grew, and the physical plan shows a SortMergeJoin with shuffles where you expected a broadcast. The tables were last analyzed months ago.

The SortMergeJoin means Catalyst did not think the small side was small enough to broadcast.

You suspect the stale statistics: the optimizer is using an old, larger size estimate for a side that is actually broadcastable.

You run ANALYZE TABLE to refresh the statistics so cost-based optimization sees the real sizes.

With fresh stats, the plan switches to a BroadcastHashJoin, removing the big-side shuffle, and you confirm the change in the new explain output.

KEY TAKEAWAYS

Analysis binds your names to the catalog and catches reference and type errors before optimization.

Logical optimization applies rule-based rewrites (pushdown, pruning, folding) that always preserve the result.

Physical planning chooses how to execute, especially the join strategy, from estimated sizes.

Cost-based optimization uses table statistics to pick join order and strategy; stale stats mislead it.

A physical plan reads bottom up: count Exchanges, check join operators, confirm pushdown, find codegen.

A query flows through four stages, and the trouble is always in one of them.

Category: Spark
Difficulty: intermediate
Duration: 14 minutes
Challenges: 2 hands-on challenges

Topics covered: Phase One, Analysis: Resolving What You Named, Phase Two, Logical Optimization: The Rule Rewrites, Phase Three, Physical Planning: Choosing How to Execute, Cost-Based Optimization: When Statistics Drive the Plan, Reading a Physical Plan

Lesson Sections

Phase One, Analysis: Resolving What You Named
Catalyst's first move is unglamorous and unavoidable: it resolves your query against the catalog. When you wrote select category from products, the optimizer does not yet know that products is a real table, that category is a real column, or what type that column is. Analysis binds every name you used to a concrete thing in the catalog, the metadata store that knows which tables and columns exist and their types. This is where the errors you actually see come from. A misspelled column name, a ta
Phase Two, Logical Optimization: The Rule Rewrites
With a resolved plan in hand, Catalyst enters the phase most people mean when they say optimization: logical optimization, a battery of rule-based rewrites that transform the plan into an equivalent but cheaper one. These are the optimizations from the beginner tier, now placed in their proper home. Each rule is a small, provably-correct transformation, and Catalyst applies them repeatedly until the plan stops changing. The headline rules are the ones that move and shrink work. Predicate pushdow
Phase Three, Physical Planning: Choosing How to Execute
Logical optimization produces an optimized logical plan that says what to compute but not how. Physical planning is where Catalyst decides the how: it generates one or more physical plans, concrete strategies for actually running each operation, and chooses among them. This is the phase where a groupBy becomes a specific aggregation strategy and, most importantly, where a join becomes a specific join algorithm. Focus on the join strategy choice, where physical planning has its largest effect on
Cost-Based Optimization: When Statistics Drive the Plan
The choices physical planning makes are only as good as the size estimates behind them, and cost-based optimization, CBO, is the part of Catalyst that tries to make those estimates accurate using real statistics about your tables. Without statistics, Catalyst falls back to crude heuristics, guessing sizes from file bytes and rule-of-thumb selectivity. With statistics, it can estimate the size of each intermediate result and choose join orders and strategies that minimise the total work. Statisti
Reading a Physical Plan
All four phases end in a physical plan, and being able to read it is the practical payoff of understanding the phases. The plan is a tree of operators, printed by explain, and you read it from the bottom up because that is the order data flows: leaves are scans, and each operator consumes the output of the one below. A handful of operators and markers carry most of the meaning, and once you know them the plan reads as a diagnosis. The first thing to do with any plan is count the Exchange nodes,