Reading the Plan: DAG, Stages, and explain()

A job that ran in four minutes yesterday is taking twelve today. Before touching the code, an experienced engineer opens the DAG, the picture Spark draws of the plan it built from your lazy chain. You already know transformations are lazy and an action runs them. Here we look at what Spark builds in that lazy gap: a directed graph of your operations that, read correctly, is a map of where your job spends its time. Some edges in that graph are nearly free and others are ruinously expensive, and most of this tier is learning to tell them apart at a glance.

The DAG: Your Plan as a Graph

Daily Life

Interviews

When you build a chain of transformations, Spark records it as a DAG, a directed acyclic graph. Directed because the data flows one way, from source to result. Acyclic because it never loops back on itself. Each transformation you wrote is a node, and the edges show how data flows from one operation into the next. The DAG is the concrete form of the plan that laziness let Spark assemble: it is everything you described, captured as a graph, waiting for an action to execute it.

The reason the DAG matters, rather than being an internal detail, is that its shape predicts cost. Not all edges in the graph are equal. Some operations let data flow straight through, each output piece built from one input piece, with nothing crossing the network. Others force a reorganisation where data has to be regrouped across the whole cluster. The graph makes the difference visible: the cheap operations chain together in a line, and the expensive ones cut the graph into separate pieces.

•Narrow edge (cheap)

map, filter, select, withColumn
Each output partition from ONE input partition
Links into the chain, no graph cut
Data flows straight through, in place

⚠Wide edge (expensive)

groupBy, join, distinct, repartition
Each output needs MANY input partitions
Cuts the graph; starts a new stage
Data is regrouped across the network

Keep that distinction, narrow versus wide, in view; it is the lens for the rest of this tier. A narrow transformation produces each output partition from exactly one input partition, so it can run wherever that input already lives, with no movement. A wide transformation produces each output partition from many input partitions, because rows that belong together, all the orders for one region, all the rows for one join key, are scattered across the cluster and have to be brought together first. That bringing-together is a shuffle, and a wide edge is where one happens.

Reading cost off the shape of the graph

Once you see the DAG as alternating cheap chains and expensive cuts, you can read a job's cost from its shape before you run it. A long line of narrow operations with no cuts is a job that streams through in one pass and costs almost nothing beyond reading the data. A graph chopped into many pieces by wide operations is a job that shuffles repeatedly, and each cut is a place where the whole dataset is reorganised across the network. When someone asks why a job is slow, the answer is usually pointing at one of those cuts and saying: this is where it pays.

The word shuffle deserves a moment, because it is the most expensive thing Spark routinely does and it is what every wide edge implies. When you group order_items by product_id, the rows for any one product are scattered across every partition on every executor, because nothing arranged them by product when they were read. To sum each product Spark must first move every row so that all rows for a given product land on the same machine. That movement writes data to disk on the sending side, sends it over the network, and reads it back on the receiving side. Next to a narrow operation that touches data in place, a shuffle is orders of magnitude more expensive, and that is why the graph cut it creates is where your eye should go first.

Narrow guarantees something stronger than just cheap, and it is worth stating precisely. A narrow operation promises that each output partition is built from exactly one input partition, which means the work can happen entirely on the machine that already holds that input, with no coordination between machines at all. A filter on orders runs independently on every partition with no machine needing to know what any other machine is doing. That independence is what lets Spark run all the partitions in parallel and pipeline the narrow steps together. The wide operation breaks the independence: now a machine cannot finish its piece without data from other machines, and that dependency is precisely the cost.

Counting Stages Is Counting Shuffles

Daily Life

Interviews

The DAG has a structure the Spark UI exposes directly, and it is the most useful thing to understand about reading a job: the graph is divided into stages, and the boundaries between stages fall at the wide operations. A stage is a run of work that needs no data movement, a maximal chain of narrow operations that can all pipeline together. The moment a wide operation appears and data must shuffle, the current stage ends and a new one begins on the far side of the shuffle.

This gives you a rule you can apply on sight, and interviewers ask for it constantly: the number of stages equals the number of shuffles plus one. One shuffle splits a job into two stages. Two shuffles, three stages. A job with no wide operations is a single stage. So when you open the Spark UI and see a job broke into four stages, you know it shuffled three times, and those three shuffles are where you start looking for cost.

stages = shuffles + 1

the rule to memorise

Reading the UI this way turns a vague slow job into a precise question. You do not ask why is Spark slow; you ask which stage dominates the runtime, and then which shuffle fed it. The stage view sorts by duration, so the expensive stage is right at the top, and the shuffle that created it is the wide operation just before it in your code. The DAG and the stage list are two views of the same plan: the DAG shows the shape, the stage list shows where the time went.

A two-stage job, made concrete

The demo below has exactly one wide operation, a groupBy, so it runs in two stages: everything up to the groupBy is stage one, and the aggregation after the shuffle is stage two. Run it, and as you do, picture the cut: the read and any narrow work happen on each partition in place, then the shuffle regroups by category, then the second stage sums each group.

(products
   .filter(F.col("in_stock") == 1)
   .groupBy("category")
   .agg(F.count(F.lit(1)).alias("n"))
   .orderBy(F.col("n").desc(), F.col("category").asc()))

The filter in that chain is narrow, so it costs nothing extra; it pipelines into the read and shrinks the data before the shuffle. The groupBy is the one wide operation, and it is the stage boundary. If you added a join before the groupBy, you would have two wide operations, hence two shuffles, hence three stages. Counting them is mechanical now, which is what you want under interview pressure.

There is a subtlety in the stages-equal-shuffles-plus-one rule that separates people who memorised it from people who can use it. Not every operation that feels like a boundary is one, and a few wide-looking calls fold into a single shuffle rather than each adding their own. A groupBy immediately followed by an aggregation is one shuffle, not two, because the aggregation rides on the regrouping the groupBy already did. Conversely, a single high-level call can hide more than one shuffle: a join whose two sides both need repartitioning can show two Exchanges. So the reliable way to count is not to count the wide-sounding words in your code, but to count the actual stage boundaries in the UI or the Exchange nodes in the plan. The rule is exact; the trick is counting the right things.

Why does the count matter so much in practice? Because each shuffle is not just a cost, it is a synchronisation point. A stage cannot begin until the shuffle feeding it has fully completed, which means the whole cluster waits for the slowest task of the previous stage before any task of the next stage starts. So shuffles do not merely move data; they create barriers where straggler tasks hold everyone up. A job with five stages has four such barriers, four moments where one slow partition stalls the entire cluster. When you reduce the number of shuffles in a job, you are not only saving the data-movement cost of each one, you are removing a barrier where the cluster idles waiting on its slowest member.

Logical vs Physical Plan: Reading explain()

Daily Life

Interviews

The DAG in the UI is the visual form of the plan; explain is the textual form, and it shows you two related things. When you call explain on a DataFrame, Spark prints the plan it intends to run, and at the detailed level it shows both the logical plan, what you asked for, and the physical plan, how Spark will actually execute it. The gap between the two is the optimisation that laziness made possible.

You read a plan tree from the bottom up, because that is the order data flows: the leaves are the scans that read your tables, and each operator above consumes the output of the one below it, until the top of the tree is your final result. So the bottom tells you what gets read and what filter got pushed down to the read. If your WHERE clause shows up as a PushedFilter at the scan, the optimiser moved it to the earliest possible point, and you are reading less data. If it shows up as a separate filter operator higher in the tree, it did not push down, and that is worth investigating.

In the physical plan	What it means	Why you care
Exchange	A shuffle: data reorganised across the network	Each one is a stage boundary and a cost
PushedFilters	A filter moved down to the data source	You read less data; this is good
*(n) marker	Whole-stage code generation fused n operators	Those operators run as one tight loop
BroadcastExchange	A small side shipped to every executor	A join avoided a full shuffle

Hunt for Exchange first. Every Exchange in the physical plan is a shuffle, so counting the Exchanges is another way to count your shuffles, and seeing where they sit tells you which part of your logic is expensive. A plan with one Exchange shuffled once. A plan with three Exchanges is doing three full reorganisations of the data, and if the job is slow, those are the three places to question. explain turns your intuition about narrow and wide into something you can verify.

TIP

When an interviewer asks how you would confirm a job shuffles where you think it does, the answer is explain. "I would read the physical plan bottom up and count the Exchange nodes; each one is a shuffle. I would also check whether my filter shows up as a PushedFilter at the scan." Now you are reading what Spark will do rather than guessing at it.

There is a second payoff to reading plans: it reveals decisions Spark made that you did not write. You never asked Spark to broadcast a table, but if the plan shows a BroadcastExchange, Spark decided one side was small enough to ship everywhere and skip a shuffle. You never asked it to fuse your filter and select into one loop, but the *(n) markers show it did. The plan is where the optimiser shows its work, and reading it is how you learn what your lazy chain really becomes.

Build the habit of calling explain before you run anything expensive, not after it disappoints you. The plan is available the moment you have a DataFrame, because it is built lazily without running the job, so you can inspect what Spark intends to do for free. Engineers who reach for explain early catch a missing pushdown or an unexpected extra shuffle before they have burned an hour of cluster time on it. The plan is a prediction, and reading it in advance beats reconstructing what went wrong afterward.

The reason there are two plans at all, logical and physical, maps cleanly onto something you already know from databases. The logical plan is the what: it says join order_items to products, filter to in-stock, group by category, in terms of relational operators, with no commitment to how. The physical plan is the how: it picks an actual join algorithm, decides whether to broadcast the small side or shuffle both, and inserts the Exchange nodes that move data. This is the same logical-versus-physical split a SQL planner makes, where the same query can be executed by a hash join or a merge join depending on what the optimiser judges cheaper. Spark's Catalyst optimiser is doing the work your database's planner does, and explain is its EXPLAIN.

One concrete pushdown story makes the value obvious. Suppose your orders table is stored as Parquet partitioned by order_date, and you filter to a single day. If the filter pushes down, the physical plan shows a PartitionFilter at the scan and Spark reads only that one day's files, skipping the rest of the table entirely without ever opening it. If the filter does not push down, perhaps because you wrapped order_date in a function the optimiser could not see through, the scan reads every day and the filter runs afterward on all of it. The two plans produce identical answers, but one reads a day and the other reads a year. The only way to tell which you wrote is to read the plan, so confirm the pushdown rather than hoping for it.

Pipelining: Why Narrow Ops Are Nearly Free

Daily Life

Interviews

Inside a single stage, between two shuffle boundaries, something elegant happens that explains why narrow operations cost almost nothing. Spark does not run your filter over the whole partition, write the result, then run your select over that, then write again. It fuses the narrow operations into a single pass: each row flows through the filter, then the select, then any other narrow steps, one row at a time, never landing in between. This is pipelining, and it is why a long chain of narrow operations is barely more expensive than one.

Whole-stage code generation is the mechanism behind it. Rather than calling a generic function for each operator on each row, Spark generates a single piece of custom code for the entire narrow chain in a stage and compiles it, so the whole stage runs as one tight loop over the data. The operators stop being separate steps and become one fused computation. Those *(n) markers in the physical plan are Spark telling you exactly which operators got fused into one generated function.

One pass, not many

Narrow operators in a stage run as a single fused loop; a row flows through all of them before the next row starts.

Nothing lands in between

No intermediate results are written to memory or disk between narrow steps, so there is almost no overhead.

The shuffle is the wall

Pipelining stops at a wide operation, because the next stage cannot start until the shuffle delivers its data.

This is why narrow versus wide pays off at the keyboard. Adding a filter or a computed column to a stage is nearly free, because it just joins the fused loop. Adding a groupBy or a join is expensive, because it ends the pipeline, forces a shuffle, and starts a fresh stage on the other side. So pack as much narrow work as possible into each stage, and treat each new wide operation as a deliberate decision with a real cost, not a free convenience.

Stack narrow work, pay once

The fill-in below builds a stage that does several narrow things in a row, a filter and a column computation, before any wide operation. All of that narrow work pipelines into a single pass. Supply the narrow operation that keeps only the rows you want and the one that computes the new value.

> From order_items, keep only rows with quantity above 1, then compute a line total as quantity times unit_price, returning item_id and that total. Both operations are narrow, so they fuse into one pass over the data.

(order_items
   .___(F.col("quantity") > 1)
   .___("line_total", F.col("quantity") * F.col("unit_price"))
   .select("item_id", "line_total"))

filter

withColumn

groupBy

join

Both the filter and the withColumn are narrow, so they cost one fused pass together, not two separate passes. Had you slipped a groupBy between them, you would have cut the stage in two and paid for a shuffle in the middle. As you write, know which of your operations join the free pipeline and which one starts a new expensive stage.

Pipelining also explains a piece of advice that can sound like folklore: filter as early as you can. The reason is not just that a filter shrinks the data, though it does. It is that a filter placed before a shuffle runs inside the cheap pipelined stage and reduces the number of rows that have to cross the network at the wide boundary. The same filter placed after the shuffle still shrinks the data, but only after you have already paid to move the rows it would have removed. Filtering order_items down to one region before a groupBy means the shuffle moves one region's rows; filtering after means the shuffle moved every region's rows and then you threw most away. The optimiser pushes filters down for this very reason, but writing them early makes the intent unmistakable and does not depend on the optimiser seeing through your code.

There is a limit to pipelining worth naming so you do not over-trust it. The fused loop is fast precisely because nothing lands in between, but that also means a stage holds no checkpoint of its intermediate work. If a task in a long narrow stage fails partway, there is nothing saved to resume from; the whole stage's work for that partition reruns from the start of the stage. For most jobs this is fine because narrow stages are cheap. But it is the reason a very long narrow chain is not automatically a free lunch: the pipeline is cheap to run and cheap to re-run, but it is the wide boundaries, where Spark does persist shuffle output, that give recovery its natural restart points. The advanced tier builds directly on this.

DAG vs Lineage: The Plan and the Recovery History

Daily Life

Interviews

Two words get used loosely and even interchangeably, and a careful candidate keeps them apart: the DAG and the lineage. They are built from the same dependency information, but they are read for opposite purposes and in opposite directions. Confusing them is a common interview stumble, and distinguishing them cleanly lands well.

•The DAG (the plan to run)

Read forward: source to result
Used to SCHEDULE the job into stages
Answers: how will this execute?
Lives for the duration of the job

•The lineage (the recovery history)

Read backward: result to source
Used to RECOVER a lost partition
Answers: how was this rebuilt?
Lets Spark recompute instead of re-read

The DAG is forward-looking. Spark reads it from the source toward the result to decide how to schedule the work: where the stages fall, what runs in parallel, which tasks depend on which. It is the plan of execution, and once the job finishes, the DAG has done its job.

Lineage is backward-looking. For each partition of data, Spark remembers the exact chain of transformations that produced it from its source. That record is the lineage, and its purpose is fault recovery: if a machine dies and takes a partition with it, Spark does not need to restart the whole job or re-read everything. It walks the lineage backward from the lost partition, finds the inputs it was built from, and recomputes just that one partition. The same dependency graph that the DAG reads forward to schedule, the lineage reads backward to heal.

So the one-sentence answer to DAG versus lineage: they are the same dependency graph used two ways, the DAG forward to plan execution and the lineage backward to recover from failure. That framing answers the question better than reciting two separate definitions, because it captures why both exist and why they are related. The advanced tier takes lineage seriously as the foundation of Spark's fault tolerance, including when it becomes expensive and how to cut it.

Be clear about the direction of each read, because that is where people slip. Scheduling reads forward because the scheduler has to decide what to run first, and the first thing to run is the source scan at the front of the graph, then the operations that depend on it, advancing toward the result. Recovery reads backward because it starts from a known loss, a specific partition that vanished, and has to ask what produced this, and what produced that, walking toward the source until it reaches data that still exists. Same edges, opposite traversal, opposite starting points: one starts at the source and aims at the result, the other starts at a hole and aims at the nearest surviving inputs. If you can state which direction each one walks and why, you have understood the relationship more deeply than a definition gives you.

✓Do

Read the DAG and the physical plan as narrow chains cut by wide shuffles.
Count stages to count shuffles (stages = shuffles + 1) and find the expensive one.
Read explain() bottom up; count Exchange nodes and check for PushedFilters.
Pack narrow work into each stage; treat every new wide op as a deliberate cost.

✗Don't

Don't treat all transformations as equal cost; narrow pipelines, wide shuffles.
Don't conflate the DAG and lineage; one schedules forward, one recovers backward.
Don't assume a filter pushed down; confirm it in the plan, do not hope.
Don't add a wide operation casually; each one cuts a stage and pays for a shuffle.

❯❯❯PUTTING IT ALL TOGETHER

> You open the Spark UI for a report job that got slow and see it ran in four stages, with the third stage taking most of the wall-clock time. You have the code and the explain output in front of you.

Four stages tells you immediately there were three shuffles, so you look at the three wide operations in your code.

You read the physical plan bottom up and count three Exchange nodes, confirming the three shuffles the stage count implied.

The slow third stage sits just after a groupBy Exchange, so that aggregation's shuffle is where the time is going.

You check the scan at the bottom for a PushedFilter and find your date filter did not push down, so the shuffle is moving far more data than it needs to.

KEY TAKEAWAYS

The DAG is your lazy chain as a graph; narrow edges chain cheaply, wide edges cut it and shuffle.

Stages equal shuffles plus one; counting stages in the UI counts your shuffles and finds the cost.

Read a physical plan bottom up: each Exchange is a shuffle, and a PushedFilter means you read less.

Narrow operators in a stage fuse into one pipelined pass; a wide operation ends the pipeline.

The DAG and lineage are one dependency graph used two ways: forward to schedule, backward to recover.

The shape of the graph is the map of where your time goes.

Category: Spark
Difficulty: intermediate
Duration: 14 minutes
Challenges: 3 hands-on challenges

Topics covered: The DAG: Your Plan as a Graph, Counting Stages Is Counting Shuffles, Logical vs Physical Plan: Reading explain(), Pipelining: Why Narrow Ops Are Nearly Free, DAG vs Lineage: The Plan and the Recovery History

Lesson Sections

The DAG: Your Plan as a Graph
When you build a chain of transformations, Spark records it as a DAG, a directed acyclic graph. Directed because the data flows one way, from source to result. Acyclic because it never loops back on itself. Each transformation you wrote is a node, and the edges show how data flows from one operation into the next. The DAG is the concrete form of the plan that laziness let Spark assemble: it is everything you described, captured as a graph, waiting for an action to execute it. The reason the DAG
Counting Stages Is Counting Shuffles
The DAG has a structure the Spark UI exposes directly, and it is the most useful thing to understand about reading a job: the graph is divided into stages, and the boundaries between stages fall at the wide operations. A stage is a run of work that needs no data movement, a maximal chain of narrow operations that can all pipeline together. The moment a wide operation appears and data must shuffle, the current stage ends and a new one begins on the far side of the shuffle. This gives you a rule y
Logical vs Physical Plan: Reading explain()
The DAG in the UI is the visual form of the plan; explain is the textual form, and it shows you two related things. When you call explain on a DataFrame, Spark prints the plan it intends to run, and at the detailed level it shows both the logical plan, what you asked for, and the physical plan, how Spark will actually execute it. The gap between the two is the optimisation that laziness made possible. You read a plan tree from the bottom up, because that is the order data flows: the leaves are t
Pipelining: Why Narrow Ops Are Nearly Free
Inside a single stage, between two shuffle boundaries, something elegant happens that explains why narrow operations cost almost nothing. Spark does not run your filter over the whole partition, write the result, then run your select over that, then write again. It fuses the narrow operations into a single pass: each row flows through the filter, then the select, then any other narrow steps, one row at a time, never landing in between. This is pipelining, and it is why a long chain of narrow ope
DAG vs Lineage: The Plan and the Recovery History
Two words get used loosely and even interchangeably, and a careful candidate keeps them apart: the DAG and the lineage. They are built from the same dependency information, but they are read for opposite purposes and in opposite directions. Confusing them is a common interview stumble, and distinguishing them cleanly lands well. The DAG is forward-looking. Spark reads it from the source toward the result to decide how to schedule the work: where the stages fall, what runs in parallel, which task