How a Spark Job Runs: Stages and Plans

A one-line change to where a job's stage boundaries fall has cut real-world Spark runtimes from hours to minutes, and the people who make that change are reading the same execution model you met in the beginner tier, one level deeper. You know the cast: driver, executors, partitions, tasks, slots. The beginner picture treats a job as one flat batch of tasks. It is not. A real job is carved into stages, and the boundaries between stages are where almost all the cost lives. This tier goes into how the driver organizes tasks into stages, how parallelism actually plays out, and the launch-time knobs that shape the run.

Job, Stage, Task

There are three levels, and they nest. A job is everything one action triggers. A job is split into stages. A stage is split into tasks. The vocabulary matters because the Spark UI is organized exactly this way, and when you debug a slow job you navigate jobs to stages to tasks to find the problem.

Level	What defines it	How many
Job	One action (count, write, collect)	One per action you call
Stage	A run of work needing no data movement	Split at every shuffle boundary
Task	One unit of work on one partition	One per partition, per stage
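The nesting in the table can be modeled in a few lines of plain Python. This is only an illustrative sketch: the `Job` and `Stage` classes and the partition counts are hypothetical and are not Spark APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    stage_id: int
    num_partitions: int  # hypothetical partition count for this stage

    @property
    def num_tasks(self):
        # One task per partition, per stage.
        return self.num_partitions


@dataclass
class Job:
    action: str  # the action that triggered the job, e.g. "count"
    stages: list = field(default_factory=list)

    @property
    def total_tasks(self):
        return sum(stage.num_tasks for stage in self.stages)


# A made-up job: one 40-partition read stage, then two 200-partition stages.
job = Job("count", [Stage(0, 40), Stage(1, 200), Stage(2, 200)])
print(len(job.stages), job.total_tasks)  # 3 440
```

Reading the Spark UI follows the same path: pick the job, open its stages, then look at the tasks inside the slow stage.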

The boundary that creates a new stage

A stage is the largest chunk of work Spark can run without moving data between machines. The instant an operation needs data from other partitions, Spark must end the current stage, redistribute the data across the network (a shuffle), and start a new stage. So the number of stages in your job is almost exactly the number of shuffles plus one. Counting stages is counting shuffles, and shuffles are the expensive part.

In SQL terms: a chain of WHERE and SELECT and computed columns can all run in one stage because each row's output depends only on that row. A GROUP BY or a JOIN needs to bring matching keys together from across the cluster, which forces a shuffle and therefore a stage boundary. You can predict your stage count by scanning your code for those operations.

(order_items
   .join(products, "product_id")
   .groupBy("category")
   .agg(F.countDistinct("order_id").alias("orders"))
   .orderBy(F.col("orders").desc()))
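The "shuffles plus one" rule of thumb can be sketched as a small counter over method names. This is a rough heuristic in plain Python, not how Spark plans a query: the `WIDE_OPS` set is an incomplete, illustrative list, and a real plan can differ (for example, a broadcast join avoids the shuffle).

```python
# Illustrative, incomplete list of operations that usually force a shuffle.
WIDE_OPS = {"groupBy", "join", "distinct", "repartition", "orderBy"}


def estimate_stages(ops):
    """Estimate the stage count for a chain of DataFrame method names."""
    shuffles = sum(1 for op in ops if op in WIDE_OPS)
    return shuffles + 1


narrow_only = ["filter", "withColumn", "select"]
join_then_group = ["join", "groupBy", "agg"]

print(estimate_stages(narrow_only))      # 1
print(estimate_stages(join_then_group))  # 3
```

Treat the result as a prediction to check against the Spark UI, not as a guarantee.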

> This job already has a join (one shuffle). Add the SECOND wide op that creates the second shuffle boundary, taking the job to three stages.

(order_items
   .join(products, "product_id")
   .___("category")
   .agg(F.countDistinct("order_id").alias("orders")))

groupBy

select

filter

withColumn

Why Stages Exist At All

Stages are not an arbitrary chunking. They exist because of a hard physical fact: some operations let each task work alone, and some force tasks to wait for each other. Spark draws the stage boundary exactly where independence ends.

• Stays in one stage

Each output partition needs one input partition
filter, select, map, withColumn
Tasks never talk to each other
Runs pipelined, nearly free

• Forces a new stage

An output partition needs many input partitions
groupBy, join, distinct, repartition
Data must cross the network first
Stage cannot start until the prior one fully finishes

The barrier nobody mentions

A stage boundary is also a synchronization barrier. The next stage cannot start until every task in the current stage has finished, because the shuffle that feeds it needs all the data to be written first. This is why one slow task, on one oversized partition, can stall an entire job: 199 tasks finished in 30 seconds, the 200th runs for 20 minutes, and the whole next stage waits on it. The barrier turns a single skewed partition into a job-wide delay.
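The barrier effect is easy to reproduce with a toy scheduler. The sketch below is a simplified model in plain Python, not Spark's scheduler: it assumes each task goes to whichever slot frees up first, and the durations are the ones from the example above.

```python
import heapq


def stage_duration(task_seconds, slots):
    """Return when the last task finishes; the next stage waits until then."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    if not task_seconds:
        return 0.0
    # free_at holds the time at which each slot becomes available.
    free_at = [0.0] * min(slots, len(task_seconds))
    heapq.heapify(free_at)
    for seconds in task_seconds:
        start = heapq.heappop(free_at)          # earliest free slot
        heapq.heappush(free_at, start + seconds)
    return max(free_at)


even = [30] * 200
skewed = [30] * 199 + [20 * 60]  # one straggler running 20 minutes

print(stage_duration(even, slots=200))    # 30.0
print(stage_duration(skewed, slots=200))  # 1200.0
```

Under this model, 199 of the 200 slots sit idle for almost the entire stage, which is the same picture the Spark UI shows for a skewed stage.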

TIP

Pipelining is the reward for staying narrow. Inside a stage, Spark fuses filter into select into withColumn so a row flows through all of them in one pass, never landing on disk. That is why ten narrow operations cost about the same as one. The expensive thing is not the number of operations; it is the number of stage boundaries.
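Pipelining can be mimicked with Python generators. This is a loose analogy, not Spark internals: the sample rows and the three helper functions are made up for illustration, and each row passes through all three steps before the next row is read, with no intermediate list ever built.

```python
rows = [
    {"order_id": 1, "quantity": 3, "unit_price": 2.0},
    {"order_id": 2, "quantity": 1, "unit_price": 9.0},
    {"order_id": 3, "quantity": 5, "unit_price": 1.5},
]


def keep_bulk(rows):
    # Analogous to .filter(F.col("quantity") >= 3)
    return (r for r in rows if r["quantity"] >= 3)


def add_line_total(rows):
    # Analogous to .withColumn("line_total", quantity * unit_price)
    return ({**r, "line_total": r["quantity"] * r["unit_price"]} for r in rows)


def pick_columns(rows):
    # Analogous to .select("order_id", "line_total")
    return ({"order_id": r["order_id"], "line_total": r["line_total"]} for r in rows)


# Nothing runs until the result is consumed, then rows flow through in one pass.
pipeline = pick_columns(add_line_total(keep_bulk(rows)))
result = list(pipeline)
print(result)
```

Adding a fourth or fifth narrow step adds one more function call per row, not another pass over the data, which is why narrow operations are nearly free to stack.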

(order_items
   .filter(F.col("quantity") >= 3)
   .withColumn("line_total", F.col("quantity") * F.col("unit_price"))
   .select("order_id", "line_total")
   .orderBy(F.col("line_total").desc()))

> Only ONE of these ops forces a new stage (a shuffle). Fill the blank with the WIDE op; the narrow ones stay in one stage.

(order_items
   .filter(F.col("quantity") >= 1)
   .___("user_id")
   .agg(F.sum("quantity").alias("q")))

groupBy

filter

select

withColumn

Reading Parallelism

Now we make the wave arithmetic precise, because under-parallelism and over-parallelism are two of the most common reasons a job is slow, and they have opposite fixes. The number to watch is tasks-per-stage versus available slots.

Worked example: 200 tasks on 50 slots run in 200 / 50 = 4 waves.

The two failure shapes

Under-parallelism: fewer partitions than slots. You rented 200 slots, the stage has 40 tasks, and 160 slots sit idle. The fix is more partitions (repartition up), not more hardware.

Over-parallelism: hundreds of thousands of tiny tasks. Each task has fixed scheduling overhead (launch, serialize, report back), and when the task does milliseconds of real work, that overhead dominates. The fix is fewer, larger partitions (coalesce down).

The sweet spot: tasks slightly outnumber slots (so every slot stays fed across a few waves) and each task does seconds of real work, not milliseconds.
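The wave arithmetic behind these shapes fits in a small helper. This is a minimal sketch in plain Python; `parallelism_report` is a hypothetical name, and it assumes every task in a wave takes about the same time.

```python
import math


def parallelism_report(tasks, slots):
    """Compare tasks in a stage against the slots available to run them."""
    waves = math.ceil(tasks / slots)
    # Slots left without work in the final (or only) wave.
    idle_in_last_wave = waves * slots - tasks
    return {"waves": waves, "idle_slots_in_last_wave": idle_in_last_wave}


print(parallelism_report(tasks=200, slots=50))
# {'waves': 4, 'idle_slots_in_last_wave': 0}
print(parallelism_report(tasks=40, slots=200))
# {'waves': 1, 'idle_slots_in_last_wave': 160}
```

A single wave with many idle slots is the under-parallelism shape; thousands of waves of millisecond tasks is the over-parallelism shape.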

This is also why the default spark.sql.shuffle.partitions of 200 is a frequent culprit. Two hundred is fine for a few gigabytes and disastrous for a few terabytes (each post-shuffle partition becomes huge, tasks spill to disk) or for a few megabytes (200 near-empty tasks). The default is a starting guess, not a tuned value.
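To see why 200 fails at both ends, divide the shuffled data by the partition count. The sketch below is plain Python arithmetic; the 128 MiB target partition size is an assumed rule of thumb, not a Spark default for this setting, and the data sizes are made up.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def per_partition_bytes(shuffle_bytes, partitions=200):
    """Average post-shuffle partition size for a given partition count."""
    return shuffle_bytes / partitions


def partitions_for_target(shuffle_bytes, target_bytes=128 * MIB):
    """Partition count that yields roughly target_bytes per partition."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))


# 2 TiB split 200 ways: each partition is on the order of 10 GiB.
print(round(per_partition_bytes(2 * TIB) / GIB, 2))
print(partitions_for_target(2 * TIB))   # 16384
print(partitions_for_target(50 * MIB))  # 1
```

The resulting number is what you would pass to spark.sql.shuffle.partitions, after checking it against the slot count so the work still arrives in a sensible number of waves.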

> You have 40 partitions but 200 slots, so 160 sit idle. Fill the blank with the op that RAISES the partition count to fill the slots (the fix for under-parallelism).

order_items.___(200)

repartition

coalesce

limit

cache

Where the Driver Lives

The cluster manager is the layer that owns the machines and grants executors. You will run on one of three, and they are largely interchangeable from your code's point of view. What actually changes your debugging is the deploy mode: where the driver process physically runs.

Cluster manager	Where you see it	What it is
YARN	Hadoop / EMR clusters	The classic Hadoop resource manager
Kubernetes	Modern cloud-native setups	Executors run as pods
Standalone	Small or test clusters	Spark's own built-in manager

Client mode vs cluster mode, and why it bites you

• Client mode

Driver runs where you launched it (your laptop, an edge node)
Driver logs print to your terminal
Network round-trips to executors on every collect
Good for interactive notebooks; fragile for production

• Cluster mode

Driver runs on a worker inside the cluster
Driver logs live in the cluster, not your screen
Driver sits next to the executors, low latency
The standard for scheduled production jobs

TIP

If your job runs fine in a notebook but mysteriously hangs or runs out of memory as a scheduled job, suspect the deploy mode. In client mode the driver is the machine you launched from, often a laptop with laptop memory and a flaky connection. In cluster mode the driver is a process inside the cluster with only the heap that --driver-memory grants it, so a collect() that worked at your desk on a sample can OOM the driver when the same code runs as a scheduled job against full production data.

Multiple Choice

Your job runs fine in a notebook but OOMs the driver as a scheduled job. Which deploy mode runs the driver INSIDE the cluster (the production mode), not on the machine that launched it?

spark-submit and the Config Surface

Everything we have described is shaped by a handful of numbers you set when you launch the job. These are the levers. You do not need to memorize the whole config surface, but you must connect each of these to a concept you already learned, because that connection is exactly what an interviewer probes.

spark-submit \
  my_job.py

Lever	What it controls	Concept it maps to
--num-executors	How many executor processes	How many worker processes share the work
--executor-cores	Slots per executor	Tasks running at once per executor
--executor-memory	Heap each executor gets	How big a partition can be before it spills
--driver-memory	Heap the driver gets	How much you can safely collect() back
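One way to keep the four levers visible in a scheduled job is to assemble the command in code. The sketch below builds the argument list in plain Python; the flags are real spark-submit options, but `build_spark_submit` is a hypothetical helper and the memory values are placeholders, not recommendations.

```python
import shlex


def build_spark_submit(app, num_executors, executor_cores,
                       executor_memory, driver_memory, deploy_mode="cluster"):
    """Return the spark-submit argv for the given sizing choices."""
    return [
        "spark-submit",
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        "--driver-memory", driver_memory,
        app,
    ]


# 10 executors x 5 cores = 50 slots; the memory sizes are made-up examples.
argv = build_spark_submit("my_job.py", 10, 5, "8g", "4g")
print(shlex.join(argv))
```

The list form can be handed to subprocess.run(argv) unchanged, which avoids shell quoting mistakes.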

Why these numbers interact

These are not four independent dials. num-executors times executor-cores is your total slots, which only helps if you have enough partitions to fill them. executor-memory has to cover the partitions a core is holding plus shuffle buffers, so cranking executor-cores without raising memory can cause spills or out-of-memory errors. The right way to answer a sizing question is to derive the numbers from the data: data size sets partition count, partition count and target wave count set slots, slots and partition size set memory. We go deep on that in the cluster sizing lesson; here, the point is that these knobs are the physical expression of the model you just built.
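The derivation chain (data size, then partitions, then slots, then memory) can be written out as arithmetic. This is a sketch under loudly stated assumptions: the 128 MiB target partition size, the 3 target waves, the 5 cores per executor, and the 3x memory headroom factor are all illustrative defaults chosen for the example, not values from Spark or from this lesson.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def derive_sizing(data_bytes, target_partition_bytes=128 * MIB,
                  target_waves=3, cores_per_executor=5, memory_headroom=3.0):
    """Derive executor counts and memory from data size (all knobs assumed)."""
    # Data size sets partition count.
    partitions = max(1, math.ceil(data_bytes / target_partition_bytes))
    # Partition count and target wave count set slots.
    slots = math.ceil(partitions / target_waves)
    executors = math.ceil(slots / cores_per_executor)
    # Slots and partition size set memory: each core holds one partition at a
    # time, and the headroom factor stands in for shuffle buffers and overhead.
    executor_memory_mib = math.ceil(
        cores_per_executor * target_partition_bytes * memory_headroom / MIB
    )
    return {
        "partitions": partitions,
        "slots": slots,
        "executors": executors,
        "executor_memory_mib": executor_memory_mib,
    }


print(derive_sizing(100 * GIB))
# {'partitions': 800, 'slots': 267, 'executors': 54, 'executor_memory_mib': 1920}
```

Note what happens if cores_per_executor rises while memory stays fixed: more partitions are in flight per executor against the same heap, which is the spill scenario described above.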

Same numbers, configured many ways: spark-submit flags, a SparkSession.builder.config() call, a cluster default in the platform (Databricks, EMR), or spark-defaults.conf. The values mean the same thing wherever they are set. A common confusion is a flag being silently overridden by a platform default; when a setting seems ignored, check the precedence order.
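Spark's documented precedence is: properties set directly in code (SparkConf or the session builder) win, then spark-submit flags, then spark-defaults.conf. That lookup order can be modeled with a ChainMap. The sketch below is plain Python with made-up values; where a platform default (Databricks, EMR) lands in this order depends on the platform, so that layer is left out as an unknown.

```python
from collections import ChainMap

# Hypothetical values in three of the places a setting can come from.
spark_defaults_conf = {
    "spark.sql.shuffle.partitions": "200",
    "spark.executor.memory": "4g",
}
submit_flags = {"spark.executor.memory": "8g"}
set_in_code = {"spark.sql.shuffle.partitions": "2000"}

# ChainMap searches left to right, so the highest-precedence source goes first.
effective = ChainMap(set_in_code, submit_flags, spark_defaults_conf)

for key in sorted(effective):
    print(key, "=", effective[key])
```

On a live session, comparing this expectation with what spark.conf.get(key) returns is a quick way to spot a silently overridden setting.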

> You will reuse this filtered DataFrame in several actions. Fill the blank with the op that keeps it in memory so each action does not recompute the whole chain.

order_items.filter(F.col("quantity") >= 2).___()

cache

repartition

collect

count

PUTTING IT ALL TOGETHER

> You are handed a PySpark job that reads a few terabytes, filters, then joins to a dimension table and aggregates. It runs on 10 executors with 5 cores each and nobody has touched the defaults. It takes hours, and in the Spark UI one stage shows 199 tasks done in 30 seconds and one still running after 20 minutes. The interviewer asks what is happening and what you would change.

Read the shape first: the JOIN and the GROUP BY each force a shuffle and therefore a stage boundary, while the filter and computed columns fold into one stage, so you can predict the stage count straight from the code.

A stage boundary is a synchronization barrier, so the 200th task running for 20 minutes holds the entire next stage hostage, which is how one oversized partition becomes a job-wide delay.

The 200 task count is the spark.sql.shuffle.partitions default, a starting guess that is disastrous at terabyte scale because each post-shuffle partition becomes huge and tasks spill to disk.

Compare tasks per stage against the 50 available slots (10 executors times 5 cores), and raise the shuffle partition count so the work arrives as many moderate partitions spread over reasonable waves instead of 200 enormous ones.

Derive the sizing rather than guessing: data size sets partition count, partition count and target wave count set slots, and slots plus partition size set the executor-memory value, since raising cores without raising memory buys spills and out-of-memory errors.

Before blaming the config, confirm the values you set are the values in effect, because a spark-submit flag silently overridden by a platform default in Databricks or EMR looks exactly like a setting that does nothing.

KEY TAKEAWAYS

The three levels nest as job, stage, task: an action triggers a job, a job splits into stages, a stage splits into tasks, and the Spark UI is organized the same way so you navigate down that chain to find a slow job's cause.

Narrow work such as WHERE, SELECT, and computed columns stays in one stage because each row's output depends only on that row, while a GROUP BY or JOIN must bring keys together and forces a shuffle boundary.

A stage boundary is a synchronization barrier: the next stage cannot start until every task finishes writing its shuffle output, so 199 fast tasks plus one 20-minute task means the whole job waits 20 minutes.

Both under-parallelism and over-parallelism make jobs slow and they have opposite fixes, so the number to watch is tasks per stage against available slots rather than raw cluster size.

The spark.sql.shuffle.partitions default of 200 is a guess, not a tuned value: too few for terabytes means huge partitions that spill, too many for megabytes means 200 near-empty tasks.

Deploy mode decides where the driver process physically runs, which is what actually changes your debugging, while the cluster manager itself is largely interchangeable from your code's point of view.

The boundaries between stages are where the cost lives.

Category: Spark
Difficulty: intermediate
Duration: 12 minutes
Challenges: 7 hands-on challenges

Topics covered: Job, Stage, Task, Why Stages Exist At All, Reading Parallelism, Where the Driver Lives, spark-submit and the Config Surface
