How a Spark Job Runs: Stages and Plans
Job, Stage, Task
| Level | What defines it | How many |
|---|---|---|
| Job | One action (count, write, collect) | One per action you call |
| Stage | A run of work needing no data movement | Split at every shuffle boundary |
| Task | One unit of work on one partition | One per partition, per stage |
The boundary that creates a new stage
A stage is the largest chunk of work Spark can run without moving data between machines. The instant an operation needs data from other partitions, Spark must end the current stage, redistribute the data across the network (a shuffle), and start a new stage. So the number of stages in your job is almost exactly the number of shuffles plus one. Counting stages is counting shuffles, and shuffles are the expensive part.
In SQL terms: a chain of WHERE filters, SELECT projections, and computed columns can all run in one stage because each row's output depends only on that row. A GROUP BY or a JOIN needs to bring matching keys together from across the cluster, which forces a shuffle and therefore a stage boundary (the main exception is a broadcast join, where a small table is copied to every executor instead). You can predict your stage count by scanning your code for those operations.
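The scan-for-shuffles rule can be sketched in plain Python. This is an illustration only, not Spark's planner: the `SHUFFLE_OPS` set, the `estimate_stages` helper, and the operation names in `pipeline` are assumptions made up for the example, and a real physical plan can differ from this estimate.

```python
# Hypothetical set of operations that normally force a shuffle.
# A join is counted once here, although a sort-merge join usually
# shuffles each of its two inputs, and a broadcast join shuffles neither.
SHUFFLE_OPS = {"groupBy", "join", "distinct", "repartition", "orderBy"}


def estimate_stages(ops):
    """Estimate the stage count as (shuffle-forcing operations) + 1."""
    shuffles = sum(1 for op in ops if op in SHUFFLE_OPS)
    return shuffles + 1


# Row-local work, then an aggregation, then more row-local work, then a sort.
pipeline = ["filter", "select", "withColumn", "groupBy", "filter", "orderBy"]
print(estimate_stages(pipeline))
```

The estimate is a first guess to compare against what the Spark UI actually reports for the job.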
Why Stages Exist At All
The barrier nobody mentions
Reading Parallelism
The two failure shapes
This is also why the default spark.sql.shuffle.partitions of 200 is a frequent culprit. Two hundred is fine for a few gigabytes and disastrous for a few terabytes (each post-shuffle partition becomes huge, tasks spill to disk) or for a few megabytes (200 near-empty tasks). The default is a starting guess, not a tuned value.
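A rough sketch of that arithmetic, in plain Python, shows why one fixed value cannot suit every data size. The helper names are hypothetical, and the 128 MiB target partition size is an assumed rule of thumb for illustration, not a Spark default.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def avg_partition_mib(shuffle_bytes, num_partitions):
    """Average size of one post-shuffle partition, in MiB."""
    return shuffle_bytes / num_partitions / MIB


def suggest_partitions(shuffle_bytes, target_mib=128):
    """Partition count that would put roughly target_mib in each partition."""
    return max(1, math.ceil(shuffle_bytes / (target_mib * MIB)))


# Compare the default of 200 against the assumed 128 MiB target.
for label, size in [("5 MiB", 5 * MIB), ("5 GiB", 5 * GIB), ("5 TiB", 5 * TIB)]:
    per_partition = avg_partition_mib(size, 200)
    print(label, f"{per_partition:.3f} MiB per partition at 200,",
          "suggested count:", suggest_partitions(size))
```

Under these assumptions, 200 partitions leave a 5 MiB shuffle with near-empty tasks and a 5 TiB shuffle with partitions of tens of gigabytes each, which matches the two failure cases described above.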
Where the Driver Lives
| Cluster manager | Where you see it | What it is |
|---|---|---|
| YARN | Hadoop / EMR clusters | The classic Hadoop resource manager |
| Kubernetes | Modern cloud-native setups | Executors run as pods |
| Standalone | Small or test clusters | Spark's own built-in manager |
Client mode vs cluster mode, and why it bites you
spark-submit and the Config Surface
| Lever | What it controls | Concept it maps to |
|---|---|---|
| --num-executors | How many executor processes | How many workers share the job (several executors can run on one machine) |
| --executor-cores | Slots per executor | Tasks running at once per executor |
| --executor-memory | Heap each executor gets | How big a partition can be before it spills |
| --driver-memory | Heap the driver gets | How much you can safely collect() back |
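To connect the first two levers to task scheduling, here is a minimal sketch of the slot arithmetic. It assumes each task occupies one core (Spark's default of `spark.task.cpus=1`); the function names and the example numbers are hypothetical.

```python
import math


def total_slots(num_executors, executor_cores):
    """Tasks the cluster can run at once, assuming one core per task."""
    return num_executors * executor_cores


def waves(num_tasks, slots):
    """Rounds of scheduling needed to run every task in one stage."""
    return math.ceil(num_tasks / slots)


slots = total_slots(num_executors=10, executor_cores=4)
print("slots:", slots)
print("waves for 200 tasks:", waves(200, slots))
print("waves for 8 tasks:", waves(8, slots))
```

With 40 slots, a 200-task stage runs in 5 waves, while an 8-task stage finishes in one wave and leaves 32 slots idle, so adding executors only helps when there are enough partitions to fill them.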
Why these numbers interact
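The same numbers can be set in several places: spark-submit flags, a SparkSession.builder.config() call, a cluster default in the platform (Databricks, EMR), or spark-defaults.conf. The values mean the same thing wherever they are set. A common confusion is a flag being silently overridden by a platform default; when a setting seems ignored, check the precedence order. In open-source Spark, values set in code take priority over spark-submit flags, which take priority over spark-defaults.conf; managed platforms can add layers of their own.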
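One way to reason about an ignored setting is to model the layers as dictionaries merged from lowest to highest priority. The sketch below assumes the usual open-source ordering (spark-defaults.conf, then spark-submit flags, then values set in code); the `resolve_config` helper and the example values are hypothetical, and a managed platform may insert its own defaults at a different point.

```python
def resolve_config(defaults_file, submit_flags, in_code):
    """Merge config layers; later layers override earlier ones."""
    resolved = {}
    for layer in (defaults_file, submit_flags, in_code):
        resolved.update(layer)
    return resolved


# Hypothetical values for each layer.
defaults_file = {"spark.executor.memory": "4g",
                 "spark.sql.shuffle.partitions": "200"}
submit_flags = {"spark.executor.memory": "8g"}
in_code = {"spark.sql.shuffle.partitions": "64"}

effective = resolve_config(defaults_file, submit_flags, in_code)
for key, value in sorted(effective.items()):
    print(key, "=", value)
```

Here the flag wins for executor memory and the in-code value wins for shuffle partitions. On a running job, the Environment tab of the Spark UI lists the values that actually took effect.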
The boundaries between stages are where the cost lives.
Lesson Sections
- Job, Stage, Task
There are three levels, and they nest. A job is everything one action triggers. A job is split into stages. A stage is split into tasks. The vocabulary matters because the Spark UI is organized exactly this way, and when you debug a slow job you navigate jobs to stages to tasks to find the problem.
- Why Stages Exist At All
Stages are not an arbitrary chunking. They exist because of a hard physical fact: some operations let each task work alone, and some force tasks to wait for each other. Spark draws the stage boundary exactly where independence ends. The barrier nobody mentions A stage boundary is also a synchronization barrier. The next stage cannot start until every task in the current stage has finished, because the shuffle that feeds it needs all the data to be written first. This is why one slow task, on one
- Reading Parallelism
Now we make the wave arithmetic precise, because under-parallelism and over-parallelism are two of the most common reasons a job is slow, and they have opposite fixes. The number to watch is tasks-per-stage versus available slots.
- Where the Driver Lives
The cluster manager is the layer that owns the machines and grants executors. You will run on one of three, and they are largely interchangeable from your code's point of view. What actually changes your debugging is the deploy mode: where the driver process physically runs. Client mode vs cluster mode, and why it bites you If your job runs fine in a notebook but mysteriously hangs or runs out of memory as a scheduled job, suspect the deploy mode. In client mode your driver is your laptop, with
- spark-submit and the Config Surface
Everything we have described is shaped by a handful of numbers you set when you launch the job. These are the levers. You do not need to memorize the whole config surface, but you must connect each of these to a concept you already learned, because that connection is exactly what an interviewer probes. Why these numbers interact These are not four independent dials. num-executors times executor-cores is your total slots, which only helps if you have enough partitions to fill them. executor-memor