How a Spark Job Runs: Stages and Plans

You know the cast: driver, executors, partitions, tasks, slots. The beginner picture treats a job as one flat batch of tasks. It is not. A real job has internal structure: it is carved into stages, and the boundaries between stages are where almost all the cost lives. This tier assumes you know what a task and a partition are, and goes one level deeper into how the driver organizes them into stages, how parallelism actually plays out, and the launch-time knobs that shape the run.

Job, Stage, Task

Daily Life
Interviews
There are three levels, and they nest. A job is everything one action triggers. A job is split into stages. A stage is split into tasks. The vocabulary matters because the Spark UI is organized exactly this way, and when you debug a slow job you navigate jobs to stages to tasks to find the problem.
LevelWhat defines itHow many
JobOne action (count, write, collect)One per action you call
StageA run of work needing no data movementSplit at every shuffle boundary
TaskOne unit of work on one partitionOne per partition, per stage

The boundary that creates a new stage

A stage is the largest chunk of work Spark can run without moving data between machines. The instant an operation needs data from other partitions, Spark must end the current stage, redistribute the data across the network (a shuffle), and start a new stage. So the number of stages in your job is almost exactly the number of shuffles plus one. Counting stages is counting shuffles, and shuffles are the expensive part.

In SQL terms: a chain of WHERE and SELECT and computed columns can all run in one stage because each row's output depends only on that row. A GROUP BY or a JOIN needs to bring matching keys together from across the cluster, which forces a shuffle and therefore a stage boundary. You can predict your stage count by scanning your code for those operations.

Why Stages Exist At All

Daily Life
Interviews
Stages are not an arbitrary chunking. They exist because of a hard physical fact: some operations let each task work alone, and some force tasks to wait for each other. Spark draws the stage boundary exactly where independence ends.
Best Practice
    Avoid

      The barrier nobody mentions

      A stage boundary is also a synchronization barrier. The next stage cannot start until every task in the current stage has finished, because the shuffle that feeds it needs all the data to be written first. This is why one slow task, on one oversized partition, can stall an entire job: 199 tasks finished in 30 seconds, the 200th runs for 20 minutes, and the whole next stage waits on it. The barrier turns a single skewed partition into a job-wide delay.
      TIP

      Reading Parallelism

      Daily Life
      Interviews
      Now we make the wave arithmetic precise, because under-parallelism and over-parallelism are two of the most common reasons a job is slow, and they have opposite fixes. The number to watch is tasks-per-stage versus available slots.
      1Stage has 200 tasks(200 partitions).Cluster has 50 slots.200 / 50 = 4 waves of execution.If each task takes ~ 15 s, the stage takes ~ 60 s, even though no single task ran longer than 15 s.

      The two failure shapes

      alert
      Under-parallelism: fewer partitions than slots. You rented 200 slots, the stage has 40 tasks, and 160 slots sit idle. The fix is more partitions (repartition up), not more hardware.
      alert
      Over-parallelism: hundreds of thousands of tiny tasks. Each task has fixed scheduling overhead (launch, serialize, report back), and when the task does milliseconds of real work, that overhead dominates. The fix is fewer, larger partitions (coalesce down).
      check
      The sweet spot: tasks slightly outnumber slots (so every slot stays fed across a few waves) and each task does seconds of real work, not milliseconds.

      This is also why the default spark.sql.shuffle.partitions of 200 is a frequent culprit. Two hundred is fine for a few gigabytes and disastrous for a few terabytes (each post-shuffle partition becomes huge, tasks spill to disk) or for a few megabytes (200 near-empty tasks). The default is a starting guess, not a tuned value.

      Where the Driver Lives

      Daily Life
      Interviews
      The cluster manager is the layer that owns the machines and grants executors. You will run on one of three, and they are largely interchangeable from your code's point of view. What actually changes your debugging is the deploy mode: where the driver process physically runs.
      Cluster managerWhere you see itWhat it is
      YARNHadoop / EMR clustersThe classic Hadoop resource manager
      KubernetesModern cloud-native setupsExecutors run as pods
      StandaloneSmall or test clustersSpark's own built-in manager

      Client mode vs cluster mode, and why it bites you

      Best Practice
        Avoid
          TIP

          spark-submit and the Config Surface

          Daily Life
          Interviews
          Everything we have described is shaped by a handful of numbers you set when you launch the job. These are the levers. You do not need to memorize the whole config surface, but you must connect each of these to a concept you already learned, because that connection is exactly what an interviewer probes.
          1spark - submit \ my_job.py
          LeverWhat it controlsConcept it maps to
          --num-executorsHow many executor processesHow many machines do the work
          --executor-coresSlots per executorTasks running at once per executor
          --executor-memoryHeap each executor getsHow big a partition can be before it spills
          --driver-memoryHeap the driver getsHow much you can safely collect() back

          Why these numbers interact

          These are not four independent dials. num-executors times executor-cores is your total slots, which only helps if you have enough partitions to fill them. executor-memory has to cover the partitions a core is holding plus shuffle buffers, so cranking executor-cores without raising memory can cause spills or out-of-memory errors. The right way to answer a sizing question is to derive the numbers from the data: data size sets partition count, partition count and target wave count set slots, slots and partition size set memory. We go deep on that in the cluster sizing lesson; here, the point is that these knobs are the physical expression of the model you just built.

          Same numbers, configured many ways: spark-submit flags, a SparkSession.builder.config() call, a cluster default in the platform (Databricks, EMR), or spark-defaults.conf. The values mean the same thing wherever they are set. A common confusion is a flag being silently overridden by a platform default; when a setting seems ignored, check the precedence order.

          The boundaries between stages are where the cost lives.

          Category
          SPARK
          Difficulty
          intermediate
          Duration
          12 minutes
          Challenges
          7 hands-on challenges

          Topics covered: Job, Stage, Task, Why Stages Exist At All, Reading Parallelism, Where the Driver Lives, spark-submit and the Config Surface

          Lesson Sections

          1. Job, Stage, Task

            There are three levels, and they nest. A job is everything one action triggers. A job is split into stages. A stage is split into tasks. The vocabulary matters because the Spark UI is organized exactly this way, and when you debug a slow job you navigate jobs to stages to tasks to find the problem. The boundary that creates a new stage In SQL terms: a chain of WHERE and SELECT and computed columns can all run in one stage because each row's output depends only on that row. A GROUP BY or a JOIN n

          2. Why Stages Exist At All

            Stages are not an arbitrary chunking. They exist because of a hard physical fact: some operations let each task work alone, and some force tasks to wait for each other. Spark draws the stage boundary exactly where independence ends. The barrier nobody mentions A stage boundary is also a synchronization barrier. The next stage cannot start until every task in the current stage has finished, because the shuffle that feeds it needs all the data to be written first. This is why one slow task, on one

          3. Reading Parallelism

            Now we make the wave arithmetic precise, because under-parallelism and over-parallelism are two of the most common reasons a job is slow, and they have opposite fixes. The number to watch is tasks-per-stage versus available slots. The two failure shapes This is also why the default spark.sql.shuffle.partitions of 200 is a frequent culprit. Two hundred is fine for a few gigabytes and disastrous for a few terabytes (each post-shuffle partition becomes huge, tasks spill to disk) or for a few megaby

          4. Where the Driver Lives

            The cluster manager is the layer that owns the machines and grants executors. You will run on one of three, and they are largely interchangeable from your code's point of view. What actually changes your debugging is the deploy mode: where the driver process physically runs. Client mode vs cluster mode, and why it bites you If your job runs fine in a notebook but mysteriously hangs or runs out of memory as a scheduled job, suspect the deploy mode. In client mode your driver is your laptop, with

          5. spark-submit and the Config Surface

            Everything we have described is shaped by a handful of numbers you set when you launch the job. These are the levers. You do not need to memorize the whole config surface, but you must connect each of these to a concept you already learned, because that connection is exactly what an interviewer probes. Why these numbers interact These are not four independent dials. num-executors times executor-cores is your total slots, which only helps if you have enough partitions to fill them. executor-memor