How a Spark Job Runs: Scheduler Internals

You can decompose any job into stages and tasks and reason about parallelism and deploy mode. That is enough to optimize most jobs. This advanced tier is about the failure edges: how the driver actually schedules and dispatches tasks, what happens when a task or a node dies, and the cases where the driver itself, not the executors, becomes the thing that is slow. This is the layer that separates 'I tuned the partitions' from 'I understand why the job behaves this way under failure and at scale.'

DAGScheduler vs TaskScheduler

Daily Life
Interviews
Inside the driver there are two schedulers, and they have a clean division of labor. Knowing the split is what lets you answer 'how does Spark turn my code into running tasks' without hand-waving.
Best Practice
    Avoid

      The handoff

      The flow is: your transformations become a logical plan, the DAGScheduler slices that plan into stages and orders them by their shuffle dependencies, and then for each ready stage it hands a TaskSet to the TaskScheduler, which dispatches the individual tasks to slots. The DAGScheduler decides what runs and in what order at the stage level; the TaskScheduler decides where each task runs and what to do when one fails. Every advanced behavior we cover next lives in one of these two: stage ordering and retry-of-a-whole-stage in the DAGScheduler, task placement and per-task retry in the TaskScheduler.

      This is also why a job can show many stages 'pending' while one runs: the DAGScheduler will not submit a downstream stage until its parent's shuffle output exists. The dependency graph, not your code order, dictates what is allowed to run.

      Task Failure and Retry

      Daily Life
      Interviews
      In a single database, a failed query just fails. In a cluster of hundreds of machines, something failing is normal: a node gets preempted, a network blip drops a connection, an executor runs out of memory. Spark is built to absorb that. A failed task is retried automatically, by default up to spark.task.maxFailures (4) times, before the whole stage, and then the job, is declared failed.
      check
      A single task fails: the TaskScheduler reruns just that task, possibly on a different executor. Because partitions are recomputed from their inputs, this is safe and cheap.
      alert
      The same task keeps failing on one node: Spark can blacklist (exclude) that node so retries land elsewhere, isolating a sick machine instead of looping on it.
      alert
      An executor dies mid-stage: every task it was running, and any shuffle output it held, must be recomputed elsewhere. This is why losing one executor late in a stage can be expensive.
      alert
      A task fails maxFailures times: the stage fails, and the whole job aborts. The retry budget is per task, not infinite.

      The footgun: retries plus side effects

      Automatic retry is only safe because a recomputed partition produces the same result. The moment a task has a side effect that is not idempotent, for example writing to an external system row by row, a retry can double-write. This is the deep reason Spark pushes you toward idempotent, atomic writes: the engine assumes any task may run more than once. A task that is not safe to re-run is a bug waiting for the first node failure to expose it. We cover the write-side discipline in the fault tolerance lesson; here the takeaway is that retry is a feature you must design for, not just rely on.
      TIP

      Speculative Execution

      Daily Life
      Interviews
      Recall the barrier: a stage cannot finish until its slowest task does. Speculative execution is Spark's answer to a straggler caused by a bad machine. When a task runs far longer than its peers, Spark can launch a second copy of it on a different executor, and whichever finishes first wins; the other is killed. It is a bet that the slowness is the machine, not the data.
      Best Practice
        Avoid

          The interview-grade distinction

          This is the cleanest way to show you understand the difference between a slow node and a slow partition. Speculation re-runs the same work elsewhere, so it only helps when the work itself is fine and the location is bad. Skew is the opposite: the work is genuinely larger, so re-running it changes nothing. Candidates who suggest 'turn on speculative execution' to fix skew reveal they have conflated the two. The honest answer is that speculation is for stragglers caused by the environment, and skew needs a data-layout fix.

          The Driver as Bottleneck

          Daily Life
          Interviews
          We started by saying the driver does not touch your data. That is true, and yet the driver can still be the thing that makes your job slow or kills it. Because it is the single coordinator, a few patterns route real load through it, and at scale that load matters. This is the most underappreciated section in the whole lesson, because people instinctively blame the executors.
          alert
          collect() pulls every result row to the driver's memory. On a big result this OOMs the driver instantly. The driver has one machine's memory; the executors have the whole cluster's.
          alert
          Broadcasting a large variable: the driver assembles the broadcast and ships it to every executor. A too-large broadcast strains the driver and the network.
          alert
          Very high task counts: the driver schedules, tracks, and collects status for every task. Millions of tiny tasks (the over-parallelism case) flood the driver with bookkeeping, and it becomes the bottleneck even though each task is trivial.
          alert
          Many small jobs in a tight loop: each action is a round-trip through the driver. A Python loop firing thousands of tiny actions serializes the cluster behind the driver.

          The diagnosis tell

          When executors look idle but the job is not progressing, look at the driver. Idle executors plus a busy driver is the signature of a driver bottleneck: the cluster is waiting to be told what to do. The fixes are structural, not just more hardware: avoid collect() on large data (write to storage instead), keep broadcasts small, and keep task counts sane so the driver is not drowning in scheduling overhead. A staff-level answer names the driver as a possible bottleneck unprompted, because most people forget it can be one.
          TIP

          Locality and Scheduling Delay

          Daily Life
          Interviews
          The last piece of the scheduler is locality: when the TaskScheduler places a task, it prefers an executor that already has the task's data nearby. Moving compute to the data is cheaper than moving data to the compute, so Spark ranks placement options and tries the best one first.
          Locality levelMeaningCost
          PROCESS_LOCALData is in the same executor's memoryCheapest; no movement
          NODE_LOCALData is on the same machine, different processCheap; a local read
          RACK_LOCALData is on the same rackSome network
          ANYData is anywhere; fetch it over the networkMost expensive

          The delay knob that can quietly cost you

          When the ideal executor is busy, the TaskScheduler faces a choice: wait a moment for it to free up, or give up locality and run the task somewhere it has to fetch the data. The wait is governed by spark.locality.wait (3 seconds by default). On most jobs this is invisible. But if many tasks are all waiting out their locality timeout, you can see a stage that is mysteriously slow with idle-looking executors: everyone is politely waiting for a preferred slot. Lowering the wait trades locality for not stalling, which is the right trade when your data is already well-distributed and a remote read is cheap.

          Locality is mostly something you observe, not tune, until it becomes a problem. The reason it closes this lesson is that it ties the whole model together: the scheduler is constantly balancing parallelism (keep every slot busy) against locality (keep compute near data), and that balance is one more place a job's wall-clock time is decided before a single row is processed.

          The failure edges separate tuning from understanding.

          Category
          SPARK
          Difficulty
          advanced
          Duration
          14 minutes
          Challenges
          6 hands-on challenges

          Topics covered: DAGScheduler vs TaskScheduler, Task Failure and Retry, Speculative Execution, The Driver as Bottleneck, Locality and Scheduling Delay

          Lesson Sections

          1. DAGScheduler vs TaskScheduler

            Inside the driver there are two schedulers, and they have a clean division of labor. Knowing the split is what lets you answer 'how does Spark turn my code into running tasks' without hand-waving. The handoff The flow is: your transformations become a logical plan, the DAGScheduler slices that plan into stages and orders them by their shuffle dependencies, and then for each ready stage it hands a TaskSet to the TaskScheduler, which dispatches the individual tasks to slots. The DAGScheduler decid

          2. Task Failure and Retry

            In a single database, a failed query just fails. In a cluster of hundreds of machines, something failing is normal: a node gets preempted, a network blip drops a connection, an executor runs out of memory. Spark is built to absorb that. A failed task is retried automatically, by default up to spark.task.maxFailures (4) times, before the whole stage, and then the job, is declared failed. The footgun: retries plus side effects Automatic retry is only safe because a recomputed partition produces th

          3. Speculative Execution

            Recall the barrier: a stage cannot finish until its slowest task does. Speculative execution is Spark's answer to a straggler caused by a bad machine. When a task runs far longer than its peers, Spark can launch a second copy of it on a different executor, and whichever finishes first wins; the other is killed. It is a bet that the slowness is the machine, not the data. The interview-grade distinction This is the cleanest way to show you understand the difference between a slow node and a slow p

          4. The Driver as Bottleneck

            We started by saying the driver does not touch your data. That is true, and yet the driver can still be the thing that makes your job slow or kills it. Because it is the single coordinator, a few patterns route real load through it, and at scale that load matters. This is the most underappreciated section in the whole lesson, because people instinctively blame the executors. The diagnosis tell When executors look idle but the job is not progressing, look at the driver. Idle executors plus a busy

          5. Locality and Scheduling Delay

            The last piece of the scheduler is locality: when the TaskScheduler places a task, it prefers an executor that already has the task's data nearby. Moving compute to the data is cheaper than moving data to the compute, so Spark ranks placement options and tries the best one first. The delay knob that can quietly cost you When the ideal executor is busy, the TaskScheduler faces a choice: wait a moment for it to free up, or give up locality and run the task somewhere it has to fetch the data. The w