How a Spark Job Runs: Scheduler Internals
DAGScheduler vs TaskScheduler
The handoff
This is also why a job can show many stages 'pending' while one runs: the DAGScheduler will not submit a downstream stage until its parent's shuffle output exists. The dependency graph, not your code order, dictates what is allowed to run.
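The gating rule above can be sketched as a toy model in plain Python. This is not Spark's DAGScheduler code: the `stage_waves` function and the `parents` input format are hypothetical, and real stages finish one at a time rather than in neat waves. It only illustrates that a stage becomes runnable once all of its parents have finished, regardless of the order in which the stages were declared.

```python
def stage_waves(parents):
    """Group stages into waves in the order they become runnable.

    `parents` maps a stage id to the set of stage ids whose shuffle
    output it reads (hypothetical input format for this sketch).
    """
    finished = set()
    waiting = set(parents)
    waves = []
    while waiting:
        # A stage is ready only when every parent's output already exists.
        ready = sorted(s for s in waiting if parents[s] <= finished)
        if not ready:
            raise ValueError("dependency cycle: no stage can be submitted")
        waves.append(ready)
        finished.update(ready)
        waiting.difference_update(ready)
    return waves


# Stages 0 and 1 are independent scans, 2 joins them, 3 aggregates the join.
# The dict is deliberately written "backwards" to show declaration order is irrelevant.
example = {3: {2}, 2: {0, 1}, 0: set(), 1: set()}
print(stage_waves(example))  # [[0, 1], [2], [3]]
```

While the first wave runs, stages 2 and 3 are the ones that would show as pending.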
Task Failure and Retry
The footgun: retries plus side effects
Speculative Execution
The interview-grade distinction
The Driver as Bottleneck
The diagnosis tell
Locality and Scheduling Delay
| Locality level | Meaning | Cost |
|---|---|---|
| PROCESS_LOCAL | Data is in the same executor's memory | Cheapest; no movement |
| NODE_LOCAL | Data is on the same machine, different process | Cheap; a local read |
| RACK_LOCAL | Data is on the same rack | Some network |
| ANY | Data is anywhere; fetch it over the network | Most expensive |
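As a minimal sketch of how the ranking in the table could drive placement, the snippet below picks the free executor with the best locality level. The `best_placement` helper and the executor ids are hypothetical names for illustration; Spark's TaskScheduler is considerably more involved.

```python
# Locality levels from cheapest to most expensive, as in the table above.
LOCALITY_ORDER = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def best_placement(free_executors):
    """Pick the free executor with the cheapest locality for one task.

    `free_executors` maps an executor id to the locality level the task
    would get there (hypothetical input format). Returns an
    (executor_id, level) tuple, or None if nothing is free.
    """
    if not free_executors:
        return None
    return min(free_executors.items(),
               key=lambda item: LOCALITY_ORDER.index(item[1]))


offers = {"exec-1": "ANY", "exec-2": "NODE_LOCAL", "exec-3": "RACK_LOCAL"}
print(best_placement(offers))  # ('exec-2', 'NODE_LOCAL')
```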
The delay knob that can quietly cost you
Locality is mostly something you observe, not tune, until it becomes a problem. The reason it closes this lesson is that it ties the whole model together: the scheduler is constantly balancing parallelism (keep every slot busy) against locality (keep compute near data), and that balance is one more place a job's wall-clock time is decided before a single row is processed.
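The parallelism-versus-locality trade-off can be sketched with a two-level toy model. In Spark the wait is governed by `spark.locality.wait` (3s by default); everything else here, including the `place_task` function and its parameters, is a hypothetical simplification. The real scheduler steps down through the locality levels one at a time rather than jumping straight to ANY.

```python
def place_task(preferred_free_in_s, locality_wait_s=3.0):
    """Decide where a task runs and how long its slot sat idle first.

    preferred_free_in_s: seconds until the data-local executor frees up.
    locality_wait_s: how long the scheduler is willing to hold out for it.
    Returns (locality_level, seconds_waited_before_launch).
    """
    if preferred_free_in_s <= locality_wait_s:
        # Waiting pays off: the task runs next to its data.
        return ("PROCESS_LOCAL", preferred_free_in_s)
    # The wait expires: give up locality and fetch the data remotely.
    return ("ANY", locality_wait_s)


print(place_task(1.0))                       # ('PROCESS_LOCAL', 1.0)
print(place_task(5.0))                       # ('ANY', 3.0)
print(place_task(5.0, locality_wait_s=0.0))  # ('ANY', 0.0)
```

The second call is the costly case: the slot idles for the full wait and the task still ends up reading remotely, which hurts most when the tasks themselves are short relative to the wait.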
The failure edges separate tuning from understanding.
- Category: SPARK
- Difficulty: advanced
- Duration: 14 minutes
- Challenges: 6 hands-on challenges
Topics covered: DAGScheduler vs TaskScheduler, Task Failure and Retry, Speculative Execution, The Driver as Bottleneck, Locality and Scheduling Delay
Lesson Sections
- DAGScheduler vs TaskScheduler
Inside the driver there are two schedulers, and they have a clean division of labor. Knowing the split is what lets you answer 'how does Spark turn my code into running tasks' without hand-waving.
The handoff
The flow is: your transformations become a logical plan, the DAGScheduler slices that plan into stages and orders them by their shuffle dependencies, and then for each ready stage it hands a TaskSet to the TaskScheduler, which dispatches the individual tasks to slots. The DAGScheduler decid
- Task Failure and Retry
In a single database, a failed query just fails. In a cluster of hundreds of machines, something failing is normal: a node gets preempted, a network blip drops a connection, an executor runs out of memory. Spark is built to absorb that. A failed task is retried automatically: by default each task gets up to spark.task.maxFailures (4) attempts before the whole stage, and then the job, is declared failed.
The footgun: retries plus side effects
Automatic retry is only safe because a recomputed partition produces th
- Speculative Execution
Recall the barrier: a stage cannot finish until its slowest task does. Speculative execution is Spark's answer to a straggler caused by a bad machine. When a task runs far longer than its peers, Spark can launch a second copy of it on a different executor, and whichever finishes first wins; the other is killed. It is a bet that the slowness is the machine, not the data.
The interview-grade distinction
This is the cleanest way to show you understand the difference between a slow node and a slow p
- The Driver as Bottleneck
We started by saying the driver does not touch your data. That is true, and yet the driver can still be the thing that makes your job slow or kills it. Because it is the single coordinator, a few patterns route real load through it, and at scale that load matters. This is the most underappreciated section in the whole lesson, because people instinctively blame the executors.
The diagnosis tell
When executors look idle but the job is not progressing, look at the driver. Idle executors plus a busy
- Locality and Scheduling Delay
The last piece of the scheduler is locality: when the TaskScheduler places a task, it prefers an executor that already has the task's data nearby. Moving compute to the data is cheaper than moving data to the compute, so Spark ranks placement options and tries the best one first.
The delay knob that can quietly cost you
When the ideal executor is busy, the TaskScheduler faces a choice: wait a moment for it to free up, or give up locality and run the task somewhere it has to fetch the data. The w