How a Spark Job Runs: Scheduler Internals

At scale a single straggling partition, or one lost node mid-shuffle, decides whether a job finishes in ten minutes or ten hours, and no amount of partition tuning explains why until you look at the scheduler. You can already decompose any job into stages and tasks and reason about parallelism and deploy mode, which is enough to optimize most jobs. This advanced tier is about the failure edges: how the driver actually schedules and dispatches tasks, what happens when a task or a node dies, and the cases where the driver itself, not the executors, becomes the thing that is slow. This is the layer that separates 'I tuned the partitions' from 'I understand why the job behaves this way under failure and at scale.'

DAGScheduler vs TaskScheduler

Daily Life

Interviews

Inside the driver there are two schedulers, and they have a clean division of labor. Knowing the split is what lets you answer 'how does Spark turn my code into running tasks' without hand-waving.

•DAGScheduler

Works at the stage level
Takes the logical plan and cuts it into stages at shuffle boundaries
Builds the stage dependency graph (the DAG)
Submits one stage's tasks only when its parent stages are done

•TaskScheduler

Works at the task level
Takes a stage's tasks and places them on executor slots
Decides which task goes to which executor
Handles retries, locality, and straggler relaunches

The handoff

The flow is: your transformations become a logical plan, the DAGScheduler slices that plan into stages and orders them by their shuffle dependencies, and then for each ready stage it hands a TaskSet to the TaskScheduler, which dispatches the individual tasks to slots. The DAGScheduler decides what runs and in what order at the stage level; the TaskScheduler decides where each task runs and what to do when one fails. Every advanced behavior we cover next lives in one of these two: stage ordering and retry-of-a-whole-stage in the DAGScheduler, task placement and per-task retry in the TaskScheduler.

This is also why a job can show many stages 'pending' while one runs: the DAGScheduler will not submit a downstream stage until its parent's shuffle output exists. The dependency graph, not your code order, dictates what is allowed to run.

Multiple Choice

Two schedulers live in the driver. Which one works at the STAGE level, cutting the plan into stages at shuffle boundaries and ordering them?

Task Failure and Retry

Daily Life

Interviews

In a single database, a failed query just fails. In a cluster of hundreds of machines, something failing is normal: a node gets preempted, a network blip drops a connection, an executor runs out of memory. Spark is built to absorb that. A failed task is retried automatically, by default up to spark.task.maxFailures (4) times, before the whole stage, and then the job, is declared failed.

A single task fails: the TaskScheduler reruns just that task, possibly on a different executor. Because partitions are recomputed from their inputs, this is safe and cheap.

The same task keeps failing on one node: Spark can blacklist (exclude) that node so retries land elsewhere, isolating a sick machine instead of looping on it.

An executor dies mid-stage: every task it was running, and any shuffle output it held, must be recomputed elsewhere. This is why losing one executor late in a stage can be expensive.

A task fails maxFailures times: the stage fails, and the whole job aborts. The retry budget is per task, not infinite.

The footgun: retries plus side effects

Automatic retry is only safe because a recomputed partition produces the same result. The moment a task has a side effect that is not idempotent, for example writing to an external system row by row, a retry can double-write. This is the deep reason Spark pushes you toward idempotent, atomic writes: the engine assumes any task may run more than once. A task that is not safe to re-run is a bug waiting for the first node failure to expose it. We cover the write-side discipline in the fault tolerance lesson; here the takeaway is that retry is a feature you must design for, not just rely on.

TIP

If a job fails with 'Task failed 4 times,' read the FIRST failure, not the last. The later failures are often retries hitting the same root cause (a poison record, an OOM on one fat partition). The first stack trace is the real signal.

Multiple Choice

Spark retries a failed task automatically, so a task may run more than once. What property must a task's writes have for that retry to be safe?

Speculative Execution

Daily Life

Interviews

Recall the barrier: a stage cannot finish until its slowest task does. Speculative execution is Spark's answer to a straggler caused by a bad machine. When a task runs far longer than its peers, Spark can launch a second copy of it on a different executor, and whichever finishes first wins; the other is killed. It is a bet that the slowness is the machine, not the data.

•Speculation helps

A flaky disk on one node
A noisy neighbor stealing CPU
A transient network slowdown
The task's data is normal-sized; the node is the problem

•Speculation does not help (and wastes resources)

Data skew: the task is slow because its partition is huge
A second copy reads the same huge partition and is just as slow
You now burn two slots on the same doomed work
The fix is repartitioning or salting, not speculation

The interview-grade distinction

This is the cleanest way to show you understand the difference between a slow node and a slow partition. Speculation re-runs the same work elsewhere, so it only helps when the work itself is fine and the location is bad. Skew is the opposite: the work is genuinely larger, so re-running it changes nothing. Candidates who suggest 'turn on speculative execution' to fix skew reveal they have conflated the two. The honest answer is that speculation is for stragglers caused by the environment, and skew needs a data-layout fix.

Multiple Choice

Speculative execution relaunches a straggler task on another node. Which cause of a slow task can it NOT fix?

The Driver as Bottleneck

Daily Life

Interviews

We started by saying the driver does not touch your data. That is true, and yet the driver can still be the thing that makes your job slow or kills it. Because it is the single coordinator, a few patterns route real load through it, and at scale that load matters. This is the most underappreciated section in the whole lesson, because people instinctively blame the executors.

collect() pulls every result row to the driver's memory. On a big result this OOMs the driver instantly. The driver has one machine's memory; the executors have the whole cluster's.

Broadcasting a large variable: the driver assembles the broadcast and ships it to every executor. A too-large broadcast strains the driver and the network.

Very high task counts: the driver schedules, tracks, and collects status for every task. Millions of tiny tasks (the over-parallelism case) flood the driver with bookkeeping, and it becomes the bottleneck even though each task is trivial.

Many small jobs in a tight loop: each action is a round-trip through the driver. A Python loop firing thousands of tiny actions serializes the cluster behind the driver.

The diagnosis tell

When executors look idle but the job is not progressing, look at the driver. Idle executors plus a busy driver is the signature of a driver bottleneck: the cluster is waiting to be told what to do. The fixes are structural, not just more hardware: avoid collect() on large data (write to storage instead), keep broadcasts small, and keep task counts sane so the driver is not drowning in scheduling overhead. A staff-level answer names the driver as a possible bottleneck unprompted, because most people forget it can be one.

TIP

Replace collect() with write to a path, or with take(n) or limit(n) when you only need a sample. The number of times a real outage traces back to one collect() on a dataset that grew past the driver's memory is genuinely surprising.

> Executors look idle and the driver is out of memory. Fill the blank with the action most likely to have pulled the WHOLE result into the single driver.

order_items.___()

collect

filter

groupBy

cache

Locality and Scheduling Delay

Daily Life

Interviews

The last piece of the scheduler is locality: when the TaskScheduler places a task, it prefers an executor that already has the task's data nearby. Moving compute to the data is cheaper than moving data to the compute, so Spark ranks placement options and tries the best one first.

Locality level	Meaning	Cost
PROCESS_LOCAL	Data is in the same executor's memory	Cheapest; no movement
NODE_LOCAL	Data is on the same machine, different process	Cheap; a local read
RACK_LOCAL	Data is on the same rack	Some network
ANY	Data is anywhere; fetch it over the network	Most expensive

The delay knob that can quietly cost you

When the ideal executor is busy, the TaskScheduler faces a choice: wait a moment for it to free up, or give up locality and run the task somewhere it has to fetch the data. The wait is governed by spark.locality.wait (3 seconds by default). On most jobs this is invisible. But if many tasks are all waiting out their locality timeout, you can see a stage that is mysteriously slow with idle-looking executors: everyone is politely waiting for a preferred slot. Lowering the wait trades locality for not stalling, which is the right trade when your data is already well-distributed and a remote read is cheap.

Locality is mostly something you observe, not tune, until it becomes a problem. The reason it closes this lesson is that it ties the whole model together: the scheduler is constantly balancing parallelism (keep every slot busy) against locality (keep compute near data), and that balance is one more place a job's wall-clock time is decided before a single row is processed.

Multiple Choice

The scheduler prefers to run a task where its data already lives. Which locality level is the CHEAPEST, where nothing has to move?

❯❯❯PUTTING IT ALL TOGETHER

> An interviewer describes a nightly job that used to finish in 40 minutes and now runs for three hours. The Spark UI shows one stage with most executors idle, a handful of tasks that have been running far longer than their peers, and a driver process pinned at high CPU. They ask you to walk through how you would find the cause.

Start at the DAGScheduler level and read the stage graph, because pending stages waiting on a parent's shuffle output is normal ordering, not a symptom, and it tells you which single stage actually owns the lost time.

Idle executors next to a busy driver is the driver bottleneck signature, so check for a collect() on large data, an oversized broadcast, or a task count so high the driver drowns in scheduling overhead.

If the executors are busy but a few tasks run far longer than their peers, separate a slow machine from a slow partition: speculative execution launches a duplicate copy and helps only when the work is fine and the location is bad.

If the long tasks are genuinely processing more data, speculation changes nothing and the fix is data layout, which is the distinction that separates a tuning answer from an understanding answer.

If executors look idle without a busy driver, suspect locality: many tasks all waiting out spark.locality.wait at 3 seconds each stalls a stage, and lowering it trades locality for keeping slots full when data is already well distributed.

Close by noting that any retry, whether from spark.task.maxFailures or from speculation, assumes the task is safe to run twice, so a non idempotent write is the next thing to audit once the slowness is explained.

KEY TAKEAWAYS

The driver holds two schedulers with a clean split: the DAGScheduler slices the plan into stages and orders them by shuffle dependency, while the TaskScheduler decides where each task runs and how to handle a per task failure.

A downstream stage stays pending until its parent's shuffle output exists, so the dependency graph and not your code order dictates what is allowed to run.

Spark retries a failed task up to spark.task.maxFailures times, 4 by default, before failing the stage and then the job, and that retry is only safe because a recomputed partition is expected to produce the same result.

Any task with a non idempotent side effect, such as writing to an external system row by row, can double write on retry, so idempotent atomic writes are a design requirement rather than a nicety.

Speculative execution fixes stragglers caused by a bad machine by racing a second copy of the same task, and it does nothing for skew, where the work is genuinely larger; proposing speculation as a skew fix conflates the two.

Idle executors plus a busy driver means the driver is the bottleneck, and the fixes are structural: avoid collect() on large data, keep broadcasts small, and keep task counts sane.

The failure edges separate tuning from understanding.

Category: Spark
Difficulty: advanced
Duration: 14 minutes
Challenges: 6 hands-on challenges

Topics covered: DAGScheduler vs TaskScheduler, Task Failure and Retry, Speculative Execution, The Driver as Bottleneck, Locality and Scheduling Delay

Lesson Sections

DAGScheduler vs TaskScheduler
Inside the driver there are two schedulers, and they have a clean division of labor. Knowing the split is what lets you answer 'how does Spark turn my code into running tasks' without hand-waving. The handoff The flow is: your transformations become a logical plan, the DAGScheduler slices that plan into stages and orders them by their shuffle dependencies, and then for each ready stage it hands a TaskSet to the TaskScheduler, which dispatches the individual tasks to slots. The DAGScheduler decid
Task Failure and Retry
In a single database, a failed query just fails. In a cluster of hundreds of machines, something failing is normal: a node gets preempted, a network blip drops a connection, an executor runs out of memory. Spark is built to absorb that. A failed task is retried automatically, by default up to spark.task.maxFailures (4) times, before the whole stage, and then the job, is declared failed. The footgun: retries plus side effects Automatic retry is only safe because a recomputed partition produces th
Speculative Execution
Recall the barrier: a stage cannot finish until its slowest task does. Speculative execution is Spark's answer to a straggler caused by a bad machine. When a task runs far longer than its peers, Spark can launch a second copy of it on a different executor, and whichever finishes first wins; the other is killed. It is a bet that the slowness is the machine, not the data. The interview-grade distinction This is the cleanest way to show you understand the difference between a slow node and a slow p
The Driver as Bottleneck
We started by saying the driver does not touch your data. That is true, and yet the driver can still be the thing that makes your job slow or kills it. Because it is the single coordinator, a few patterns route real load through it, and at scale that load matters. This is the most underappreciated section in the whole lesson, because people instinctively blame the executors. The diagnosis tell When executors look idle but the job is not progressing, look at the driver. Idle executors plus a busy
Locality and Scheduling Delay
The last piece of the scheduler is locality: when the TaskScheduler places a task, it prefers an executor that already has the task's data nearby. Moving compute to the data is cheaper than moving data to the compute, so Spark ranks placement options and tries the best one first. The delay knob that can quietly cost you When the ideal executor is busy, the TaskScheduler faces a choice: wait a moment for it to free up, or give up locality and run the task somewhere it has to fetch the data. The w