How a Spark Job Runs

Netflix runs Spark jobs over more than a petabyte of viewing data a day, and the line of code an engineer writes to do it looks almost exactly like the SQL you already know. The difference is what runs it: not one database engine on one machine, but a fleet of hundreds, with your one transformation quietly split into thousands of small tasks running in parallel across them. Everything that makes Spark fast, and everything that makes it baffling when it is slow, comes from that one difference. The foundation of all of it? A cluster, a pile of partitions, and a single task.

The Cluster: Who Plans, Who Works

Your database was one engine: it read your statement, planned the work, ran it, and handed back rows, all inside one process on one machine. Spark takes that single role and splits it across three separate actors. Get these three straight and most of Spark stops being mysterious, because every later idea in this lesson is really a statement about one of them.

The driver is the process that runs your program: it holds your code, turns it into a plan, and decides what work needs doing. There is exactly one driver per application, meaning per program you submit, and it is the only part of the system that sees the whole picture. The executors are separate processes, usually on separate machines, that do the actual data crunching. A real cluster has many of them, and they do all the heavy lifting. The cluster manager is the layer that owns the pool of machines and rents executors out: when your application starts, the driver asks the manager for executors and is granted some.

Driver

The brain

Runs your code, builds the execution plan, and hands out work. One per application. It directs; it never processes data itself.

Executor

The hands

A process on a worker machine that reads, filters, and aggregates real data on its partitions. A cluster has many.

Cluster manager

The landlord

Owns the machines and grants executors when the driver asks. YARN, Kubernetes, or Spark's own standalone manager.

The hardest part of this picture for someone coming from SQL is that the driver does not touch your data. While the job is processing, it never sees a single input row; at most it receives the small final result at the end. It reads your code, builds a plan, ships that plan out to the executors, and then waits for them to report back. All of the data lives out on the executors, spread across the cluster, and the work travels to the data rather than the data travelling to one place. This is the central trick of distributed computing: moving a small description of the work to where the bytes already are is cheap, while pulling terabytes of bytes to one machine is ruinously expensive. When you internalise that the driver coordinates but does not compute, a whole class of Spark performance bugs becomes obvious in advance.

Why this matters the moment a job is slow

Because the work is split across three actors, the first useful question when a Spark job misbehaves is always: which of the three is the bottleneck? A busy driver usually means you asked it to do something it should not, like pulling a large result back to itself or preparing a huge broadcast. Starved executors mean the job did not get enough cores or memory to chew through the data. A stingy cluster manager means your job asked for executors and never received them, so it sits half-idle waiting for resources that are not coming. Naming the layer is most of the diagnosis. An engineer who can say "the executors are fine, the driver is the bottleneck because of that collect()" is already ahead of one who just says "Spark is slow."

A single database (Postgres)	A Spark cluster
One process plans and executes the query	The driver plans; the executors execute
All the data lives in one place	Data is spread across many machines
You scale by buying a bigger machine	You scale by adding more machines
The planner and the worker are the same thing	The planner (driver) never touches a row

That comparison is the whole mental shift in one table. A database scales vertically: when it is slow, you give it more CPU and memory and the same single engine runs faster. Spark scales horizontally: when it is slow, you add more executors and the work spreads wider. The catch, which the rest of this lesson unpacks, is that spreading work wider only helps if the work is actually divisible into independent pieces, and if those pieces are sized so every machine stays busy. Get that wrong and a hundred-machine cluster can be slower than your laptop. So the next question, naturally, is: how does Spark divide the data into pieces in the first place?

Which actor in a Spark cluster plans the job and hands out work, but never processes data itself?

Hold onto the three actors as you go: every later idea in this lesson is really a statement about one of them. Partitions and tasks are about how the executors divide and chew through data. Lazy evaluation is about what the driver does before it ships anything out. Cores and slots are about how much the executors can do at once. By the end you will be able to narrate a full job as a conversation between these three, which is exactly what a strong candidate does when an interviewer says "walk me through how Spark runs a job."

Partitions: The Unit of Parallelism

Your data does not arrive at an executor as one big table. The very first thing Spark does with a dataset is cut it into chunks called partitions. A partition is a contiguous slice of the rows, typically targeted around 128 megabytes, that lives in memory on one executor. A billion-row table might become eight thousand partitions scattered across the cluster. This split is the single most important idea in all of Spark, because it is the unit of parallelism: one task processes exactly one partition, and nothing smaller. If you remember one sentence from this entire lesson, make it that one. Almost every tuning decision you will ever make is really a decision about how many partitions exist and how big each one is.

Tasks to partitions: 1 : 1

Because tasks map one-to-one onto partitions, the partition count is a hard ceiling on how parallel a job can be. If your data sits in four partitions, then at most four things can happen at once, no matter how many machines you rented. Rent two hundred CPU cores for a four-partition job and a hundred and ninety-six of them sit idle while four tasks grind away. Conversely, if the data is in eight thousand partitions and you have two hundred cores, Spark works through them two hundred at a time, in waves. A SQL engine hides this from you completely; Spark makes it your decision, because getting it wrong is one of the most common reasons a job runs slowly, and one of the easiest to fix once you can see it.

Too many, too few, or just right

There are two ways to get partitioning wrong, and they fail in opposite directions. Too few partitions, and most of the cluster sits idle while a handful of enormous tasks crawl through oversized chunks of data, often running out of memory in the process. Too many partitions, and the per-task overhead takes over: every task has a fixed cost to launch, serialise, and report back, and when each task does only milliseconds of real work, that overhead dominates and the driver drowns in bookkeeping. The sweet spot is partitions sized so each one holds a sensible amount of data and each task does seconds of real work, not milliseconds and not minutes. The table below is the intuition you carry into every job.

Partition count	What happens	The symptom you'd see
Far too few (e.g. 4)	Most of the cluster idles; a few huge tasks	Slow job, low CPU usage, occasional out-of-memory errors
Far too many (e.g. 500,000 tiny ones)	Per-task scheduling overhead dominates the real work	Slow job, a busy driver, task times in milliseconds
About right (~128 MB each)	Every core stays busy; tasks finish in seconds	Steady, high CPU usage across the whole cluster
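
The shape of that table can be reproduced with a toy cost model. This is only a sketch: the function name is hypothetical, and the fixed 0.05 s launch overhead per task, the 100 MB/s per-core throughput, and the roughly 1 TB input are illustrative assumptions, not measurements from any real cluster. It also ignores memory pressure and driver bookkeeping, which make both extremes worse in practice.

```python
import math

def estimated_stage_seconds(total_mb, partitions, slots,
                            task_overhead_s=0.05, mb_per_s_per_core=100.0):
    """Toy model: waves * (fixed per-task overhead + time to scan one partition)."""
    waves = math.ceil(partitions / slots)
    mb_per_partition = total_mb / partitions
    task_seconds = task_overhead_s + mb_per_partition / mb_per_s_per_core
    return waves * task_seconds

total_mb = 1_024_000  # roughly 1 TB of input (assumed figure)
for partitions in (4, 8_000, 500_000):
    secs = estimated_stage_seconds(total_mb, partitions, slots=200)
    print(f"{partitions:>7} partitions -> ~{secs:,.0f} s")
```

Under these assumptions the 8,000-partition run comes out far ahead of both extremes: four partitions leave 196 slots idle, while half a million partitions spend most of their time on launch overhead.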

The 128 megabyte target is a default, not a law. It is governed by spark.sql.files.maxPartitionBytes, and it exists because that size balances the cost of starting a task against the cost of holding a partition in memory. You will learn to tune it. For now, hold the shape: a partition is a chunk of rows, one task chews through exactly one chunk, and the number of chunks sets how parallel the whole job can ever be.
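
As a rough illustration of how that target turns into a partition count, the sketch below divides a dataset's size by the 128 MB default. The function name is hypothetical, and the estimate is deliberately crude: a real file scan also depends on file boundaries, whether the format is splittable, and other settings.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # default of spark.sql.files.maxPartitionBytes

def estimated_read_partitions(total_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough count of input partitions for a splittable file source."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))

one_tib = 1024 ** 4
print(estimated_read_partitions(one_tib))  # 8192
```

A terabyte of input lands at about eight thousand partitions, which is where a figure like "eight thousand partitions" for a large table comes from.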

Let us make this concrete with a real query. When you group and aggregate data, Spark has to bring all the rows for a given key together onto one partition before it can compute that key's result. That regrouping is the expensive part, and it is the first place partitioning visibly bites. The snippet below joins order lines to products and totals revenue per category. Notice that the groupBy is the step where rows for one category, previously scattered across many partitions, are pulled together so they can be summed.

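from pyspark.sql import functions as F

# Assumes order_items and products are existing DataFrames.
revenue_by_category = (order_items
   .join(products, "product_id")
   .groupBy("category")          # regroups rows so each category meets on one partition
   .agg(F.sum(F.col("quantity") * F.col("unit_price")).alias("revenue"))
   .orderBy(F.col("revenue").desc()))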

Now try writing the regrouping step yourself. The scaffold below already reads the products table and aggregates; you supply the two pieces that make a per-category total: the operation that regroups the rows by key, and the aggregation that sums the price within each group. Both are ops you have just seen. Drag the right tiles into the blanks.

> Complete the aggregation so it totals price per category. You supply the wide operation that regroups rows by key, and the sum that runs within each group.

(products
   .___("category")
   .agg(F.___("price").alias("revenue")))

Options: groupBy, sum, filter, orderBy, select

That groupBy is doing something physical: it is forcing each category's rows to meet on a single partition so they can be summed together. Spark picks that partition by hashing the key, and the number of partitions after the regrouping comes from a setting, spark.sql.shuffle.partitions (200 by default), not from the number of distinct keys. With ten categories, at most ten of those partitions receive any rows and the rest sit empty; with a million distinct keys, every partition holds thousands of keys. So what you group by quietly decides how evenly the work spreads, and therefore how parallel the next stage of the job can really be. You are already, without thinking about it, making a partitioning decision every time you choose what to group by. The next section turns to the other thing that surprises every newcomer: when does any of this actually run?
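
The regrouping can be sketched in plain Python as hash partitioning: each key is hashed and assigned to one of a fixed number of partitions. This is a toy model, not Spark's implementation. The crc32 hash is a stand-in for Spark's own hash function, the category names are made up, and the partition count assumes the default value of spark.sql.shuffle.partitions.

```python
import zlib
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 200  # default of spark.sql.shuffle.partitions

def shuffle_partition(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Assign a key to a partition by hashing (crc32 is a stand-in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

categories = [f"category_{i}" for i in range(10)]         # hypothetical keys
rows = [(c, 1) for c in categories for _ in range(1000)]  # 1,000 rows per category

rows_per_partition = Counter(shuffle_partition(key) for key, _ in rows)
print(len(rows_per_partition), "of", NUM_SHUFFLE_PARTITIONS, "partitions hold data")
```

Every row with the same key lands in the same partition, so ten distinct keys can fill at most ten of the two hundred partitions, and the remaining slots have nothing to do in the next stage.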

Transformations vs Actions

Here is the thing that catches everyone coming from SQL. When you write df.filter(...).groupBy(...).agg(...), nothing runs. Spark does not read a single row. You have only described work. The description is called a transformation, and transformations are lazy: each one adds a step to a plan and returns immediately, having done no computation at all. You can chain twenty of them and the cluster stays idle the entire time. The data actually moves only when you call an action. This is not a quirk to work around; it is the mechanism that lets Spark be fast, and understanding it is the difference between predicting a job's behaviour and being baffled by it.

Transformations (lazy)

filter, select, groupBy, join, withColumn
Return instantly and run nothing
Just append a step to the plan
You can chain dozens for zero cost

Actions (eager)

count, collect, write, show, take
Block until real work finishes
Force the whole accumulated plan to execute
This is the line where the bill comes due
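
The two lists above can be mimicked with a small toy class. This is not how Spark is implemented; LazyFrame and its methods are hypothetical names used only to show the pattern of recording steps and running them later.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations record steps, actions run them."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: return a new plan with one more step; no rows are touched.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, fn):
        # Transformation: also only recorded.
        return LazyFrame(self._rows, self._steps + (("select", fn),))

    def collect(self):
        # Action: only now is the accumulated plan applied to the rows.
        out = iter(self._rows)
        for kind, fn in self._steps:
            out = filter(fn, out) if kind == "filter" else map(fn, out)
        return list(out)

    def count(self):
        # Action: runs the whole plan and returns the number of rows left.
        return len(self.collect())

touched = []
def is_in_stock(row):
    touched.append(row)  # lets us observe when work actually happens
    return row["in_stock"] == 1

products = [{"name": "a", "in_stock": 1}, {"name": "b", "in_stock": 0}]
plan = LazyFrame(products).filter(is_in_stock).select(lambda r: r["name"])
print(len(touched))    # 0: nothing has run yet
print(plan.collect())  # ['a']
print(len(touched))    # 2: the action forced the filter to run
```

Because the action receives the whole list of steps at once, a real engine has the chance to reorder or merge them before touching any data, which a step-by-step eager API cannot do.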

In SQL there is no gap between describing and executing: you send a statement, it runs, rows come back, all in one breath. Spark deliberately holds that gap open so it can see your entire chain of transformations before it commits to a plan. Seeing the whole chain is what lets it be clever: it can collapse adjacent steps, push a filter all the way down into the file read so it never loads rows it will throw away, and skip work you do not actually need. Laziness is not Spark being slow to start; it is Spark refusing to commit to how it runs your query until it has seen the whole query. The price you pay for that cleverness is a debugging quirk worth knowing about.

Why your stack trace lies to you

Because nothing runs until an action, the line of code that throws a runtime error is often not the line that is actually wrong. A filter thirty lines up that calls a user-defined function which crashes on an unexpected null sits harmlessly in the plan until the count() at the bottom finally forces everything to execute, and then the failure surfaces at the count(). (Mistakes Spark can check against the schema, such as a misspelled column name, are usually reported straight away; it is the failures that depend on the data that wait.) New Spark users waste hours staring at the action, when the real culprit is some lazy transformation far above it. The habit to build is to read upward from the action: the action is merely where the accumulated bill for every transformation above it comes due. Once that clicks, the laziness stops feeling like a trap and starts feeling like a map of where the work really lives.

TIP

When debugging, treat the action as a payment, not a cause. An action like count() forces every lazy transformation above it to run, so an error or a slowdown that appears at the action almost always originates somewhere above. Read upward from the action to find the expensive or broken step.
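
The same effect can be seen in plain Python, whose built-in map is lazy in a loosely similar way. This is only an analogy for the debugging habit, not Spark code, and parse_price is a hypothetical function.

```python
def parse_price(text):
    # Hypothetical row-level function; it fails on malformed input.
    return float(text)

raw_prices = ["9.99", "oops", "4.50"]

# "Transformation": map() is lazy, so this line succeeds even though
# one value is malformed.
prices = map(parse_price, raw_prices)

# "Action": consuming the iterator is where the failure finally surfaces.
try:
    total = sum(prices)
except ValueError as err:
    print("failed at the action:", err)
```

The traceback points at the sum, but the fix belongs in parse_price or in the data it was given, which is exactly the read-upward habit described above.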

Let us watch laziness in action. The snippet below filters the products table down to in-stock rows and then counts them per category. The filter is a lazy, narrow transformation: it adds a step to the plan and moves no data. Nothing happens until the result is requested. Run it and the whole chain fires at once; the filter, the grouping, and the count all execute in the single burst that the action triggers.

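from pyspark.sql import functions as F

# Assumes products is an existing DataFrame with in_stock and category columns.
in_stock_counts = (products
   .filter(F.col("in_stock") == 1)              # lazy: only recorded in the plan
   .groupBy("category")
   .agg(F.count(F.lit(1)).alias("in_stock_count"))
   .orderBy(F.col("in_stock_count").desc()))

in_stock_counts.show()   # the action: only now does the whole chain run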

Now you complete one. The scaffold groups products and counts them; you supply the lazy narrow transformation that keeps only in-stock rows, and the grouping key it counts within. Remember that the filter you add does nothing on its own -- it only adds a step to the plan that the count will later force to run.

> Keep only in-stock products, then count them per category. You supply the lazy filter and the column the count groups by.

(products
   .___(F.col("in_stock") == 1)
   .groupBy("___")
   .agg(F.count(F.lit(1)).alias("n")))

filter

Cores and Slots

An executor is not a single worker. It has a number of cores, and each core can run one task at a time. So an executor with five cores can process up to five partitions simultaneously. The cleanest way to picture it is as slots: a slot is a place where a task can be running right now. Your total parallelism -- the number of tasks that can be in flight across the whole cluster at any instant -- is simply the sum of all the slots on all the executors. This is the number that, together with the partition count from earlier, decides how long your job takes on the wall clock.

10 executors x 5 cores = 50 slots
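
The slot count is a one-line sum. The helper below is a hypothetical sketch; it also accounts for spark.task.cpus, the setting that reserves more than one core per task (1 by default), which this lesson otherwise leaves at its default.

```python
def total_slots(executor_cores, task_cpus=1):
    """Sum of task slots across executors.

    executor_cores: cores per executor, one entry per executor.
    task_cpus: cores reserved per task (spark.task.cpus, 1 by default).
    """
    return sum(cores // task_cpus for cores in executor_cores)

print(total_slots([5] * 10))               # 50
print(total_slots([5] * 10, task_cpus=2))  # 20
```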

Now put the last two sections together, because this is where the whole execution model snaps into focus. You have a fixed number of partitions, which is the work, and a fixed number of slots, which is the workforce. If a stage has two hundred partitions and the cluster has fifty slots, Spark runs the partitions in four waves of fifty: fifty tasks start, finish, and the next fifty begin, and so on. Each wave takes about as long as its slowest task, and the stage takes about four of those. That single calculation -- partitions divided by slots, rounded up to a count of waves -- is the back-of-the-envelope model you use to reason about almost any Spark job's runtime.

200 partitions / 50 slots = 4 waves
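
The wave arithmetic, written out as a minimal sketch with hypothetical function names. Treat it as a planning estimate: Spark does not actually wait for a whole wave to finish, since each slot picks up the next task as soon as it is free, so real stages usually finish somewhat sooner than this upper-bound style figure.

```python
import math

def waves(partitions, slots):
    """Number of scheduling waves needed to run one task per partition."""
    return math.ceil(partitions / slots)

def estimated_stage_time(partitions, slots, slowest_task_seconds):
    # Rough estimate: each wave lasts about as long as its slowest task.
    return waves(partitions, slots) * slowest_task_seconds

print(waves(200, 50))                     # 4
print(waves(201, 50))                     # 5
print(estimated_stage_time(200, 50, 30))  # 120
```

Note the jump from four waves to five when a single extra partition is added: one leftover task costs a whole extra wave in this model.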

When more hardware does nothing

The slots-and-partitions arithmetic explains one of the most common and most expensive mistakes in Spark: throwing hardware at a job that cannot use it. If a stage has only forty partitions and you give it two hundred slots, a hundred and sixty slots sit empty the entire time, because there is no work to put in them -- a task cannot be split across slots. Adding executors here costs money and changes nothing. The fix is more partitions, not more hardware. The reverse mistake is just as real: if every partition is enormous, one slot can be stuck on a single oversized task for many minutes while the others finish, and the whole stage waits on it. That stuck-on-one-partition pattern is the seed of every skew problem you will ever debug, and it falls straight out of this same arithmetic.

This is exactly why "just add more executors" is sometimes the right call and sometimes useless. More slots help only when there are enough partitions to fill them. Before you scale the cluster, do the division: if partitions are already fewer than your current slots, more hardware cannot help, and the lever you actually want is repartitioning the data.
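
That check fits in a few lines. The helper below is a hypothetical sketch of the division described above, not a Spark API.

```python
def scaling_advice(partitions, slots):
    """Back-of-the-envelope check before adding executors (hypothetical helper)."""
    busy = min(partitions, slots)  # a task cannot be split across slots
    idle = slots - busy
    if idle > 0:
        return f"{idle} of {slots} slots idle: add partitions, not hardware"
    return "every slot has work: more slots can shorten the stage"

print(scaling_advice(partitions=40, slots=200))
print(scaling_advice(partitions=8000, slots=200))
```

In PySpark the usual way to act on the first answer is DataFrame.repartition(n), which redistributes the rows into n partitions at the cost of moving data across the cluster.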

A stage has 200 partitions and the cluster has 50 slots (executors x cores). How many waves does the stage run in?

You now have both halves of the runtime model: partitions are the work, slots are the workforce, and the job runs in waves of the latter chewing through the former. There is one last thing to assemble before you can narrate a complete job out loud, and it is simply putting every piece so far into a single end-to-end story. That is the final section.

A Job's Life, End to End

Now we narrate one full run, using only the pieces you have built: driver, executors, cluster manager, partitions, tasks, slots, transformations, and actions. This is the answer to the single most common Spark interview opener -- "walk me through how Spark runs a job" -- and the trick to answering it well is to follow the path the work actually travels, rather than reciting a list of vocabulary. Each step hands off to the next, and naming the hand-offs in order is what separates a confident answer from a vague one.

1. You call an action, say count(). Until this moment everything was lazy description; the action is what wakes the job up and tells Spark to actually do the work.

2. The driver turns your chain of transformations into a plan: it figures out which data to read, how to split it into partitions, and where the expensive steps fall.

3. The driver asks the cluster manager for executors and packages the work into tasks -- one task per partition.

4. The driver ships those tasks to the executor slots. Each slot runs its task against its one partition, in waves if there are more partitions than slots.

5. Executors send results, or a partial summary like a per-partition count, back to the driver, which assembles the final answer and hands it to you.
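
The steps above can be acted out in miniature with plain Python. This is a toy under loud assumptions: a thread pool stands in for the executor slots, all function names are hypothetical, and the rows sit in one process, whereas in real Spark the data is already spread across the executors and the driver never holds it.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(rows, num_partitions):
    """Driver-side planning: cut the data into roughly equal chunks."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def count_task(partition):
    """Executor-side work: one task counts the rows in exactly one partition."""
    return len(partition)

def run_count(rows, num_partitions, slots):
    partitions = split_into_partitions(rows, num_partitions)  # plan: one task per partition
    with ThreadPoolExecutor(max_workers=slots) as executor_slots:
        # At most `slots` tasks run at once; the rest wait for a free slot.
        partial_counts = list(executor_slots.map(count_task, partitions))
    return sum(partial_counts)  # the driver combines the small partial results

rows = list(range(1_000))
print(run_count(rows, num_partitions=8, slots=4))  # 1000
```

Only the per-partition counts travel back to the "driver" here, eight small integers rather than a thousand rows, which mirrors why a count() is cheap to return while a collect() of the same data is not.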

Read that sequence twice, because being able to produce it smoothly is worth more in an interview than almost any single piece of trivia. Notice how every actor and every concept from this lesson appears exactly once, in the order the work flows through them. The driver plans, the cluster manager grants, the executors run, the partitions define the tasks, the slots run them in waves, and the action is the spark that set the whole thing in motion. Nothing here is memorised in isolation; it is one story with a beginning, a middle, and an end.

The one-sentence version to say cold

Compress the whole sequence into a single breath and you have the answer to keep in your back pocket: an action triggers the driver to plan the work, split the data into partitions, ask the cluster manager for executors, send one task per partition to the executor slots, and collect the results back. If you can say that cleanly and then stop, you have answered the question better than most candidates, who either drown in API names or never reach the word "partition" at all. The strongest version ends by naming where the cost lives -- the shuffle, which the next lesson is entirely about.

Let us run one complete job that exercises the whole path: it reads two tables, joins them, filters, groups, aggregates, and orders the result. As you run it, narrate it to yourself using the five steps above. The action at the end is what makes every lazy step before it fire; the join, the groupBy, and the final orderBy are where data moves between partitions; the filter streams row by row on the executors.

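from pyspark.sql import functions as F

# Assumes order_items and products are existing DataFrames.
units_by_category = (order_items
   .join(products, "product_id")                # wide step: rows are matched up by key
   .filter(F.col("in_stock") == 1)              # narrow step: row by row, no regrouping
   .groupBy("category")                         # wide step: rows are regrouped by key
   .agg(F.sum("quantity").alias("units_sold"))
   .orderBy(F.col("units_sold").desc()))

units_by_category.show()   # the action that makes every lazy step above fire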

That job touched every idea in the lesson. It read data split into partitions, it had the driver plan a chain of lazy transformations, it ran one task per partition across the executor slots in waves, and an action at the end forced the whole thing to execute and pulled the small result back to the driver. You can now look at any Spark job and see the machine underneath it.

✓ Do

Reason about a job in execution-path order: action, then plan, then partitions, then tasks on slots, then results.
Keep partition count at or above your slot count so every executor core has a task to run instead of sitting idle.
Reach for an action (count, collect, write) only when you actually need a result, since each one forces the whole plan to run.
Pull only small results back to the driver with collect; write large results out from the executors instead.

✗ Don't

Don't assume more machines means a faster job; extra slots do nothing when there are not enough partitions to fill them.
Don't expect a transformation to do any work on its own; nothing runs until an action triggers it.
Don't collect a large dataset to the driver; it is a single machine and will run out of memory.
Don't picture the driver processing rows; it plans and directs, while the executors are the only actors that touch data.

PUTTING IT ALL TOGETHER

> You are a data engineer at an online marketplace asked to build a daily report of units sold per product category. The catalog and the order history are far too large for one machine, so the job runs on a Spark cluster: it reads the data, joins orders to products, filters to in-stock items, and aggregates by category.

The driver receives your code, builds the plan, and hands work to the executors, but it never reads a single order row itself.

The order data arrives already split into partitions, and Spark runs one task per partition, so the number of partitions decides how parallel the read and join can be.

The join, filter, and aggregate are lazy: nothing runs until the write at the end, which is the action that forces the whole plan to execute.

If the cluster has more slots than there are partitions, the extra slots sit idle, so repartitioning is the lever that actually speeds the job up.

KEY TAKEAWAYS

The driver plans and directs but never touches data; executors do the real work; the cluster manager owns the machines and grants executors.

Data is split into partitions, and one task processes exactly one partition, so partition count is the ceiling on parallelism.

Transformations are lazy descriptions that run nothing; an action is what forces the whole accumulated plan to execute.

Total parallelism is slots (executors times cores); wall-clock time is roughly the number of waves (partitions divided by slots, rounded up) times the length of one wave.

More hardware only helps when there are enough partitions to fill the extra slots; otherwise the lever you want is repartitioning.

The execution path is one story: action, then plan, then partitions, then tasks, then slots, then results.

Your query is a promise. Something has to keep it.

Category: Spark
Difficulty: beginner
Duration: 12 minutes
Challenges: 7 hands-on challenges

Topics covered: The Cluster: Who Plans, Who Works, Partitions: The Unit of Parallelism, Transformations vs Actions, Cores and Slots, A Job's Life, End to End
