How a Spark Job Runs
The Cluster: Who Plans, Who Works
The mental shift from a single database
Partitions: The Unit of Parallelism
Why this is the number that matters
| Partition count | What happens | Symptom |
|---|---|---|
| Far too few (e.g. 4) | Most of the cluster sits idle; a few huge tasks | Slow job, low CPU usage, possible memory blowups |
| Far too many (e.g. 500,000 tiny ones) | Scheduling overhead per task dominates the real work | Slow job, driver busy, tiny task times |
| About right (~128 MB each) | Every core stays busy, tasks finish in seconds | Steady high CPU usage across the cluster |
The 128 MB target is a default, not a law: it is set by spark.sql.files.maxPartitionBytes and it exists because that size balances task startup cost against keeping a partition in memory. You will tune it later. For now, hold onto the shape: a partition is a chunk of rows, and one task chews through exactly one chunk.
Transformations vs Actions
Why a database can hide this and Spark cannot
Cores and Slots
Slots and partitions decide your wall-clock time
This is the level at which 'add more executors' is sometimes right and sometimes useless. More slots help only if you have enough partitions to fill them. Adding 100 slots to a 4-partition job does nothing: 96 slots sit empty. The arithmetic tells you instantly whether more hardware can possibly help.
A Job's Life, End to End
The one-sentence version you should be able to say cold
Your query is a promise. Something has to keep it.
- Category
- SPARK
- Difficulty
- beginner
- Duration
- 12 minutes
- Challenges
- 7 hands-on challenges
Topics covered: The Cluster: Who Plans, Who Works, Partitions: The Unit of Parallelism, Transformations vs Actions, Cores and Slots, A Job's Life, End to End
Lesson Sections
- The Cluster: Who Plans, Who Works
When you run a query against Postgres, one process does the work. Spark splits that single role into three. The driver is the process running your program: it holds your code, builds the execution plan, and decides what work needs doing. It is the only part that sees the whole job. The executors are separate processes on separate machines that do the actual data crunching. The cluster manager is the layer that owns the pool of machines and hands executors to your job when it asks. The mental shi
- Partitions: The Unit of Parallelism
Your data does not arrive at an executor as one big table. Spark splits it into chunks called partitions. A partition is a contiguous slice of the rows, typically targeted around 128 MB, that lives in memory on one executor. A billion-row table might be 8,000 partitions. This split is the single most important idea in Spark, because it is the unit of parallelism: one task processes exactly one partition. 1 : 1 Why this is the number that matters Because tasks map one-to-one onto partitions, the
- Transformations vs Actions
Here is the thing that surprises everyone coming from SQL. When you write df.filter(...).groupBy(...).agg(...), nothing runs. Spark does not read a single row. You have only described work. The description is called a transformation, and transformations are lazy: they build up a plan and return immediately. The data moves only when you call an action. Why a database can hide this and Spark cannot In SQL, a statement is a complete unit: you send it, it runs, you get rows. There is no gap between
- Cores and Slots
An executor is not a single worker. It has N cores, and each core can run one task at a time. So an executor with 5 cores is processing 5 partitions simultaneously. Think of a core as a slot: a place a task can be running right now. Your total parallelism is the sum of all slots across all executors. Slots and partitions decide your wall-clock time Put the last two sections together. You have a fixed number of partitions (the work) and a fixed number of slots (the workers). If you have 200 parti
- A Job's Life, End to End
Now we narrate one full run, using only the pieces from B1 through B4. This is the answer to the single most common Spark interview opener: walk me through how Spark runs a job. The trick is to answer along the path the work actually travels, not as a list of vocabulary. The one-sentence version you should be able to say cold An action triggers the driver to plan the work, split the data into partitions, ask the cluster manager for executors, send one task per partition to the executor slots, an