Lazy Until You Ask

Spotify rebuilds Discover Weekly for hundreds of millions of listeners every Monday, and the Spark code that does it is mostly a long chain of transformations that, for most of its life, runs nothing at all. That silence is the idea that trips up almost everyone arriving from SQL: in Spark, writing a transformation and running it are two separate events. You build a recipe, and only a specific kind of call ever turns the stove on. Get the two words right, transformation and action, and the rest of this tier falls into place.

Nothing Runs Until an Action

Daily Life

Interviews

In a database, when you press run, the query runs. In Spark, when you write a transformation, nothing happens. You can chain a filter onto a select onto a join onto a groupBy, building a description ten lines long, and not a single byte of data has moved. Spark has simply written down what you asked for. The work only begins when you call an action, a special kind of method that demands an actual result: a count, the rows themselves, or a write to disk.

This split is called lazy evaluation, and the name fits. A transformation is lazy because it postpones the work; it returns a new DataFrame that remembers the operation but performs none of it. An action is eager: it forces the whole chain of postponed transformations to execute, right now, so it can hand you the result you asked for. Until an action fires, your DataFrame is a promise, not a table.

•Transformation (lazy)

filter, select, groupBy, join, withColumn
Returns a new DataFrame instantly
Runs NO work, moves NO data
Just records what you asked for

•Action (eager)

count, collect, show, write, take
Demands a real result
Forces the whole chain to execute
This is the moment the job runs

Build a chain and watch it stay silent. You read the orders table, filter to completed orders, group by region, and sum the profit. Three transformations, and your program flies through all three instantly, because nothing ran. Then you call count on the result, and the cluster lights up: it reads the data, filters it, shuffles it to group by region, sums each group, and counts the groups, all in one burst triggered by that single count. The action ignites everything stacked up behind it.

Watch it fire all at once

The snippet below builds a chain of transformations and then takes an action. Run it, and remember: every line before the action did nothing until the action arrived and forced the whole thing to run in one pass. Toggle between PySpark and Scala and notice the chain reads identically; the laziness is a property of Spark, not of the language you write it in.

(orders
   .filter(F.col("status") == "Completed")
   .groupBy("region")
   .agg(F.sum("profit").alias("total_profit"))
   .orderBy(F.col("total_profit").desc()))

When you ran that, the orderBy at the end did not trigger execution: in a notebook, displaying the result is the action. The rule is simple and absolute. Building the chain is free and instant. The result you can see, count, or save is what costs, and it costs the whole chain at once. Everything else in this lesson follows from that one rule.

It helps to translate this into the SQL mental model you already carry. In Postgres, a SELECT statement is a unit: you write it, you submit it, it runs. There is no moment where the query exists but has not executed. Spark splits that single act into two. The transformation chain is like composing the SQL text in an editor, where you can type a WHERE, then a JOIN, then a GROUP BY, and nothing happens because you have not pressed run. The action is pressing run. What changes is that Spark lets you hold that unexecuted query as a real object, pass it around, branch it, and extend it, all before a single row moves. That is what laziness gives you, and it is why a Spark program reads like building a description rather than issuing commands.

There is a practical surprise hidden in this for anyone debugging Spark for the first time. Because transformations run nothing, an error in your logic, a wrong column name, a type mismatch, a bad join key, does not appear on the line where you wrote it. It appears later, at the action, because that is the first moment Spark actually tries to execute the plan and discovers the problem. Engineers new to Spark often stare at the action line in confusion when the real mistake was ten lines earlier in a transformation. Once you know that the action is where the whole chain is checked and run, the stack traces stop being mysterious: read upward from the action through the transformations that fed it.

Why Laziness Makes Spark Fast

Daily Life

Interviews

Laziness can feel like an annoyance when you are debugging and an error only shows up three lines later, at the action, instead of where you wrote the typo. But it is the reason Spark is fast, and the designers chose it deliberately. Because Spark sees your entire chain before it runs anything, it can look at the whole plan and rearrange it for efficiency, the same way a good query optimizer does for SQL.

Think about what an eager system would have to do. If every transformation ran the instant you wrote it, Spark would read the full table to apply your filter, materialize that result, then read it again to apply your select, and so on, paying a full pass for every single step. By waiting, Spark can fuse your filter and your select into a single pass over the data, and it can push your filter down to the very beginning so it reads less data in the first place. The chain you wrote in a convenient order gets quietly reorganised into the cheapest order that produces the same answer.

It sees the whole plan

Spark reads your entire chain before running, so it can optimise across operations instead of one at a time.

It collapses redundant work

Adjacent narrow steps fuse into one pass; a filter slides to the front so less data is ever read.

It skips work you do not need

If you only ask for a count, Spark need not carry every column through the chain, just enough to answer.

Here is a concrete payoff. Suppose you select ten columns, then later in the chain keep only two. An eager system carries all ten columns through every step. A lazy system sees, at planning time, that only two columns survive to the end, and reads only those from disk. This is column pruning, and it is free, because Spark saw the end of your chain before it touched the start of it. The same logic applies to filters: a WHERE that you wrote late gets executed early, so the expensive steps run on less data.

Walk it through the seed tables to make it concrete. Say you read the orders table, which is wide and carries many columns, join it to customers, and at the very end keep only the order_id and the customer region. An eager engine reads every column of orders and every column of customers, drags all of them across the expensive join, and discards the unused ones only at the final select. Spark, seeing the whole chain first, knows before it reads a byte that only a handful of columns reach the end. It reads just those from the source, carries only those through the join, and the join itself moves far less data. You wrote the query for readability and Spark ran it lean, and you did nothing to earn that except let Spark see the whole plan.

The same answer, less work

The fill-in below builds a small chain whose order does not match the cheapest execution order, and Spark will fix that for you at planning time. Your job is to complete the two operations that define what the chain produces. Both pieces are things you have already met: keeping only the rows you want, and choosing the columns you return.

> From products, keep only in-stock items (in_stock = 1) and return their name and category, ordered by name. The filter and the column choice are the two operations to supply; Spark will run the filter first regardless of where you place it.

(products
   .___(F.col("in_stock") == 1)
   .___("product_name", "category")
   .orderBy("product_name"))

filter

select

groupBy

agg

Whether you wrote the filter before or after the select, Spark plans the filter to run first, because filtering early means the select touches fewer rows. You wrote the chain for clarity; Spark ran it for speed. That gap between how you write and how Spark runs is what laziness buys you, and the intermediate tier is one long look at the plan that lives in it.

The Action Catalog: What Actually Triggers a Run

Daily Life

Interviews

If actions are the only thing that runs a job, then knowing which calls are actions is what lets you predict your job's behaviour instead of being surprised by it. The list is short and the logic is consistent: an action is any call that needs to produce a concrete result outside the lazy DataFrame world, either a value back in your program or bytes written to storage.

Action	What it produces	Where the result goes
count()	The number of rows	A number, back to the driver
collect()	Every row, as a local list	All data, back to the driver
take(n) / show()	The first n rows	A few rows, back to the driver
write...save()	The full result on disk	Out to storage, from the executors
first() / head()	One row	A single row, back to the driver

Everything not on that list, the filters and selects and joins and groupBys you build your logic from, is a transformation, and transformations never run on their own. A reliable test when you are unsure: ask what the call returns. If it returns another DataFrame, it is a lazy transformation. If it returns a number, a list of rows, or nothing (because it wrote to disk), it is an eager action that just triggered your whole chain.

TIP

In an interview, if asked whether a line runs anything, ask what it returns. "groupBy returns a DataFrame, so it is lazy and runs nothing; count returns an integer, so it is the action that fires the job." That distinction, applied out loud, reads as someone who understands the execution model.

There is a useful asymmetry inside the action list. show and take are bounded actions: they ask for a fixed, small number of rows, and Spark can often satisfy them without running the full chain over the entire dataset, because once it has enough rows to fill the request it can stop. count and collect are unbounded: count must touch every row to know how many there are, and collect must gather all of them. So two actions that both look like quick checks can differ wildly in cost. A take(5) on a billion-row orders table might read only the first partition or two; a count on the same table reads all of it. When you reach for an action just to peek at your data, prefer the bounded ones, because they let Spark do the least work that answers your question.

Trigger it deliberately

The demo below builds a chain and ends in an aggregation. In a notebook, displaying the result is the action that runs it. Read the code as a recipe that stays cold until something asks to see the answer.

(order_items
   .groupBy("product_id")
   .agg(F.sum("quantity").alias("units_sold"))
   .orderBy(F.col("units_sold").desc()))

Notice that nothing in that chain is an action by itself. The groupBy, the agg, and the orderBy are all transformations; they describe an aggregation without performing it. The job only runs because something, the display, the count, the write, finally asked for the rows. Look at the end of a chain to find the action; that is where your job actually begins.

One subtlety catches people: count and a few other actions return a small value even though the chain behind them may be enormous. count returns a single integer, but to produce it Spark still reads, filters, joins, and groups the entire dataset. The smallness of the returned value tells you nothing about the cost of producing it. This is why a stray count, dropped in to check whether a step worked, can quietly cost minutes: it is a full execution of everything above it, even though all you get back is a number. Treat every action as a request to run the whole pipeline, regardless of how small its result looks.

The collect() Trap

Daily Life

Interviews

One action deserves its own section because it is a common way to take down a Spark job: collect. It collects every row of your result and pulls it back to the driver as a local list. On a small result that is fine and useful. On a large one it is a disaster, because the driver is a single machine with a single machine's memory, and you are asking it to hold data that was spread across the whole cluster because it did not fit on one machine.

The failure mode is abrupt and recognisable. The job runs fine across the executors, the final stage completes, and then the driver tries to assemble the full result in its own heap and runs out of memory. The stack trace points at the driver, not the executors, which confuses people who assumed the data problem was out in the cluster. The data was never the problem; pulling all of it to one place was.

1 machine

the driver's memory ceiling

collect is treacherous because it works perfectly in development and fails only in production. On your laptop or against a trimmed-down sample of order_items, the result is small, collect returns instantly, and the code looks correct. Then it ships, runs against the full table that has grown by a thousandfold, and the same line that was fine on ten thousand rows tries to pull a billion onto the driver. The bug was always there; it stayed invisible until the data got big. Distributed-systems bugs often take this shape, and collect is the textbook case: correctness at small scale tells you nothing about behaviour at the scale you actually run.

The fix is almost always to not collect. If you want to inspect the data, use show or take, which pull back only a handful of rows. If you want to keep the result, write it to storage with write, which streams it out from the executors in parallel and never funnels it through the driver. The only time collect is appropriate is when you genuinely need the full result as a local object and you know it is small, a few thousand rows of summary, not the raw data.

⚠collect() (dangerous at scale)

Pulls EVERY row to the driver
Bounded by one machine's memory
OOMs the driver on a big result
Use only on a known-small result

✓show() / take(n) / write (safe)

show/take pull only a few rows
write streams out from executors
Never funnels all data through one node
The right default for inspection or saving

TIP

A senior tell in code review: spotting a collect() on an unbounded result and replacing it with a write or a bounded take. The reviewer who catches "this collect will OOM the driver once the table grows" is reasoning about where the data lands, not just whether the logic is correct.

It helps to put rough numbers on it. Say each executor holds a comfortable few gigabytes of your data and you have fifty of them, so the dataset is well over a hundred gigabytes spread across the cluster. The driver might have eight gigabytes of heap. A collect asks that eight-gigabyte driver to receive a hundred-plus gigabytes of rows. It does not matter that the fifty executors handled the data comfortably; the bottleneck is the one machine you funnelled everything through. The cluster size that made the computation fast is irrelevant to the collect, because collect does not use the cluster to hold the result, it uses the driver alone.

The underlying habit is to always know WHERE a result lands. Transformations keep data out on the executors where it belongs. Most actions bring a small summary back to the driver, which is fine. collect brings everything back, which is the trap. Keep the data distributed until the last possible moment, and when you do pull it in, pull only what one machine can hold.

Re-Execution: The Chain Runs Again Every Time

Daily Life

Interviews

Here is a consequence of laziness that surprises almost everyone, and it is the bridge to the next big topic. A DataFrame remembers how to produce itself, not the data it produces. So when you call two actions on the same chain, Spark runs the entire chain twice, once for each action. It does not quietly remember the result from the first action and reuse it. It recomputes from the original source every single time you ask.

Say you build an expensive chain, a big join followed by an aggregation, and then you call count to see how many rows it has, and later call show to look at a few of them. Those are two actions, so Spark reads the source data and runs the whole expensive join and aggregation twice. If that chain took ten minutes, you just spent twenty. Nothing warned you, because each action looked like a single innocent call.

Why it recomputes instead of remembering

This behaviour follows straight from the definition. A DataFrame is a recipe over its source, so asking for a result means running the recipe. Spark has no reason to assume you want the result kept around, taking up memory, after the action that needed it finishes. So by default it throws the intermediate work away and rebuilds it the next time. For a chain you use once, that is the right call. For a chain you use several times, it is wasteful.

What you do	What Spark does	The cost
One action on the chain	Runs the chain once	Paid once, correct
Two actions on the same chain	Runs the whole chain twice	Paid twice, often wasteful
The same chain reused in a loop	Re-runs it every iteration	Paid N times, usually a bug

The loop case is where this turns from wasteful into genuinely broken. Suppose you build an expensive chain once, then loop over a list of regions, filtering the chain to each region and calling show inside the loop. Every iteration is an action, so Spark re-runs the entire expensive chain from the source for every region in your list. Ten regions means ten full executions of work you thought you did once. The code looks like it computes the base result a single time and slices it cheaply; Spark sees ten independent requests and obliges each one from scratch. This is one of the most common silent performance bugs in real pipelines.

The reason this surprises people coming from SQL is that a database materialises differently. If you create a temporary table or a CTE that the engine spools, you can query it many times and it computes once. A Spark DataFrame behaves more like a view definition, which the database also re-evaluates every time you select from it. So the SQL analogy for a reused Spark chain is a view, not a temp table: convenient to reference, but re-run on every reference unless you deliberately materialise it. Once you hold that analogy, the re-execution stops feeling like a bug in Spark and starts feeling like the predictable behaviour of a thing defined as a recipe rather than stored as data.

The fix is to tell Spark, explicitly, to hold onto a result you intend to reuse. That is what caching does, and a later lesson on memory covers it in full. For now, just stay aware: if you find yourself calling several actions on one expensive chain, you are paying for that chain several times unless you intervene. Recognising the re-execution is the first step; the cache is the cure.

✓Do

Reach for an action only when you actually need a result; each one runs the whole chain.
Use show or take to inspect data, and write to save it, instead of collect.
When you call several actions on one expensive chain, plan to cache the shared result.
Read the end of a chain to find the action; that is the line that triggers the job.

✗Don't

Don't assume a DataFrame caches itself after the first action; by default it recomputes.
Don't collect() an unbounded result; the driver is one machine and will OOM.
Don't expect a transformation to do work on its own; nothing runs until an action.
Don't bury a show() in a loop; each call re-runs the entire pipeline.

❯❯❯PUTTING IT ALL TOGETHER

> You are a data engineer at a streaming service computing a daily report. You build an expensive chain that joins viewing events to a content catalog and aggregates watch time per title, then you call count to log how many titles appeared and write to save the report.

Building the join-and-aggregate chain runs nothing; it is all lazy transformations recording what you asked for.

The count is your first action, and it forces the entire expensive chain to read, join, and aggregate from the source.

The write is a second action, so without intervention Spark runs the whole expensive chain a second time to produce the output.

Recognising that double cost, you would cache the aggregated result before the count so the write reuses it instead of recomputing.

KEY TAKEAWAYS

A transformation is lazy and records work; an action is eager and forces the whole chain to run.

Laziness lets Spark see the whole plan, so it fuses steps, pushes filters down, and prunes columns for free.

An action is any call that returns a value or writes data: count, collect, show, take, write.

collect() pulls every row to the single driver and OOMs it on a large result; use show, take, or write.

Two actions on one chain recompute it twice; reusing a chain means caching the shared result.

You wrote a recipe. Nothing cooks until you call an action.

Category: Spark
Difficulty: beginner
Duration: 13 minutes
Challenges: 3 hands-on challenges

Topics covered: Nothing Runs Until an Action, Why Laziness Makes Spark Fast, The Action Catalog: What Actually Triggers a Run, The collect() Trap, Re-Execution: The Chain Runs Again Every Time

Lesson Sections

Nothing Runs Until an Action
In a database, when you press run, the query runs. In Spark, when you write a transformation, nothing happens. You can chain a filter onto a select onto a join onto a groupBy, building a description ten lines long, and not a single byte of data has moved. Spark has simply written down what you asked for. The work only begins when you call an action, a special kind of method that demands an actual result: a count, the rows themselves, or a write to disk. This split is called lazy evaluation, and
Why Laziness Makes Spark Fast
Laziness can feel like an annoyance when you are debugging and an error only shows up three lines later, at the action, instead of where you wrote the typo. But it is the reason Spark is fast, and the designers chose it deliberately. Because Spark sees your entire chain before it runs anything, it can look at the whole plan and rearrange it for efficiency, the same way a good query optimizer does for SQL. Think about what an eager system would have to do. If every transformation ran the instant
The Action Catalog: What Actually Triggers a Run
If actions are the only thing that runs a job, then knowing which calls are actions is what lets you predict your job's behaviour instead of being surprised by it. The list is short and the logic is consistent: an action is any call that needs to produce a concrete result outside the lazy DataFrame world, either a value back in your program or bytes written to storage. Everything not on that list, the filters and selects and joins and groupBys you build your logic from, is a transformation, and
The collect() Trap
One action deserves its own section because it is a common way to take down a Spark job: collect. It collects every row of your result and pulls it back to the driver as a local list. On a small result that is fine and useful. On a large one it is a disaster, because the driver is a single machine with a single machine's memory, and you are asking it to hold data that was spread across the whole cluster because it did not fit on one machine. The failure mode is abrupt and recognisable. The job r
Re-Execution: The Chain Runs Again Every Time
Here is a consequence of laziness that surprises almost everyone, and it is the bridge to the next big topic. A DataFrame remembers how to produce itself, not the data it produces. So when you call two actions on the same chain, Spark runs the entire chain twice, once for each action. It does not quietly remember the result from the first action and reuse it. It recomputes from the original source every single time you ask. Say you build an expensive chain, a big join followed by an aggregation,