Lineage as Fault Tolerance

Run a job across thousands of tasks for several hours and at least one machine will die before it finishes. Count on it. The reason a lost node does not restart the whole job from zero is lineage: Spark's record of how every partition was built, which lets it recompute just the lost piece. You met lineage as the backward-read of the dependency graph. Here we make it real, covering how recovery actually works, why long lineages get expensive, and the tools that cut a lineage before it cuts you. Underneath it all is one idea: Spark stores a partition not as data it trusts to survive, but as a recipe it can always rebuild.

Lineage-Based Recovery: Rebuild, Don't Re-Read

Daily Life

Interviews

When an executor dies mid-job, it takes its partitions of in-flight data with it. A naive system would have to start over, because that data is gone. Spark does not, and the reason is that it never treated those partitions as precious irreplaceable data in the first place. It treated them as the output of a known recipe. The lineage is that recipe, recorded per partition, and recovery is just running the recipe again for the partitions that were lost.

Say a partition was produced by reading a chunk of the source, filtering it, and mapping over it. The lineage records just that: this partition came from that source chunk via those narrow operations. When the executor holding it dies, Spark looks up the lineage, sees the recipe, re-reads that one source chunk, and re-applies the filter and map to rebuild the single lost partition. The other thousands of partitions are untouched. The job loses a little time recomputing one piece, not all of it.

Recompute the lost piece only

Lineage lets Spark rebuild just the partitions an executor took with it, not restart the whole job.

The recipe, not the result, is durable

A partition is defined by how it was made, so it can always be remade from its inputs.

Recovery cost depends on the recipe

A short, narrow lineage is cheap to replay; a long one with wide dependencies is expensive.

This is why Spark is fault tolerant without writing every intermediate result to reliable storage, which would be far too slow. It trades storage for recomputation: keep the cheap recipe, throw away the intermediate data, and rebuild on demand if a failure forces it. For narrow lineages this trade is almost free. The rest of this tier is about when it stops being free, and what you do about it.

A chain whose lineage we can trace

Run the chain below and picture the lineage Spark records for its output partitions. The join is a wide dependency, so any partition produced after it carries a lineage that fans back into many input partitions on both sides; the filter and aggregation above it add narrow and wide steps on top. If an executor died holding one of these output partitions, the lineage is precisely the recipe Spark would replay to rebuild it.

(order_items
   .join(products, "product_id")
   .filter(F.col("in_stock") == 1)
   .groupBy("category")
   .agg(F.sum("quantity").alias("units"))
   .orderBy(F.col("units").desc()))

That output sits behind two wide dependencies, the join and the groupBy, so recovering one of its partitions is not the cheap narrow replay; it pulls in a fan-in of upstream data and re-does shuffles. This is the kind of lineage that the next sections argue you should consider checkpointing, because replaying it after a failure is expensive enough to justify storing the result instead.

One detail that makes recovery less expensive than it first sounds: Spark does not always replay from the original source. The output of every shuffle is written to local disk on the executors that produced it, and those shuffle files survive the task that wrote them. So when a partition on the far side of a shuffle is lost, Spark can often rebuild it by re-reading the persisted shuffle output rather than rerunning everything upstream of the shuffle. The lineage replay stops at the nearest surviving materialised data, which is frequently a shuffle boundary, not the source. This is why a failed task usually costs you one stage of rework and not the whole job. The catch is that if the executor holding those shuffle files is the one that died, the files die with it, and then the replay does have to reach further back.

The Recompute Cost: When Lineage Gets Expensive

Daily Life

Interviews

Recovery is cheap when the lineage is short and narrow, because replaying it touches little data and moves none across the network. It gets expensive in two ways, and once you can recognise them you can engineer for the failure instead of trusting fault tolerance blindly.

The first is length. Every transformation you chain adds a step to the lineage of the partitions it produces. A pipeline with hundreds of transformations, common in iterative algorithms that loop and refine, builds a very long lineage, and recovering a lost partition near the end means replaying that entire long recipe from the source. The recovery is correct, but it can take almost as long as the original computation for that partition.

The second, and worse, is width. A narrow dependency replays cheaply: one output partition needs one input partition, so recomputing it reads one upstream piece. A wide dependency is different. Because a shuffled partition was built from many input partitions across the cluster, recomputing it means re-reading and re-shuffling all of those inputs. A lost partition on the far side of a shuffle can force the recomputation of a large fan-in of upstream data, which is dramatically more expensive than replaying a narrow chain.

Lineage shape	What recovery costs	The risk
Short and narrow	Replay one short recipe per lost partition	Cheap; fault tolerance is free
Very long (iterative)	Replay the entire chain from source	Recovery rivals the original run
Wide (post-shuffle)	Re-read and re-shuffle many input partitions	One lost piece pulls in a large fan-in

There is a compounding effect that makes this sharper in practice. A long pipeline tends to accumulate wide operations as it grows, so the two costs stack: the lineage is both long AND punctuated by shuffles, and a failure late in the job can force a replay that walks back through several expensive reorganisations. This is why iterative machine-learning jobs, which loop the same wide-heavy chain dozens of times, were the original motivation for checkpointing in Spark. Without it, a single lost executor near iteration forty could trigger a replay reaching all the way back toward iteration one.

So fault tolerance is not actually free. It stays cheap only while your lineage is short and narrow. When a pipeline grows long or leans heavily on wide operations, the cost of a single failure climbs, and at some point the right move is to stop trusting recomputation and materialise a durable checkpoint. The next sections are about knowing where that point is.

There is also a quieter failure mode than slowness, and it bites iterative jobs hardest: the lineage itself can grow large enough to become a problem on the driver. Every transformation adds nodes to the logical plan that the driver holds in memory and reasons about. In a loop that extends the same DataFrame thousands of times, the plan can balloon until planning each new action grows slow, or in extreme cases the recursive plan walk overflows the stack. So a very long lineage costs you twice: expensive recovery if a failure happens, and expensive planning even if none does. Both pressures point at the same remedy, periodically cutting the lineage so neither the recovery path nor the plan tree is allowed to grow without bound.

Checkpointing: Cutting the Lineage

Daily Life

Interviews

Checkpointing is the deliberate act of truncating a lineage. When you checkpoint a DataFrame, Spark computes it and writes the result to reliable storage, then discards the lineage behind it. From that point on, the checkpointed data is a new starting point with no history: if a downstream partition is lost, Spark recovers it by reading the checkpoint, not by replaying the entire chain that produced it. You have traded the recompute cost for a one-time write cost.

This matters most for the two expensive cases from the last section. For a very long iterative lineage, checkpointing every so often caps how far back any recovery has to replay: instead of going to the original source, it goes to the most recent checkpoint. For a pipeline that has already done expensive wide work, checkpointing after the shuffle means a later failure does not force the shuffle to be redone. You are saving the expensive-to-rebuild result so you never have to rebuild it.

	Eager checkpoint	Lazy checkpoint
When it writes	Immediately, as a separate action	At the next action that needs the data
Extra pass over data	Yes, one dedicated computation	Folded into the next action's run
When to prefer	You want the cut materialised now	You want to avoid a separate pass

Checkpointing comes in eager and lazy forms. Eager checkpointing triggers its own computation immediately, paying for an extra pass over the data right then. Lazy checkpointing waits and folds the write into the next action that runs the chain, avoiding a dedicated pass but delaying when the cut actually exists. The choice is a small optimisation; the important decision is the earlier one, whether to checkpoint at all, and that comes down to whether a recovery would be expensive enough to justify a durable write now.

TIP

How to frame it in an interview: checkpointing converts an expensive recomputation into a cheap read by paying a write up front. You checkpoint when the lineage behind a result has gotten long enough, or wide enough, that replaying it after a failure would cost more than storing the result would. It is a deliberate cost trade, not a default you sprinkle everywhere.

The expensive result worth materialising

The aggregation below sits behind a wide groupBy, so its lineage is the expensive kind: replaying a lost partition re-does the shuffle. This is the natural candidate for a checkpoint in a long pipeline, the result you would store so a later failure reads it instead of rebuilding it. Complete the wide operation and the aggregation that make this result costly to recompute.

> From order_items, compute total revenue (quantity times unit_price) per product_id, highest first. The grouping is the wide step whose result you might checkpoint; the aggregation is what makes it worth materialising.

(order_items
   .___("product_id")
   .agg(F.___(F.col("quantity") * F.col("unit_price")).alias("revenue"))
   .orderBy(F.col("revenue").desc()))

groupBy

sum

count

avg

That groupBy is a wide dependency, so the revenue result it produces is the kind you would checkpoint at a cut point in a longer job: storing it once turns any later recovery into a read of durable storage rather than a replay of the shuffle. The decision always takes this shape: look at what sits behind a result, and if it is a long or wide lineage, materialise it before it costs you twice.

A word on where the durable storage actually is, because it is the difference between a real checkpoint and a fragile one. A true checkpoint writes to reliable, replicated storage like HDFS or a cloud object store, somewhere that survives any single machine failing. That is what lets Spark throw the lineage away safely: the stored result cannot vanish with one dead executor the way local shuffle files can. There is a local-disk variant that is faster to write but does not have this guarantee, and treating it as a real cut is a trap, because if the disk holding it is lost you have neither the result nor the lineage you discarded. When you checkpoint to break a lineage you depend on, write to the replicated store; the whole point is durability, and durability is the part you cannot fake with a faster local write.

Cache vs Checkpoint vs Persist: Which Solves What

Daily Life

Interviews

Three operations get confused because they all hold onto a result, though they solve different problems. The confusion is understandable: cache and persist both keep a result around for reuse, and checkpoint also writes a result to storage. What separates them is what they are FOR, and specifically whether they cut the lineage.

cache and persist are about speed of reuse. They keep a computed result in memory, or memory and disk, so that the next action reusing it does not recompute the chain, exactly the re-execution problem from the beginner tier. The key point is that they do NOT cut the lineage. A cached result is held for convenience, and if the cached blocks are evicted under memory pressure or lost with a dead executor, Spark falls back to the lineage and recomputes them. Cache speeds up the happy path; it does not make a result durable.

•cache / persist (speed of reuse)

Keeps the result in memory/disk
Avoids recomputing on reuse
Does NOT cut the lineage
Evicted or lost blocks recompute from lineage

•checkpoint (recovery cost)

Writes the result to reliable storage
Cuts the lineage behind it
Survives executor death
Recovery reads the checkpoint, no replay

checkpoint is about recovery cost. It writes to reliable storage and truncates the lineage, so the result survives executor death and a later failure reads it back instead of replaying history. The price is that checkpointing always pays a full write to durable storage, whereas caching to memory is cheaper but weaker. The clean mental model: persist when you will reuse a result several times and want it fast; checkpoint when the lineage behind a result has grown expensive enough that you want a failure to read it rather than rebuild it.

In practice the two compose. A common pattern in long iterative jobs is to both cache and checkpoint at a cut point: cache so the immediate reuse is fast from memory, and checkpoint so the lineage is truncated and a failure does not replay the whole history. They are not alternatives so much as tools for two different costs, the cost of reuse and the cost of recovery, and a mature pipeline reasons about both.

Interviewers probe the ordering of those two operations, and getting it right shows you understand why each exists. If you checkpoint without caching first, the checkpoint write runs the chain to produce the result, then the next action that reads the checkpointed data runs again from the checkpoint, which is fine but means the expensive upstream chain ran for the checkpoint. If you cache first and then checkpoint, the cached in-memory copy can serve both the checkpoint write and the subsequent reuse, so the expensive chain runs once and feeds both. The principle underneath is the same one from the beginner tier: a chain recomputes every time it is triggered unless something holds the result, so when you are deliberately materialising, arrange the calls so the costly part runs the fewest times.

Determinism: Why Recompute-Based Recovery Can Break

Daily Life

Interviews

Lineage-based recovery rests on an assumption so quiet it is easy to miss, and violating it produces some of the most baffling bugs in Spark: recomputation assumes that replaying a transformation produces the same result it did the first time. If a transformation is non-deterministic, that assumption fails, and recovery can silently produce different data than the partition it is meant to replace.

Take a transformation that assigns a random value, or one that depends on the current time, or one whose result depends on the order rows happen to arrive in. The first time it runs, it produces one set of values. If a partition is lost and Spark replays the lineage to rebuild it, the non-deterministic step runs again and produces DIFFERENT values. Now the rebuilt partition does not match what the rest of the job already consumed from the original. The job does not crash; it produces subtly wrong results, which is far worse than a crash because nothing tells you.

Non-deterministic source	What breaks on replay	The fix
Random number generation	Different values on recompute	Seed it, or materialise before reuse
Current time / timestamps	A different 'now' each replay	Capture the time once, pass it as a value
Order-dependent logic	Different result if input order shifts	Make the logic order-independent

The defences are straightforward once you know to look for them. Seed any randomness so it reproduces. Capture a single timestamp at the start and thread it through as a fixed value rather than calling for the current time inside a transformation. And where a result genuinely cannot be made deterministic, materialise it, with a checkpoint or a write, so it becomes stored data that is read back rather than a recipe that is replayed. Materialising removes the recompute, and no recompute means no chance of a divergent replay.

A subtle non-determinism that catches even experienced engineers hides in monotonically increasing ids and anything that depends on partition order. An id assigned by row position is stable only as long as the partitions and their order are stable, and a replay that rebuilds one partition can land rows in a different arrangement, handing the same logical row a different id. The danger is that this looks deterministic in testing, because nothing fails until a real recovery happens in production and a handful of ids silently shift. The rule of thumb: if a column's value would change when the data is split or ordered differently, it is non-deterministic for the purposes of recovery, even if it never calls a random function. Treat any such column as a result to materialise, not a recipe to replay.

✓Do

Treat fault tolerance as cheap only while lineage stays short and narrow.
Checkpoint to cut a lineage that has grown long or sits past expensive shuffles.
Use cache/persist for reuse speed and checkpoint for recovery cost; compose them when needed.
Seed randomness and capture timestamps once so a replay reproduces the original result.

✗Don't

Don't assume recovery is free; a wide or very long lineage is expensive to replay.
Don't expect cache to survive executor death; it does not cut the lineage.
Don't leave non-deterministic transforms in a chain that may be recomputed; replays diverge.
Don't checkpoint everywhere; it pays a full durable write, so use it where recovery would cost more.

❯❯❯PUTTING IT ALL TOGETHER

> You own a long iterative Spark job that refines a model over many passes, and it occasionally loses an executor near the end of a multi-hour run. Recovery has been taking almost as long as the original computation, and one run produced numbers that did not reproduce.

The slow recovery is the long-lineage cost: a partition lost near the end replays the entire iterative chain from source.

You checkpoint periodically so any recovery only has to replay back to the most recent checkpoint, not to the beginning.

You also cache the working set so each iteration reuses it from memory, composing speed-of-reuse with the lineage cut.

The non-reproducing run points at a non-deterministic step; you seed the randomness and capture the run timestamp once so a replay rebuilds identical data.

KEY TAKEAWAYS

Lineage records how each partition was built, so Spark recomputes a lost one instead of restarting.

Recovery is cheap for short narrow lineages and expensive for long ones or wide post-shuffle dependencies.

Checkpointing writes a result to reliable storage and cuts the lineage, trading recompute cost for a write.

cache/persist speed up reuse but do not cut lineage; checkpoint cuts lineage and survives executor death.

Non-deterministic transforms break recompute-based recovery; seed randomness, fix timestamps, or materialise.

A partition is never data Spark trusts to survive. It is a recipe Spark can rebuild.

Category: Spark
Difficulty: advanced
Duration: 15 minutes
Challenges: 2 hands-on challenges

Topics covered: Lineage-Based Recovery: Rebuild, Don't Re-Read, The Recompute Cost: When Lineage Gets Expensive, Checkpointing: Cutting the Lineage, Cache vs Checkpoint vs Persist: Which Solves What, Determinism: Why Recompute-Based Recovery Can Break

Lesson Sections

Lineage-Based Recovery: Rebuild, Don't Re-Read
When an executor dies mid-job, it takes its partitions of in-flight data with it. A naive system would have to start over, because that data is gone. Spark does not, and the reason is that it never treated those partitions as precious irreplaceable data in the first place. It treated them as the output of a known recipe. The lineage is that recipe, recorded per partition, and recovery is just running the recipe again for the partitions that were lost. Say a partition was produced by reading a ch
The Recompute Cost: When Lineage Gets Expensive
Recovery is cheap when the lineage is short and narrow, because replaying it touches little data and moves none across the network. It gets expensive in two ways, and once you can recognise them you can engineer for the failure instead of trusting fault tolerance blindly. The first is length. Every transformation you chain adds a step to the lineage of the partitions it produces. A pipeline with hundreds of transformations, common in iterative algorithms that loop and refine, builds a very long
Checkpointing: Cutting the Lineage
Checkpointing is the deliberate act of truncating a lineage. When you checkpoint a DataFrame, Spark computes it and writes the result to reliable storage, then discards the lineage behind it. From that point on, the checkpointed data is a new starting point with no history: if a downstream partition is lost, Spark recovers it by reading the checkpoint, not by replaying the entire chain that produced it. You have traded the recompute cost for a one-time write cost. This matters most for the two e
Cache vs Checkpoint vs Persist: Which Solves What
Three operations get confused because they all hold onto a result, though they solve different problems. The confusion is understandable: cache and persist both keep a result around for reuse, and checkpoint also writes a result to storage. What separates them is what they are FOR, and specifically whether they cut the lineage. cache and persist are about speed of reuse. They keep a computed result in memory, or memory and disk, so that the next action reusing it does not recompute the chain, ex
Determinism: Why Recompute-Based Recovery Can Break
Lineage-based recovery rests on an assumption so quiet it is easy to miss, and violating it produces some of the most baffling bugs in Spark: recomputation assumes that replaying a transformation produces the same result it did the first time. If a transformation is non-deterministic, that assumption fails, and recovery can silently produce different data than the partition it is meant to replace. Take a transformation that assigns a random value, or one that depends on the current time, or one