Tungsten: Performance as a Hardware Problem

Catalyst decides what plan to run; Tungsten decides how fast that plan runs on real hardware. The engineers who squeeze the last factor of two out of Spark spend their time on CPU caches and memory layout once the query logic is settled. You can already read a physical plan and trust the optimizer's choices. This tier goes beneath the plan to the execution engine: off-heap memory that dodges garbage collection, a binary row format built for the CPU, and code generation that fuses a whole stage into one compiled loop. Once the plan is fixed, speed becomes a hardware problem, and Tungsten is how Spark solves it.

Off-Heap Memory: Dodging the Garbage Collector

Daily Life

Interviews

Spark runs on the JVM, and the JVM's garbage collection is a convenience at small scale and a liability at Spark's. Every object you create on the JVM heap is tracked and eventually collected, and when you have billions of small row objects, the garbage collector spends enormous effort tracking and freeing them. GC pauses can freeze an executor for seconds, and across a cluster those pauses add up to a serious tax on throughput. So Tungsten starts by getting the data out from under the garbage collector.

It does this with off-heap memory, allocated and managed directly through low-level operations rather than as ordinary JVM objects. Data stored off-heap is invisible to the garbage collector: it is not a tracked object, so it is never scanned or collected, and it imposes no GC overhead. Spark manages this memory itself, allocating and freeing it explicitly, the way a C program would. The result is that the bulk of Spark's data lives in a region the JVM never has to clean up, which removes GC as a bottleneck for the data-heavy parts of a job.

GC is a scale tax

Billions of small row objects on the JVM heap make the garbage collector a major cost; pauses freeze executors.

Off-heap is invisible to GC

Tungsten stores data in memory the JVM does not track, so it is never scanned or collected.

Spark manages it explicitly

Allocation and freeing are handled by Spark, not the JVM, trading convenience for control and speed.

This ties back to the memory-management lesson and to a real diagnostic signal. An executor spending a large fraction of its time in garbage collection, more than around ten percent being the usual alarm, has high on-heap object pressure, and part of Tungsten's job is to keep that pressure low by holding data off-heap. The off-heap design is why a well-configured Spark job processes far more data per executor than naive JVM object handling would allow: it sidestepped the collector that would otherwise have throttled it.

It helps to be concrete about why billions of small objects are so toxic to a generational collector. The JVM collects young objects cheaply but pays dearly for objects that survive long enough to be promoted to the old generation, because collecting that region can require scanning a large live set. A Spark stage that holds millions of row objects in a hash aggregate keeps them alive across the whole stage, so they get promoted, and then the old-generation collection that eventually runs has to walk all of them. By keeping that data in off-heap UnsafeRows, Tungsten leaves the young generation nearly empty of row objects, so the collector has almost nothing to promote and the expensive old-generation pauses largely stop happening. A faster collector would not help; giving the collector far less to do is the fix.

This speed costs you something, and the cost is control in place of safety. Off-heap memory is not protected by the JVM's automatic management, so Spark has to track and free it correctly itself, and misconfiguration can lead to memory that the container counts against you even though the heap looks fine. This is the source of some of the more confusing out-of-memory errors, where the heap has room but the container is killed for off-heap overhead. Tungsten's performance is real, but it moves some memory responsibility from the JVM onto Spark's own bookkeeping.

UnsafeRow: A Binary Format Built for Speed

Daily Life

Interviews

Storing data off-heap only helps if the data is stored compactly, and Tungsten's row format, the UnsafeRow, is designed for exactly that. A normal JVM representation of a row is a tree of objects: an object for the row, objects for its fields, pointers between them, plus the per-object header overhead the JVM adds to everything. For a billion rows, that overhead and pointer-chasing is enormous. UnsafeRow throws it away.

An UnsafeRow stores a row as a single contiguous block of bytes in a fixed binary layout: the fields packed together in a known order, with offsets rather than pointers, and no per-object headers. This is dramatically more compact than the object representation, often several times smaller, which means more rows fit in memory and less data moves during a shuffle. It is also cheaper to serialize, because the on-wire format is close to the in-memory format, so encoding a row for a shuffle is nearly a memory copy instead of a full object-graph traversal.

⚠JVM object row (the old way)

A tree of objects and pointers
Per-object header overhead on everything
Pointer-chasing to read a field
Expensive to serialize for a shuffle

•Tungsten UnsafeRow

One contiguous block of bytes
No per-object headers; packed fields
Read a field by a known offset
Serialization is close to a memory copy

This is also why Tungsten is faster than the older Kryo and Java serialization that RDDs relied on. Those serializers had to turn a graph of JVM objects into bytes and back, which is inherently expensive. The UnsafeRow is already a flat byte layout, so moving it is cheap, and the engine can operate on the bytes directly without rehydrating objects. When you shuffle DataFrames, you are moving compact UnsafeRows; when you shuffled RDDs of objects, you were moving serialized object graphs, which is a large part of why DataFrames outperform RDDs even for the same logic.

The format pays off most where data is densest: aggregations and joins that process huge numbers of rows. Because a row is a flat byte block, the engine can compute a hash of a join key or compare two rows by reading bytes at known offsets, with no object allocation per row. Eliminating per-row allocation is half the speed story, because allocation is what created the GC pressure in the first place. The compact format and the off-heap storage are two halves of the same idea: keep the data small, flat, and out of the collector's way.

The layout itself is worth a closer look. An UnsafeRow splits into a fixed-length region and a variable-length region. Every field gets a slot in the fixed region: numbers and other fixed-width types sit there directly, while a variable-width value like a string stores an offset and a length that point into the variable region appended after the fixed part. A small null-tracking bitset at the front records which fields are null. The payoff is that reading the third integer of a row is arithmetic, jump to a known byte offset and read eight bytes, with no traversal and no decoding. That predictable addressing is what lets the generated code in the next section treat a row as raw memory and stay fast.

Whole-Stage Code Generation: One Loop for the Stage

Daily Life

Interviews

The third pillar of Tungsten is the one whose fingerprints you already saw in the physical plan: whole-stage code generation. Without it, executing a plan means walking a tree of operators and calling a generic method for each operator on each row, with the overhead of a virtual function call and intermediate results passed between operators. For a billion rows through five operators, that is five billion method calls, and the call overhead alone dominates the actual work.

Whole-stage code generation collapses this. For a run of operators in a single stage, Tungsten generates a single piece of Java code that does the work of all of them fused together, then compiles it on the fly. The filter, the projection, the partial aggregation become one tight loop over the rows, with no per-operator method calls and no intermediate results materialised between steps. A row enters the loop, flows through all the fused logic, and exits, all in code the JIT compiler can optimise like hand-written code. This is why the physical plan marks fused operators with the *(n) notation: it is telling you which operators became one generated function.

five operators, one loop

what codegen fuses

The performance gap is large. Fused, compiled code can run an order of magnitude faster than the interpreted operator-tree walk it replaces, because it removes the function-call overhead, keeps intermediate values in CPU registers instead of memory, and lets the JIT compiler apply its full toolkit to a tight loop. This is why modern Spark runs the same DataFrame query so much faster than early Spark did: the same logical plan now runs as compiled, fused code instead of an interpreted tree.

The model this replaced is called the Volcano or iterator model. In that model every operator implements a next method, and producing one output row means the top operator calls next on its child, which calls next on its child, all the way down to the scan, then the rows bubble back up through a virtual call at each level. It is elegant and modular, and for a billion rows through five operators it costs five billion virtual calls plus the boxing of values between stages. Whole-stage codegen keeps the same operator semantics but throws away the per-row call structure, inlining all five operators into one generated loop. You get the modularity of the operator tree at authoring time and the speed of hand-written code at runtime.

Where fusion happens, and where it stops

The query below is a narrow chain followed by an aggregation. The narrow operations fuse into one generated loop within their stage, which is what the codegen markers would show. Run it and picture the filter and the column computation running as a single compiled pass rather than separate interpreted steps.

(order_items
   .filter(F.col("unit_price") > 10)
   .withColumn("line_total", F.col("quantity") * F.col("unit_price"))
   .groupBy("product_id")
   .agg(F.sum("line_total").alias("revenue"))
   .orderBy(F.col("revenue").desc()))

The filter and the withColumn fuse into one generated loop; the aggregation after the shuffle is its own fused stage. Where fusion stops is informative: an opaque Python UDF or an RDD step cannot be generated into the loop, so it breaks the fusion and forces Spark back to the slower interpreted path around it. That is the deeper reason this lesson keeps warning against opaque functions: they block Catalyst's reasoning and Tungsten's code generation alike.

Cache Locality: Why the Flat Layout Is Fast

Daily Life

Interviews

One more reason Tungsten's design is fast operates a level below main memory: the CPU cache. Modern processors run far faster than main memory, so they keep recently-used data in small, very fast caches. Code runs fast when the data it needs is already in cache and slow when the CPU has to wait for main memory. How data is laid out in memory determines how often the CPU finds what it needs in cache, and Tungsten's flat binary layout is built to find it there.

JVM objects make a sharp contrast. A row represented as a tree of objects scatters its fields across the heap, connected by pointers, so reading a few fields of many rows means chasing pointers all over memory, and each pointer chase risks a cache miss that stalls the CPU. An UnsafeRow packs the fields contiguously, so the fields a computation needs sit next to each other and arrive in cache together. Processing a column across many rows reads a predictable, sequential stretch of memory, the access pattern caches are built to accelerate.

	JVM object rows	Tungsten flat rows
Memory layout	Scattered objects and pointers	Contiguous packed bytes
Reading a field	Pointer chase, risks cache miss	Sequential read, cache friendly
CPU behaviour	Stalls waiting on memory	Stays fed from fast cache
Net effect	Memory-bound, slower	Cache-efficient, faster

This cache friendliness compounds with the code generation. The fused, compiled loop processes rows in a tight sequence, and because the rows are laid out contiguously, the data the loop needs streams into cache just ahead of when the loop needs it. The compiled code and the flat layout are designed together: the loop is fast because it is compiled, and it stays fast because the data it reads is cache-resident. Each half depends on the other to reach full speed.

A hardware mechanism called prefetching turns the sequential layout into extra speed at no cost. Modern CPUs watch the addresses a loop touches, and when they detect a steady forward stride they fetch the next cache lines before the code asks for them, so the data is waiting in cache by the time the loop reaches it. A contiguous run of UnsafeRows is exactly the predictable stride the prefetcher is built to recognise, so processing millions of rows rarely stalls on memory. A tree of scattered objects defeats the prefetcher entirely: each pointer leads somewhere unpredictable, the hardware cannot guess the next address, and the CPU stalls on a cache miss again and again. Beyond fitting in cache, the flat layout makes the cache fill itself ahead of the work.

Columnar caching extends the same idea one step further. When you call cache on a DataFrame, Spark does not store it as rows; it stores it in a compressed columnar form in memory, so a later scan that needs only category and price reads two tight columns instead of stepping over whole rows. Columns of like-typed values also compress well and stream through the CPU in exactly the sequential pattern the cache rewards. So the cache-locality advantage that the flat row layout gives on a single pass is amplified when you persist data and read it many times: each read gets the cache-friendly access pattern again, which is part of why caching a reused intermediate is so effective.

The upshot is that you never tune cache locality directly; you get it by staying in the DataFrame world so Tungsten's format and codegen apply. The declarative, relational style is cleaner to write, and it also unlocks a whole stack of hardware-level optimizations, from off-heap storage to compact rows to code generation to cache locality, that opaque RDD code and UDFs cannot reach. Staying declarative is how you collect all of it.

When the Optimizer Guesses Wrong

Daily Life

Interviews

For all its sophistication, the optimizer can produce a bad plan, and knowing when and why matters because the fix usually means giving it better information or an explicit hint rather than fighting it. Almost every optimizer mistake traces back to the same root: a decision made from an estimate that turned out to be wrong.

The most common cause is stale or missing statistics, the issue the intermediate tier introduced. The optimizer chooses a join strategy from estimated sizes, so if a table grew tenfold since it was last analyzed, Catalyst plans for the old size and may choose a sort-merge join where a broadcast would now be wrong, or worse, try to broadcast a side that is no longer small and overwhelm memory. Bad cardinality estimates, underestimating how many rows a filter or join will produce, lead to undersized shuffles and spill. The plan looks reasonable; it was built for a reality that no longer holds.

Data skew is the failure mode that survives even perfect statistics, and it is worth recognising on its own. Suppose order_items is joined on a key where one value, a placeholder product_id or a single dominant customer, owns a huge share of the rows. Average statistics say the join is balanced, and the plan is technically correct, but at runtime one partition receives that whole skewed key and its task runs for minutes while every other task finished in seconds. The signature in the UI is unmistakable: one straggler task with a far larger input than its peers. Better statistics will not help, because the totals were never wrong; you fix it by splitting the hot key, often with a salting technique or by letting adaptive execution split skewed partitions automatically.

The optimizer mistake	The usual cause	The fix
Wrong join strategy	Stale or missing table statistics	ANALYZE TABLE, or a broadcast hint
Broadcast that is too large	Underestimated side size	Refresh stats; lower the broadcast threshold
Undersized shuffle, spill	Bad cardinality estimate	Size shuffle partitions; refresh stats
Fusion broke unexpectedly	An opaque UDF or RDD step	Replace with a built-in or SQL expression

The defences form a hierarchy. First, keep statistics fresh, because good information prevents most bad decisions at the source. Second, when you know something the optimizer cannot, use a hint: a broadcast hint tells Catalyst to broadcast a side regardless of its size estimate, which is the right tool when you know a side is small but the statistics do not. Third, where adaptive query execution is available, it re-decides some of these choices at runtime using real sizes observed mid-query, catching estimate errors the static plan made. And throughout, keep your chain declarative so codegen and the optimizer keep working at all.

Override a bad guess with a broadcast hint

When you know a side is small but the statistics do not, you override the optimizer with a broadcast hint, forcing the strategy you know is right. The fill-in below joins order_items to products and hints that products should be broadcast, regardless of what the stale statistics estimate. Supply the broadcast wrapper and the aggregation.

> Join order_items to products, forcing a broadcast of the small products side so a stale size estimate cannot push Catalyst into a sort-merge, and return total revenue (quantity times unit_price) per category, highest first.

(order_items
   .join(___(products), "product_id")
   .groupBy("category")
   .agg(F.___(F.col("quantity") * F.col("unit_price")).alias("revenue"))
   .orderBy(F.col("revenue").desc()))

broadcast

sum

repartition

count

The broadcast hint tells Catalyst what you know and its statistics do not: that products is small enough to ship everywhere. It corrects the optimizer's information without touching its mechanics, and it reverses itself the moment fresh statistics make the hint unnecessary.

Treat the optimizer as a powerful collaborator that occasionally needs correction, neither an oracle to trust blindly nor a nuisance to override everywhere. You read the plan to see its decisions, you recognise the signature of a bad guess, a sort-merge where you expected a broadcast, a spill where you expected none, and you correct the input to its reasoning instead of abandoning it. The engineer who works with Spark rather than fighting it knows that the optimizer's mistakes are almost always information problems, and fixes the information.

✓Do

Stay in the DataFrame API so off-heap storage, UnsafeRow, codegen, and cache locality all apply.
Keep table statistics fresh so the optimizer's size-based choices are correct.
Use a broadcast hint when you know a side is small but the statistics do not.
Read the plan for the signature of a bad guess: an unexpected sort-merge or a surprise spill.

✗Don't

Don't break Tungsten codegen with an opaque UDF or RDD step inside a hot stage.
Don't trust a plan built on stale statistics; refresh them before blaming the optimizer.
Don't fight a bad plan by hand-tuning; fix the information (stats, hints) it reasons from.
Don't ignore high GC time; it signals on-heap pressure Tungsten's off-heap design is meant to relieve.

❯❯❯PUTTING IT ALL TOGETHER

> A job that used to be fast now spends a third of its time in garbage collection and has slowed sharply, and the physical plan shows a Python UDF in the middle of what was a fully code-generated stage. The tables were also last analyzed before the data tripled.

The high GC time points at on-heap object pressure, the exact problem Tungsten's off-heap UnsafeRow design exists to relieve.

The Python UDF in the hot stage breaks whole-stage code generation, forcing the slower interpreted path and the per-row JVM-to-Python crossing.

You replace the UDF with a built-in or SQL expression so the stage fuses back into one compiled loop and stays in the fast format.

You run ANALYZE TABLE so the tripled data's real size drives the join strategy, and confirm in explain that codegen is unbroken and the join is right.

KEY TAKEAWAYS

Tungsten stores data off-heap to dodge the garbage collector that billions of row objects would overwhelm.

The UnsafeRow packs a row into one flat byte block: compact, cheap to serialize, faster than Kryo or Java.

Whole-stage code generation fuses a stage's operators into one compiled loop, an order of magnitude faster.

The flat layout is CPU-cache friendly and compounds with codegen; the declarative API is what turns it on.

The optimizer guesses wrong from stale stats or opaque functions; fix the information with ANALYZE, hints, and built-ins.

Once the plan is chosen, speed is about memory, the CPU cache, and compiled loops.

Category: Spark
Difficulty: advanced
Duration: 15 minutes
Challenges: 2 hands-on challenges

Topics covered: Off-Heap Memory: Dodging the Garbage Collector, UnsafeRow: A Binary Format Built for Speed, Whole-Stage Code Generation: One Loop for the Stage, Cache Locality: Why the Flat Layout Is Fast, When the Optimizer Guesses Wrong

Lesson Sections

Off-Heap Memory: Dodging the Garbage Collector
Spark runs on the JVM, and the JVM's garbage collection is a convenience at small scale and a liability at Spark's. Every object you create on the JVM heap is tracked and eventually collected, and when you have billions of small row objects, the garbage collector spends enormous effort tracking and freeing them. GC pauses can freeze an executor for seconds, and across a cluster those pauses add up to a serious tax on throughput. So Tungsten starts by getting the data out from under the garbage c
UnsafeRow: A Binary Format Built for Speed
Storing data off-heap only helps if the data is stored compactly, and Tungsten's row format, the UnsafeRow, is designed for exactly that. A normal JVM representation of a row is a tree of objects: an object for the row, objects for its fields, pointers between them, plus the per-object header overhead the JVM adds to everything. For a billion rows, that overhead and pointer-chasing is enormous. UnsafeRow throws it away. An UnsafeRow stores a row as a single contiguous block of bytes in a fixed b
Whole-Stage Code Generation: One Loop for the Stage
The third pillar of Tungsten is the one whose fingerprints you already saw in the physical plan: whole-stage code generation. Without it, executing a plan means walking a tree of operators and calling a generic method for each operator on each row, with the overhead of a virtual function call and intermediate results passed between operators. For a billion rows through five operators, that is five billion method calls, and the call overhead alone dominates the actual work. Whole-stage code gener
Cache Locality: Why the Flat Layout Is Fast
One more reason Tungsten's design is fast operates a level below main memory: the CPU cache. Modern processors run far faster than main memory, so they keep recently-used data in small, very fast caches. Code runs fast when the data it needs is already in cache and slow when the CPU has to wait for main memory. How data is laid out in memory determines how often the CPU finds what it needs in cache, and Tungsten's flat binary layout is built to find it there. JVM objects make a sharp contrast. A
When the Optimizer Guesses Wrong
For all its sophistication, the optimizer can produce a bad plan, and knowing when and why matters because the fix usually means giving it better information or an explicit hint rather than fighting it. Almost every optimizer mistake traces back to the same root: a decision made from an estimate that turned out to be wrong. The most common cause is stale or missing statistics, the issue the intermediate tier introduced. The optimizer chooses a join strategy from estimated sizes, so if a table gr