Question 1

How does a data engineer identify Spark skew?

Accepted Answer

Open the Spark UI, go to the Stages tab, click the slow stage, and sort the Tasks table by duration, descending. Skew shows up as one or two tasks running 5-10x longer than the median, with a correspondingly larger (5-10x) shuffle read or write. The Summary Metrics table at the top of the stage page confirms it: a max duration around 10x the median is the smoking gun.
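
The max-to-median check can be sketched in plain Python for durations copied out of the Tasks table. The helper name skew_report and the 5x threshold are assumptions for illustration, not part of Spark.

```python
from statistics import median

def skew_report(task_durations_s, threshold=5.0):
    """Return (max/median ratio, is_skewed) for one stage's task durations.

    `threshold` is a hypothetical cutoff; the answer above treats
    5-10x the median as suspicious.
    """
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    med = median(task_durations_s)
    worst = max(task_durations_s)
    if med == 0:
        # Avoid dividing by zero when most tasks finish instantly.
        ratio = float("inf") if worst > 0 else 1.0
    else:
        ratio = worst / med
    return ratio, ratio >= threshold

# Illustrative durations in seconds: six similar tasks and one straggler.
durations = [12, 11, 13, 12, 14, 12, 125]
ratio, skewed = skew_report(durations)
print(f"max/median = {ratio:.1f}, skewed = {skewed}")
```

The median is used rather than the mean because a single straggler pulls the mean up and hides itself.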

Question 2

What is the salt-and-rebalance technique for Spark skew?

Accepted Answer

Append a suffix in the range 0 to N-1 (random, or a hash mod N) to the hot key on both sides, typically with N = 8 to 32. On the large side, each existing row gets one suffix. On the small side, each row is replicated N times, once per suffix value, so every salted key on the large side still finds its match. Joining on the salted key spreads the formerly hot key across N partitions. For aggregations: aggregate by the salted key, drop the salt, then aggregate again. Trade-off: N-fold replication of the small side.
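
The mechanics can be shown without a cluster. The following is a minimal plain-Python sketch, not Spark code: the function names are hypothetical, the join is an ordinary in-memory hash join standing in for the Spark join, and for simplicity every key is salted rather than only the hot one.

```python
import random
from collections import Counter

N = 8  # salt range; the answer above suggests 8 to 32

def salt_large(rows, n=N, rng=random):
    """Large side: attach one random suffix in [0, n) to each row's key."""
    return [((key, rng.randrange(n)), value) for key, value in rows]

def salt_small(rows, n=N):
    """Small side: replicate every row once per suffix value."""
    return [((key, s), value) for key, value in rows for s in range(n)]

def join(left, right):
    """In-memory hash join on the key, standing in for the Spark join."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

# Illustrative data: one hot key with 1000 rows, one cold key with 1 row.
large = [("hot", i) for i in range(1000)] + [("cold", -1)]
small = [("hot", "H"), ("cold", "C")]

joined = join(salt_large(large, rng=random.Random(0)), salt_small(small))

# Drop the salt to recover rows keyed by the original key.
unsalted = [(key, lv, rv) for (key, _salt), lv, rv in joined]

# Row counts per salted key: the hot key is now split across several keys.
rows_per_salted_key = Counter(key for key, _, _ in joined)
```

The result matches an unsalted join row for row; what changes is that no single join key carries all 1000 hot rows, at the cost of the small side growing from 2 rows to 16.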

Question 3

When does AQE help and when does a data engineer need to override?

Accepted Answer

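AQE (available since Spark 3.0, enabled by default since 3.2) helps in three ways: it splits skewed partitions detected after a shuffle in a join, it switches a sort-merge join to a broadcast join at runtime when the actual size of one side turns out to be under the broadcast threshold, and it coalesces small partitions after a shuffle. AQE cannot help with skew outside the cases it handles (for example, a skewed aggregation still needs manual salting), and it does not make a broadcast decision you know to be correct when Catalyst's stats are stale; in that case, override with an explicit broadcast hint.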
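
As a configuration sketch, the settings behind these behaviors can be collected in one place and rendered as spark-submit arguments. The configuration keys are real Spark settings; the helper to_submit_args and the 64 MB threshold value are assumptions chosen for illustration.

```python
# AQE-related settings; values are illustrative, not recommendations.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Manual override: raise the static broadcast threshold to 64 MB (bytes).
    "spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),
}

def to_submit_args(conf):
    """Render a config dict as repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(["spark-submit"] + to_submit_args(AQE_CONF) + ["job.py"]))
```

Here job.py is a placeholder script name. Keeping the overrides in a single dict makes it easier to see which defaults a job departs from.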

Question 4

Why does a function in WHERE prevent partition pruning?

Accepted Answer

Catalyst cannot reason about the function in reverse. WHERE DATE(event_ts) = '2026-05-27' would require Catalyst to invert DATE() to work out which partitions can match, and it cannot. Rewrite it as WHERE event_ts >= '2026-05-27' AND event_ts < '2026-05-28' so Catalyst sees a direct comparison on the partition column. The same applies to UPPER(name), CAST(id AS string), and user-defined functions.
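
When the filter date is a parameter, the half-open range can be generated rather than typed by hand. This is a small sketch with a hypothetical helper name; it only builds the predicate string and assumes the column name comes from trusted code, not user input.

```python
from datetime import date, timedelta

def day_range_predicate(column, day):
    """Build a half-open [day, day + 1) range predicate on `column`.

    The column is compared directly, with no function wrapped around it,
    so the optimizer can match the predicate against partition values.
    """
    next_day = day + timedelta(days=1)
    return (
        f"{column} >= '{day.isoformat()}' "
        f"AND {column} < '{next_day.isoformat()}'"
    )

print(day_range_predicate("event_ts", date(2026, 5, 27)))
# event_ts >= '2026-05-27' AND event_ts < '2026-05-28'
```

Using date arithmetic for the upper bound keeps month and year boundaries correct without special cases.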

Question 5

How does a data engineer pick the right number of Spark shuffle partitions?

Accepted Answer

Target 100-200 MB per partition. For a 100 GB shuffle, that works out to 500-1000 partitions. The spark.sql.shuffle.partitions default of 200 is often wrong for very small or very large data. AQE in Spark 3.0+ coalesces small post-shuffle partitions automatically; for very large shuffles, set the partition count explicitly.
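
The arithmetic is simple enough to keep as a helper. This sketch uses a hypothetical function name and binary units (1 GB = 1024 MB), so the results land slightly above the rounded 500-1000 figures.

```python
def shuffle_partitions(shuffle_bytes, target_partition_mb=128):
    """Partition count that keeps each partition near the target size."""
    target_bytes = target_partition_mb * 1024 * 1024
    # Integer ceiling division; never return fewer than one partition.
    return max(1, -(-shuffle_bytes // target_bytes))

GB = 1024 ** 3
low = shuffle_partitions(100 * GB, target_partition_mb=200)
high = shuffle_partitions(100 * GB, target_partition_mb=100)
print(low, high)  # 512 1024
```

The resulting number is what would be assigned to spark.sql.shuffle.partitions for that job.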

Question 6

What does spill in the Spark UI mean and how is it fixed?

Accepted Answer

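Spill means the executor's memory is too small for a task's working set, so Spark writes intermediate data to disk and the query becomes IO-bound. Symptom: the Spill (Memory) and Spill (Disk) metrics for a stage are greater than 0. Fix: increase executor memory (spark.executor.memory), reduce partition size (more partitions give each task a smaller working set), or fix skew so that no single task holds an outsized share of the data. Avoiding actions that pull data to the driver protects driver memory but does not reduce executor spill.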
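
A rough way to judge whether a partition size fits is to estimate the memory one task can use. The sketch below assumes Spark's unified memory model with its 300 MB reserved heap and the default spark.memory.fraction of 0.6; the function name is hypothetical, and the result is an upper-bound estimate because the pool is shared with cached data and the per-task share varies at runtime.

```python
RESERVED_MB = 300  # heap reserved by Spark before the unified pool is sized

def execution_memory_per_task_mb(executor_memory_mb, executor_cores,
                                 memory_fraction=0.6):
    """Estimate the unified-pool memory available to each concurrent task."""
    usable_mb = max(0, executor_memory_mb - RESERVED_MB)
    return usable_mb * memory_fraction / executor_cores

# Illustrative executor: 8 GB heap, 4 cores (4 tasks running at once).
per_task = execution_memory_per_task_mb(8192, 4)
print(f"about {per_task:.0f} MB per task")
```

If the uncompressed working set of a partition is well above this figure, spill is likely, which is why raising memory or adding partitions both help.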

Question 7

When should a data engineer cache or checkpoint a DataFrame?

Accepted Answer

Cache when the same DataFrame is reused by multiple actions in the same application (df.cache(), then df.unpersist() when done). Checkpoint when the lineage is so long (10+ wide transformations) that recovery on executor failure would be expensive: df.checkpoint() writes to durable storage and truncates lineage. It requires a checkpoint directory to be set first and returns a new DataFrame, which is the one to keep using. For most jobs, neither is needed; Spark handles re-computation efficiently.
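
Both patterns can be wrapped in small helpers. This is a sketch with hypothetical function names; it assumes df is a PySpark DataFrame and spark is a SparkSession, and calls only cache(), unpersist(), checkpoint(), and sparkContext.setCheckpointDir().

```python
def run_with_cache(df, actions):
    """Cache df, run each action against it, then release the cache.

    `actions` is a list of callables that each take the cached DataFrame,
    for example `lambda d: d.count()`.
    """
    cached = df.cache()
    try:
        return [action(cached) for action in actions]
    finally:
        # Release the cached blocks even if an action fails.
        cached.unpersist()

def truncate_lineage(spark, df, checkpoint_dir):
    """Checkpoint df to durable storage and return the truncated DataFrame."""
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # checkpoint() returns a new DataFrame; the original keeps its lineage.
    return df.checkpoint()
```

A call might look like run_with_cache(df, [lambda d: d.count(), lambda d: d.distinct().count()]). The try/finally keeps cached blocks from lingering when an action raises.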

Question 8

How does the EXPLAIN plan reveal Spark optimization opportunities?

Accepted Answer

Look for SortMergeJoin where BroadcastHashJoin would be faster (stats out of date, or spark.sql.autoBroadcastJoinThreshold too low). Look for full table scans (an empty PartitionFilters list on the scan) where a partition filter was intended, usually caused by a function in WHERE. Look for Exchange (shuffle) nodes that could be avoided. Look for a Filter node that discards most rows while the scan's PushedFilters list is empty, which means the predicate was not pushed down to the source.
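
These checks can be run mechanically over plan text copied from the output of df.explain(). The sketch below is a simple pattern scan with hypothetical names, and the sample plan is a hand-written illustration rather than real output. A match is a prompt to look closer, not proof of a problem.

```python
import re

# Pattern per warning; \b keeps "Exchange" from matching "BroadcastExchange".
PLAN_CHECKS = {
    "sort-merge join (could it be broadcast?)": r"\bSortMergeJoin\b",
    "shuffle (can it be avoided?)": r"\bExchange\b",
    "scan without partition filter": r"PartitionFilters: \[\]",
    "scan without pushed filter": r"PushedFilters: \[\]",
}

def plan_warnings(plan_text):
    """Return the names of all checks whose pattern appears in the plan."""
    return [name for name, pattern in PLAN_CHECKS.items()
            if re.search(pattern, plan_text)]

# Hand-written plan fragment for illustration only.
SAMPLE_PLAN = """
SortMergeJoin [user_id#1], [user_id#7], Inner
:- Sort [user_id#1 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(user_id#1, 200)
:     +- FileScan parquet events[user_id#1] PartitionFilters: [], PushedFilters: []
"""

for warning in plan_warnings(SAMPLE_PLAN):
    print("check:", warning)
```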

Spark Optimization Interview Questions
