Spark optimization interview questions for senior data engineer roles: skew handling with salt-and-rebalance, AQE override scenarios, broadcast versus sort-merge join decisions, partition pruning with proper WHERE clauses, and Spark UI deep dives to identify hot tasks. These are the optimization-round questions asked at L5+ Spark-first data engineer interviews.

Spark optimization rounds at senior data engineer interviews typically follow one structure: the interviewer hands you a Spark job (DataFrame code or SQL, plus an EXPLAIN plan or a Spark UI screenshot) and says, "This is slower than expected. What would you change?" The data engineer's job is to identify the cause from the artifact alone and propose a fix. Five recurring causes account for 80 percent of optimization-round questions in 2026.

Skew on a join key. One key has 10x or more the rows of the median key. Symptom in the Spark UI: one task running at 8x the median duration with 10x the median shuffle read. Fix: salt the hot key with a mod-N suffix on both sides, join, aggregate by salted key, then unsalt and re-aggregate. AQE in Spark 3.0+ (enabled by default since 3.2) handles some skew patterns automatically, not by salting but by splitting oversized shuffle partitions in a join; for skew at stage boundaries that AQE cannot identify, the data engineer applies the fix manually.

Function in WHERE preventing partition pruning. WHERE DATE(event_ts) = '2026-05-27' on a date-partitioned table often cannot use the partition pruner, because Catalyst cannot invert DATE() to work out which partitions match. Symptom: full table scan in EXPLAIN, with no PartitionFilters listed on the scan. Fix: WHERE event_ts >= '2026-05-27' AND event_ts < '2026-05-28', a direct half-open range on the partition column. The same pattern applies to any function wrapping the partition column.
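
The rewrite above is mechanical enough to generate. A minimal sketch in plain Python, assuming the predicate is assembled as a SQL string before being passed to Spark; the helper name day_range_predicate is hypothetical and not part of any Spark API.

```python
from datetime import date, timedelta


def day_range_predicate(column: str, day: date) -> str:
    """Build a half-open range predicate that covers one calendar day.

    The column is compared directly, with no function wrapped around it,
    so the optimizer sees plain comparisons on the partition column.
    """
    next_day = day + timedelta(days=1)
    return (
        f"{column} >= '{day.isoformat()}' "
        f"AND {column} < '{next_day.isoformat()}'"
    )


predicate = day_range_predicate("event_ts", date(2026, 5, 27))
print(predicate)
# event_ts >= '2026-05-27' AND event_ts < '2026-05-28'
```

The half-open upper bound (strictly less than the next day) avoids having to guess the last representable timestamp of the day, and it rolls over month and year boundaries correctly.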

Broadcast join not happening when it should. Both sides look large in Catalyst's estimate, but one side is actually small enough to fit in memory on the driver and on every executor. Symptom: SortMergeJoin in EXPLAIN where you expected BroadcastHashJoin. Cause: table statistics are out of date (ANALYZE TABLE has not run), or the side is just over spark.sql.autoBroadcastJoinThreshold (default 10 MB, often too low). Fix: run ANALYZE TABLE to update stats, raise the threshold to 100 MB or more, or apply a /*+ BROADCAST(table) */ hint.
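
The threshold behaviour can be illustrated with a deliberately simplified model in plain Python. It compares only the estimated size against the threshold; the real planner also weighs join type, hints, and runtime statistics, so treat expected_join_strategy as a hypothetical teaching aid rather than Spark's logic.

```python
MB = 1024 * 1024

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * MB


def expected_join_strategy(estimated_bytes: int,
                           threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> str:
    """Simplified model: broadcast only if the size estimate fits the threshold."""
    if threshold_bytes < 0:
        # Setting the threshold to -1 disables automatic broadcast joins.
        return "SortMergeJoin"
    if estimated_bytes <= threshold_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


# A 12 MB dimension table misses the default threshold by 2 MB...
print(expected_join_strategy(12 * MB))             # SortMergeJoin
# ...and qualifies once the threshold is raised to 100 MB.
print(expected_join_strategy(12 * MB, 100 * MB))   # BroadcastHashJoin
```

The model takes the estimate as input, which is why stale statistics matter: if the estimate is inflated, the comparison fails even when the table's true size is well under the threshold.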

Wrong shuffle partition count. The spark.sql.shuffle.partitions default of 200 is often wrong: too few for very large data (each partition is too big and executors run out of memory), too many for small data (the overhead of many small tasks dominates). Symptom: a handful of very large tasks (under-parallelism) or a great many very short tasks (scheduling overhead dominates). Fix: tune the count to target 100-200 MB per partition; AQE coalesces small post-shuffle partitions automatically.

Memory pressure with spill. The executor heap is too small for the working set, so Spark spills to disk and the query becomes IO-bound. Symptom: spill metrics greater than 0 in the Spark UI, GC time greater than 10 percent of task time. Fix: increase executor memory (spark.executor.memory) or reduce partition size (more partitions). Separately, avoid collect() and other actions that pull data to the driver; those exhaust driver memory rather than executor memory.

The Spark UI is the optimization-round artifact. Senior data engineer interview questions present a screenshot and ask for a diagnosis. The Summary Metrics table on a stage page shows the min, 25th percentile, median, 75th percentile, and max of task duration, GC time, shuffle read, shuffle write, and spill. A max at 10x the median means skew. Spill greater than 0 means memory pressure. GC time greater than 10 percent of task time means GC pressure. The Tasks table sorted descending by duration shows the slow partitions. The Storage tab shows cached DataFrames and their RDD partitions.
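
The three rules of thumb above can be written down as a small checker. This is a sketch in plain Python over per-task numbers copied by hand from a stage page; the function name diagnose_stage, its inputs, and the choice to measure GC as a share of total task time are assumptions for illustration, not a Spark API.

```python
from statistics import median


def diagnose_stage(durations_s, spill_bytes, gc_time_s,
                   skew_ratio=10.0, gc_fraction=0.10):
    """Apply the rules of thumb: max >= 10x median, spill > 0, GC > 10 percent."""
    findings = []
    med = median(durations_s)
    if med > 0 and max(durations_s) >= skew_ratio * med:
        findings.append("skew")
    if sum(spill_bytes) > 0:
        findings.append("memory pressure")
    total = sum(durations_s)
    if total > 0 and sum(gc_time_s) / total > gc_fraction:
        findings.append("GC pressure")
    return findings


# Hypothetical stage: four normal tasks and one straggler that also spills.
durations = [12, 11, 13, 12, 130]
spill = [0, 0, 0, 0, 2_000_000_000]
gc = [0.5, 0.4, 0.6, 0.5, 30.0]
print(diagnose_stage(durations, spill, gc))
# ['skew', 'memory pressure', 'GC pressure']
```

In this made-up stage the median duration is 12 seconds and the max is 130, so the 10x rule fires; the single spilling task and the 32 seconds of GC out of 178 seconds of task time trigger the other two findings.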

Companies whose data engineer interviews emphasize Spark optimization heavily: Databricks (Spark itself), Netflix (Spark UI screenshots are a recurring question type), Uber (optimization at very large scale), and Airbnb. Each can have a 30-60 minute round dedicated to optimization, especially at L5+, where the rubric expects EXPLAIN reading and Spark UI deep dives.

Spark Optimization Interview Questions

Spark performance tuning questions for senior data engineer interview prep.

Common questions

How does a data engineer identify Spark skew?
Open the Spark UI, go to the Stages tab, click the slow stage, and look at the Tasks table sorted by duration descending. Skew shows as one or two tasks at 5-10x the median duration with a corresponding 5-10x shuffle read or write. In the Summary Metrics table at the top, a max duration at 10x the median is the smoking gun.
What is the salt-and-rebalance technique for Spark skew?
Append a mod-N suffix to the hot key on both sides (typically N=8 to 32). On the large side, each existing row gets one suffix value between 0 and N-1; on the small side, each row is replicated N times, once per suffix value, so every salted key on the large side still finds its match. Joining on the salted key spreads the formerly hot key across N partitions. Then aggregate by salted key, unsalt, and re-aggregate. Trade-off: N-fold replication of the small side.
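
The steps above can be traced end to end in a pure-Python model, with lists standing in for the two sides of the join. This is a sketch of the idea, not Spark code: the function name salted_join_count is hypothetical, and using the row position mod N as the salt is an assumption made to keep the run deterministic (a random integer in the same range is the other common choice).

```python
from collections import Counter, defaultdict


def salted_join_count(large_rows, small_rows, n_salts):
    """Count join matches per key via salted keys; returns (partial, final)."""
    # Large side: one salt per row, here the row position mod N.
    salted_large = [((key, i % n_salts), value)
                    for i, (key, value) in enumerate(large_rows)]

    # Small side: replicate every row once per salt value.
    salted_small = defaultdict(list)
    for key, value in small_rows:
        for salt in range(n_salts):
            salted_small[(key, salt)].append(value)

    # Join on the salted key and aggregate per salted key.
    partial = Counter()
    for salted_key, _value in salted_large:
        partial[salted_key] += len(salted_small.get(salted_key, []))

    # Unsalt and re-aggregate to get the answer per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


large = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
small = [("hot", "H"), ("cold", "C")]
partial, final = salted_join_count(large, small, n_salts=8)

print(final["hot"], final["cold"])                              # 1000 10
print(max(c for (k, _s), c in partial.items() if k == "hot"))   # 125
```

The final counts match an unsalted join, but the largest group any single salted key has to process drops from 1000 rows to 125, which is the point of the technique. The small side grew from 2 rows to 16, which is the N-fold replication cost.
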
When does AQE help and when does a data engineer need to override?
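AQE in Spark 3.0+ (enabled by default since 3.2) helps in three ways: it splits skewed partitions detected after a shuffle in a join, it switches a sort-merge join to a broadcast join at runtime when the actual size of one side turns out to be under the broadcast threshold, and it coalesces small partitions after a shuffle. AQE cannot help with skew at a stage boundary it does not optimize (for example, a skewed aggregation rather than a join), or when the data engineer knows broadcast is correct but Catalyst's stats are stale. In those cases the data engineer overrides with manual salting or an explicit broadcast hint.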
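
For reference, the AQE behaviours discussed above map to three switches in the Spark SQL configuration. The sketch below, in plain Python, only renders them as spark-submit arguments; the helper to_submit_flags is hypothetical, and the values simply make the recent defaults explicit.

```python
# Real Spark SQL configuration keys; the values make the defaults explicit.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                     # master switch
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
}


def to_submit_flags(conf):
    """Render a config dict as a list of spark-submit --conf arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


print(" ".join(to_submit_flags(AQE_CONF)))
```

Setting one of these to false is a quick way to check in an interview exercise whether a plan change came from AQE or from a manual fix.
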
Why does a function in WHERE prevent partition pruning?
Catalyst cannot reason about the function in reverse. WHERE DATE(event_ts) = '2026-05-27' requires Catalyst to invert DATE() to identify the matching partitions, and it generally cannot. Rewrite it as WHERE event_ts >= '2026-05-27' AND event_ts < '2026-05-28' so Catalyst sees a direct comparison on the partition column. The same applies to UPPER(name), CAST(id AS string), and user-defined functions wrapped around a partition column.
How does a data engineer pick the right number of Spark shuffle partitions?
Target 100-200 MB per partition. For a 100 GB shuffle, that works out to 500-1000 partitions. The spark.sql.shuffle.partitions default of 200 is often wrong for very small or very large data. AQE in Spark 3.0+ coalesces small post-shuffle partitions automatically; for very large shuffles, set the partition count explicitly.
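
The arithmetic behind that range is a single ceiling division. A minimal sketch in plain Python, assuming binary units (1 GB = 1024 MB); the helper name shuffle_partitions is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def shuffle_partitions(shuffle_bytes: int, target_partition_bytes: int) -> int:
    """Partitions needed so that each one holds at most the target size."""
    # Integer ceiling division; always return at least one partition.
    return max(1, (shuffle_bytes + target_partition_bytes - 1) // target_partition_bytes)


low = shuffle_partitions(100 * GB, 200 * MB)
high = shuffle_partitions(100 * GB, 100 * MB)
print(low, high)  # 512 1024
```

With binary units the bounds come out at 512 and 1024, which matches the 500-1000 rule of thumb. The result is the kind of value one would then set for spark.sql.shuffle.partitions.
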
What does spill in the Spark UI mean and how is it fixed?
The executor heap is too small for the working set, so Spark spills intermediate data to disk and the query becomes IO-bound. Symptom: spill metrics greater than 0. Fix: increase executor memory (spark.executor.memory) or reduce partition size (more partitions mean a smaller working set per task). Pulling data to the driver with collect() is a separate problem: it strains driver memory and does not cause executor spill.
When should a data engineer cache or checkpoint a DataFrame?
Cache when the same DataFrame is reused by multiple actions in the same job (df.cache(), then df.unpersist() when done). Checkpoint when the lineage is so long (10+ wide transformations) that recomputation after an executor failure would be expensive: df.checkpoint() writes the data to the directory set with SparkContext.setCheckpointDir and returns a new DataFrame with truncated lineage, so that directory should be on durable storage. For most jobs, neither is needed; Spark handles re-computation efficiently.
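How does the EXPLAIN plan reveal Spark optimization opportunities?
Spark SQL has no EXPLAIN ANALYZE; read the EXPLAIN output (EXPLAIN FORMATTED is the easiest variant to scan) and use the SQL tab of the Spark UI for runtime metrics. Look for SortMergeJoin where BroadcastHashJoin would be faster (stats out of date or threshold too low). Look for full table scans (PartitionFilters absent) when a partition filter was intended (function in WHERE). Look for an Exchange (shuffle) that could be avoided. Look for a Filter node sitting above the scan while the scan's PushedFilters list is empty, which means the predicate was not pushed down to the source.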