Apache Spark Interview Questions for L5–L7 (2026)

30 questions spanning Catalyst optimization, AQE skew handling, executor memory layout, structured streaming pitfalls, and production incident diagnosis. Tagged by seniority level so you can study what your interview round will actually test.

0.6
spark.memory.fraction default
10MB
autoBroadcastJoinThreshold default
256MB
AQE skew partition threshold
200
spark.sql.shuffle.partitions default

Core concepts every senior Spark interview tests

Hold these four mental models before answering specific questions. Each is a load-bearing chunk of Spark architecture interviewers expect you to reason from.

Query planning

Catalyst Optimizer

Spark's query optimizer. Runs in four phases: (1) analyze the parsed logical plan, (2) apply rule-based rewrites like predicate pushdown and column pruning, (3) cost-based optimization for join order and physical strategy when table statistics exist, (4) generate the physical plan with whole-stage codegen. Interview-relevant: rules are deterministic and free, but CBO requires you to run ANALYZE TABLE COMPUTE STATISTICS first. Without statistics, join order follows parse order.

Read a physical plan with df.explain(true)
Memory layout

Tungsten Execution Engine

Off-heap binary format that bypasses the JVM's object overhead. DataFrame operations encode rows into a compact format that's cache-friendly and avoids serialization overhead between stages. UDFs in Python or Scala step outside Tungsten and pay serialization costs — that's why a Pandas UDF is often 10–100x slower than the equivalent DataFrame expression. The whole-stage codegen pass compiles entire pipelines down to a single JVM method, eliminating virtual function calls.

Tungsten is why DataFrame >> RDD
Runtime planning

Adaptive Query Execution (AQE)

Re-optimizes the plan at runtime using actual shuffle statistics. Three main behaviors: (1) coalesce post-shuffle partitions when they're smaller than spark.sql.adaptive.advisoryPartitionSizeInBytes (64MB default), (2) switch sort-merge to broadcast when actual data is smaller than autoBroadcastJoinThreshold, (3) split skewed partitions when one exceeds 5x median size AND 256MB. Enabled by default in Spark 3.2+. Disabling AQE is almost never the right answer in modern Spark.

spark.sql.adaptive.enabled = true
Executor sizing

Unified Memory Model

Executor memory splits into reserved (300MB), user (~25%), and unified (~60% via spark.memory.fraction). The unified pool is shared between execution (joins, sorts, aggregations) and storage (cache). Execution can evict storage when it needs more space; cached blocks then spill to disk if persisted with MEMORY_AND_DISK. Past 30GB executor heap, garbage collection becomes a serious bottleneck — most production clusters target 16–24GB executors with 4–5 cores each.

spark.memory.fraction = 0.6

L5 questions — applied diagnosis

Senior-engineer-level questions. Each expects you to reason from concrete evidence to a fix. Read the Spark UI, name the root cause, choose the right config.

L5

A Spark job that ran fine for months suddenly takes 10x longer. Nothing in the code changed. Walk through your diagnosis.

Start with the Spark UI. Compare the slow run to a healthy baseline. Check input size per stage: did the source table grow past the 10MB autoBroadcastJoinThreshold, forcing a switch from broadcast to sort-merge? Check max vs median task duration for skew. Check executor count and container sizes for cluster changes. Check S3 throttling or HDFS slowdowns. The root cause usually isn't in the application code. AQE can silently change join strategies between runs when table statistics shift.

L5

Explain the difference between Catalyst rule-based and cost-based optimization. Give an example where CBO changes the physical plan.

Rule-based optimization applies many deterministic rewrites: predicate pushdown, column pruning, constant folding. These reliably improve the plan. Cost-based optimization uses table statistics (row count, column cardinality, histograms) to choose between alternatives. Example: a three-way join A JOIN B JOIN C. Rules cannot choose join order. CBO estimates intermediate sizes and reorders to join the two smallest tables first. CBO also decides broadcast vs sort-merge based on estimated byte sizes against the 10MB threshold. Without CBO enabled and statistics collected, Spark picks join order based on parse order, which is often orders of magnitude slower.

L5

When would you choose Kryo over Java serialization? What breaks?

Kryo is much faster than Java serialization and produces smaller serialized objects. Use it for shuffle-heavy jobs where serialization dominates wall time, for RDD operations with custom classes, and for large broadcast variables. The tradeoff: you must register classes with Kryo (spark.kryo.classRegistrationRequired). Unregistered classes fall back to writing the full class name per object, which can be slower than Java serialization. Kryo does not handle all Java types out of the box; some collection types and third-party classes need custom serializers. DataFrame operations use Tungsten binary format internally, so Kryo only helps RDD code paths and UDFs. If your job is DataFrame-only with no UDFs, switching to Kryo changes nothing.

L5

You see spark.speculation kill and re-launch a task 4 times. What is happening and should you disable speculation?

Speculation launches a duplicate task when the original takes longer than spark.speculation.multiplier (default 1.5x) times the median task duration. If the task keeps getting killed, the root cause is likely data skew, not a slow node. The speculative copy lands on the same skewed partition and also runs slow. Disabling speculation is the wrong fix. Instead, fix the skew: salt the key, increase parallelism, or enable AQE skew join handling. Speculation is useful for stragglers caused by hardware issues (degraded disks, noisy neighbors). It is harmful for stragglers caused by data skew because it doubles resource consumption without fixing the bottleneck.

L5

Your Spark job writes 50,000 small files to S3. Downstream queries are slow. What happened and how do you fix it?

50,000 files means 50,000 tasks in the final stage, each writing one file. The default shuffle.partitions is 200, but if the job repartitions or the DAG has multiple shuffles, partition count can multiply. Downstream reads suffer because each file requires a separate S3 LIST and GET call. Fixes: coalesce() before write to reduce output files, but coalesce can create skewed output if partition sizes vary. AQE coalescePartitions merges small post-shuffle partitions automatically. Delta Lake and Iceberg compact small files on write. maxRecordsPerFile caps individual file sizes. The tradeoff: fewer files means larger individual files, which increases memory per reader task.

L5

What does cache() actually do, and when does it hurt performance?

cache() marks a DataFrame for storage at MEMORY_AND_DISK level the first time it's evaluated. The next action reuses cached partitions instead of recomputing the lineage. It hurts in three cases: (1) the DataFrame is used only once — caching adds serialization overhead with no reuse, (2) the cache evicts data needed by a parallel job in the same SparkContext, causing recomputation elsewhere, (3) caching a tiny DataFrame is slower than recomputing it because cache miss handling and storage memory allocation have fixed overhead. Persist with MEMORY_ONLY only when you know the data fits comfortably; otherwise MEMORY_AND_DISK avoids cliff-edge failures.

Spark execution glossary

Eight terms that come up in every senior interview. Confusing application with job, or stage with task, is a fast track to a no-hire.

TermDefinition
ApplicationOne SparkSession, started by spark-submit or a notebook. One driver, many executors.
JobTriggered by an action (count, collect, write). Composed of stages.
StageA sequential set of tasks with no shuffle between them. New stage starts at every shuffle boundary.
TaskOne unit of work per partition per stage. Runs on a single executor core.
ShuffleCross-network repartition of data. Triggered by joins on non-bucketed columns, groupBy, distinct, repartition, window functions.
DAGThe Directed Acyclic Graph of stages for a single job. Built by Catalyst from the logical plan.
ExecutorJVM process running on a worker node. Holds memory and CPU cores. Runs tasks for the application.
DriverJVM process that hosts SparkContext and coordinates work. Runs the SparkSession and submits jobs.

Production failure modes you must be able to diagnose

Four classes of failure account for the majority of Spark incidents. Every L5+ interview includes a diagnosis question from one of these.

Most common interview topic

Data Skew

One partition holds far more data than the others. Symptom: 99% of tasks finish in seconds, the last 1% takes hours. In Spark UI, you see max task duration 100x median, max input bytes per task wildly off the median. Root cause: a join or groupBy key with a heavy hitter. Fixes: enable AQE skew join handling, salt the heavy key with a random integer 0–N then aggregate twice, broadcast the small side if it fits, or pre-aggregate before the join.

Salt = append random key, replicate, join, drop salt
Storage layer

Small Files Problem

Job writes thousands of tiny files to S3. Downstream reads suffer because each file requires a separate S3 LIST and GET. Cause: high shuffle.partitions (default 200) combined with low data volume per partition, or aggressive repartition before write. Fixes: coalesce(N) where N matches your target file size, enable AQE coalescePartitions, use maxRecordsPerFile to cap individual files, run Delta/Iceberg OPTIMIZE compaction post-write.

Target 128MB–1GB per file in S3/HDFS
Memory bound

Executor OOM

Executor JVM runs out of memory mid-task and gets killed. Spark UI shows ExecutorLostFailure with exit code 137. Causes: collect() on a large DataFrame pulls to the driver, a UDF accumulates state, a shuffle partition exceeds available execution memory, or off-heap memory grows past spark.executor.memoryOverhead. Fixes: increase executor.memoryOverhead (default 10% of heap, often too low for PyArrow workloads), reduce shuffle.partitions for memory headroom per task, repartition before the failing stage, eliminate UDF state accumulation.

Default memoryOverhead is too low for Python UDFs
JVM tuning

GC Pause Cascade

Tasks slow to a crawl, executors lose heartbeat, driver marks executors as dead and retries. Cause: heap pressure forces frequent full GC, pauses exceed network timeout (spark.network.timeout default 120s). Often happens with large executors (>30GB) using the default parallel GC. Fixes: switch to G1GC with -XX:+UseG1GC, reduce executor heap size and increase executor count, profile with -verbose:gc to confirm. Counterintuitively, smaller executors often outperform larger ones at scale.

If GC time > 10% of task time, you have a problem

L6 questions — system design and tradeoffs

Staff-track questions. Each expects you to architect an end-to-end solution, name the tradeoffs, and pick the right configs at scale.

L6

You need to join a 2TB table with a 500GB table. Both exceed the 10MB broadcast threshold. How do you optimize this join?

Sort-merge join is the only option at this scale. Optimize by filtering both sides before the join to reduce shuffle volume. If both tables are repeatedly joined on the same key, bucket them with identical bucket counts and the same join key. Bucketed tables skip both the shuffle and the sort. If one side has a hot key holding 15GB+ of data, salt that key: append a random integer 0-N, replicate the other side N times, join on (key, salt), then drop the salt column. AQE handles moderate skew automatically when a partition exceeds 256MB (the default skewedPartitionThresholdInBytes). The tradeoff with bucketing: upfront write cost is high, but it amortizes across every downstream join.

L6

Explain how AQE handles skew joins. What are the limitations?

AQE detects skew after the shuffle exchange by comparing actual partition sizes. When a partition exceeds both 5x the median size and 256MB (spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes), AQE splits it into subpartitions and replicates the matching partition from the other side. Limitations: only works for sort-merge joins, not broadcast or shuffle-hash. Replication increases memory pressure on the non-skewed side. Cannot detect skew before the shuffle because the data must materialize first. The threshold is static, not workload-adaptive. Does not help with aggregation skew, only join skew. If your skew is in a groupBy, you still need manual salting.

L6

Design the executor memory layout for a job that caches 100GB of reference data and runs a 500GB sort-merge join.

Executor memory splits into a unified pool controlled by spark.memory.fraction (default 0.6). On a 20GB executor, that gives 12GB to the unified pool, divided between execution (shuffles, joins, sorts) and storage (cache). Execution can evict storage when it needs space. For this workload: use 20-30GB executors, 50-100 of them. The 100GB cache needs at least 8-10 executors to hold in memory. Use MEMORY_AND_DISK so evicted cache blocks spill to disk instead of recomputing. The 500GB sort-merge join needs execution memory for sort buffers across many executors. Monitor executor.memoryOverhead for off-heap usage. The tradeoff: larger executors reduce shuffle overhead but increase GC pause times. Past 30GB per executor, GC becomes the bottleneck.

L6

A team wants to migrate from daily batch Spark jobs to Structured Streaming. What breaks?

State management is the biggest risk. Streaming jobs accumulate state for windows, dedup, and aggregations. Without size limits, state grows unbounded and causes OOM. Exactly-once delivery requires idempotent sinks or transactional writes (Delta Lake, Iceberg). Late-arriving data needs watermarks, and the watermark duration is a direct tradeoff between completeness and latency. Monitoring shifts from 'did the batch succeed' to 'is streaming lag under SLA.' Schema evolution is harder because changing the schema means restarting the streaming query and potentially rebuilding state. Backfilling historical data still requires batch mode. Micro-batch (the default) processes with 100ms+ latency. Continuous mode reaches sub-second latency but only supports map-like operations.

L6

Your Spark job reads 10TB from Parquet, filters to 1%, then aggregates. The filter step is shockingly slow. What's happening?

Parquet predicate pushdown is failing. Three common causes: (1) the filter expression uses a UDF, which Spark cannot push down — rewrite as a native expression, (2) the filter column is not in the Parquet file's min/max statistics, so Spark reads every row group — confirm with df.explain() showing PushedFilters, (3) the row group size is too large, causing reads to fetch megabytes of data per row group even when the filter excludes them all. Fixes: ensure filters use native expressions, sort the source table by the filter column at write time so values cluster into row groups, reduce parquet.block.size to 64MB for filter-heavy workloads.

Reading the Spark UI for a slow stage

# Reading the Spark UI for a slow stage:
#
# Stages tab -> click the slow stage -> Tasks table
#
# Column          | Healthy    | Skewed     | OOM at risk
# ----------------+------------+------------+-------------
# Duration (max)  | < 2x med   | > 10x med  | growing
# Input Size      | uniform    | one huge   | uniform
# Shuffle Read    | uniform    | one huge   | one huge
# GC Time         | < 10% task | any        | > 25% task
# Memory Spill    | 0          | possible   | growing
#
# If max duration >> median, it's skew. Look at Summary
# Metrics for that stage: min/median/max input size.
# If max input is 100x median, salt the join key.
#
# Quick CLI diagnosis from a Spark history server URL:
spark-shell --conf spark.eventLog.dir=s3://bucket/spark-events
val app = "application_1234567890_0001"
val ui = SparkUI.create(...)
ui.stages.foreach { s =>
  val taskMetrics = s.taskInfos.map(_.duration)
  val median = taskMetrics.sorted.apply(taskMetrics.size/2)
  val max = taskMetrics.max
  if (max > 10 * median) println(s"SKEW in stage ${s.stageId}")
}

A concrete diagnosis pattern interviewers expect you to walk through verbally. The skew threshold (max > 10x median) is the most-asked-about ratio.

L7 questions — multi-team and multi-tenant

Principal-track questions. Each expects you to design for thousands of pipelines, hundreds of teams, and petabyte-scale data.

L7

How would you design a Spark application that processes 100+ PB across a shared multi-tenant cluster?

At 100+ PB scale (Netflix, Uber, Apple run workloads this size), the bottleneck shifts from compute to resource isolation. Use dynamic allocation with strict min/max executor bounds per tenant. Set spark.dynamicAllocation.maxExecutors per job to prevent one tenant from starving others. A production cluster at this scale runs 50-500 executors per job, 4-8 cores each. Use fair scheduler pools with weighted queues. Separate fast (< 10 min) and slow (> 1 hr) workloads into different pools. External shuffle service is mandatory so executors can be released while shuffle data persists. Monitor shuffle spill: if spill-to-disk exceeds 20% of shuffle volume, increase executor memory. The tradeoff: tighter resource limits improve fairness but increase queue wait times.

L7

How do you roll out a major Spark version upgrade (3.3 → 3.5) across 10,000+ pipelines?

Three-phase rollout. Phase 1: stand up the new runtime in a parallel cluster, identify behavior differences with Spark's official upgrade guide (config defaults, deprecated APIs, type coercion changes). Most-impacting changes in recent releases: AQE default-on (3.2), Kryo registration enforcement (3.4), Parquet timestamp encoding (3.3). Phase 2: shadow-run 1% of pipelines on the new runtime, diff outputs row-by-row, alert on any divergence. Phase 3: migrate by team, oldest pipelines first because they're most likely to depend on deprecated behavior. Reserve rollback capacity for at least one week post-migration. The hardest failures are subtle correctness regressions (timestamp parsing, decimal precision) that pass tests but ship wrong numbers to downstream systems.

L7

Design a system that lets data scientists query a 100PB lake with Spark SQL while keeping cost predictable.

Three layers. (1) Storage: Iceberg or Delta with strict partitioning by date and a high-cardinality bucketing column, plus z-order on frequent filter columns. (2) Compute: cluster pools sized by query class — a small interactive pool with autoscale floor, a large batch pool with no floor. Route queries via a SQL gateway that classifies them based on estimated scan size from table statistics. Queries projected to scan >1TB without partition filter get rejected. (3) Cost: per-team usage quotas enforced via the SQL gateway. Cache frequently-accessed dimension tables (employees, products) cluster-wide via the Delta cache. Pre-materialize common aggregations into smaller summary tables and route queries via materialized view rewriting. Net effect: 90% of queries hit small summary tables, the other 10% pay the full scan cost.

Frequently asked questions

How deep should senior engineers know Spark internals?+
You should read a physical plan, diagnose shuffle bottlenecks from the Spark UI, explain the 60/40 memory split, and design for skew and fault tolerance. You do not need to know the scheduler source code. You do need to know why specific configurations exist (shuffle.partitions, autoBroadcastJoinThreshold, speculation.multiplier) and when to change them.
How are these different from general Spark interview questions?+
General Spark questions test architecture: what a DAG is, how lazy evaluation works, what a shuffle is. These questions assume you already know that. They test whether you can size executors, diagnose production incidents from Spark UI evidence, and make tradeoff decisions under constraints. The difference: knowing what a join is, versus knowing when to salt a skewed key.
What seniority level do these questions target?+
L5 through L7. L5 questions test applied knowledge: can you diagnose a problem and pick the right fix. L6 questions test system design: can you reason about memory layouts and architectural tradeoffs. L7 questions test organizational impact: multi-tenant clusters, resource isolation, cross-team pipeline design.
Will I need to write Spark code in the interview?+
Depends on the company and round. Coding rounds usually use PySpark DataFrame API for data transformations. System design rounds rarely require code. Diagnosis rounds expect you to read a Spark UI screenshot and explain what's wrong. Practice all three formats. The PySpark DataFrame syntax is the highest-yield skill: every Spark interview at every level tests it.
Do these questions apply to Databricks specifically?+
Yes, with caveats. Databricks Runtime adds Photon (vectorized engine), Delta caching, and SQL warehouses on top of open-source Spark. Photon changes physical plan inspection — operators show as 'Photon...' in the plan. Delta caching is a separate layer from spark.catalog.cacheTable. Otherwise the fundamentals are identical: Catalyst, Tungsten, AQE, shuffle, memory model. Open-source Spark knowledge transfers directly to Databricks roles.
02 / Why practice

The candidate who gets the offer

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related interview prep