
Apache Spark Interview Questions for Senior Engineers

10 questions that test production experience, not textbook knowledge. Memory fractions, partition thresholds, executor sizing, and incident debugging. Tagged by seniority level from L5 to L7.

10 questions · L5-L7 · Average interview coverage: 67.6%

Senior Apache Spark Interview Questions

Q1

A Spark job that ran fine for months suddenly takes 10x longer. Nothing in the code changed. Walk through your diagnosis.

L5 · debugging · Spark UI · AQE
A strong answer includes:

Start with the Spark UI. Compare the slow run to a healthy baseline. Check input size per stage: did the source table grow past the 10MB autoBroadcastJoinThreshold, forcing a switch from broadcast to sort-merge? Check max vs median task duration for skew. Check executor count and container sizes for cluster changes. Check S3 throttling or HDFS slowdowns. The root cause is almost never in the application code. A strong answer mentions that AQE can silently change join strategies between runs when table statistics shift.
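The max-vs-median check above can be sketched numerically. A minimal illustration (the helper name and thresholds are illustrative, not a Spark API) of how you would eyeball task-duration skew from numbers copied out of a stage's task table:

```python
from statistics import median

def skew_ratio(task_durations_s):
    """Max/median task duration for a stage; a common rule of thumb
    treats ratios above ~4-5x as a sign of partition skew."""
    m = median(task_durations_s)
    return max(task_durations_s) / m if m else float("inf")

# A healthy stage: tasks finish in roughly uniform time.
healthy = [12, 14, 13, 15, 12, 13]
# A degraded stage: one straggler dominated by a hot partition.
skewed = [12, 14, 13, 15, 12, 160]

print(round(skew_ratio(healthy), 1))  # close to 1
print(round(skew_ratio(skewed), 1))   # an order of magnitude higher
```

A ratio near 1 on the baseline run and 10x on the slow run points at skew or a growth in one partition's input, not at the code.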

Q2

Explain the difference between Catalyst rule-based and cost-based optimization. Give an example where CBO changes the physical plan.

L5 · Catalyst · query planning · CBO
A strong answer includes:

Rule-based optimization applies 100+ deterministic rewrites: predicate pushdown, column pruning, constant folding. These always improve the plan. Cost-based optimization uses table statistics (row count, column cardinality, histograms) to choose between alternatives. Example: a three-way join A JOIN B JOIN C. Rules cannot choose join order. CBO estimates intermediate sizes and reorders to join the two smallest tables first. CBO also decides broadcast vs sort-merge based on estimated byte sizes against the 10MB threshold. Without CBO enabled and statistics collected, Spark picks join order based on parse order, which can be 10-100x slower.
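The join-reordering idea can be shown with a toy greedy planner. This is a deliberately crude sketch (not Catalyst's actual algorithm, and the min-of-two-sides size estimate is a stand-in for real selectivity statistics), but it captures why statistics change the plan:

```python
def greedy_join_order(table_sizes):
    """Toy cost-based reordering: repeatedly join the two smallest
    inputs, assuming an intermediate result roughly the size of the
    smaller side (a crude stand-in for cardinality estimates)."""
    tables = dict(table_sizes)
    order = []
    while len(tables) > 1:
        (a, sa), (b, sb) = sorted(tables.items(), key=lambda kv: kv[1])[:2]
        order.append((a, b))
        del tables[a], tables[b]
        tables[a + "+" + b] = min(sa, sb)
    return order

# Sizes in MB. Parse order would join A with B first; a cost-based
# ordering starts with the two small tables instead.
order = greedy_join_order({"A": 2000, "B": 10, "C": 50})
print(order)  # → [('B', 'C'), ('B+C', 'A')]
```

With no statistics, Spark has nothing to feed a planner like this, which is why `ANALYZE TABLE ... COMPUTE STATISTICS` matters before enabling CBO.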

Q3

Your Spark job writes 50,000 small files to S3. Downstream queries are slow. What happened and how do you fix it?

L5 · small files · S3 · coalesce
A strong answer includes:

50,000 files means 50,000 tasks in the final stage, each writing one file. The default shuffle.partitions is 200, but if the job repartitions or the DAG has multiple shuffles, partition count can multiply. Downstream reads suffer because each file requires a separate S3 LIST and GET call. Fixes: coalesce() before write to reduce output files, but coalesce can create skewed output if partition sizes vary. AQE coalescePartitions merges small post-shuffle partitions automatically. Delta Lake and Iceberg compact small files on write. maxRecordsPerFile caps individual file sizes. The tradeoff: fewer files means larger individual files, which increases memory per reader task.
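Picking a coalesce target is simple arithmetic. A minimal sketch (the helper name is made up; 128MB is a common target for S3/HDFS, not a Spark default) of sizing output partitions from total output bytes:

```python
def target_partitions(total_bytes, target_file_bytes=128 * 1024**2):
    """Number of output partitions so each written file lands near a
    target size; uses ceiling division so no file exceeds the target
    on average."""
    return max(1, -(-total_bytes // target_file_bytes))

# A 10GB output coalesced toward ~128MB files needs 80 partitions,
# not the 50,000 the job produced.
print(target_partitions(10 * 1024**3))  # → 80
```

In practice you would pass this number to `coalesce()` before the write, or let AQE's partition coalescing pick it from runtime sizes.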

Q4

You need to join a 2TB table with a 500GB table. Both exceed the 10MB broadcast threshold. How do you optimize this join?

L6 · joins · bucketing · salting · AQE
A strong answer includes:

Sort-merge join is the only option at this scale. Optimize by filtering both sides before the join to reduce shuffle volume. If both tables are repeatedly joined on the same key, bucket them with identical bucket counts and the same join key. Bucketed tables skip both the shuffle and the sort. If one side has a hot key holding 15GB+ of data, salt that key: append a random integer 0-N, replicate the other side N times, join on (key, salt), then drop the salt column. AQE handles moderate skew automatically when a partition exceeds both 5x the median and 256MB (the default skewedPartitionThresholdInBytes). The tradeoff with bucketing: upfront write cost is high, but it amortizes across every downstream join.
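The salting recipe can be sketched with plain dicts. A hedged illustration (the function is hypothetical; in Spark you would do the same thing with `rand()`, an `explode`, and a join on the composite key):

```python
import random

def salted_join(big, small, n_salts=4, seed=0):
    """Key salting for a skewed join: the big side gets a random salt
    per row, the small side is replicated once per salt value, and the
    join key becomes (key, salt) so the hot key spreads over n_salts
    partitions."""
    rng = random.Random(seed)
    salted_small = {(k, s): v for k, v in small.items()
                    for s in range(n_salts)}
    out = []
    for k, v in big:  # big side modeled as (key, value) rows
        salt = rng.randrange(n_salts)
        if (k, salt) in salted_small:
            out.append((k, v, salted_small[(k, salt)]))
    return out

big_rows = [("hot", i) for i in range(6)] + [("cold", 99)]
small_map = {"hot": "dim-a", "cold": "dim-b"}
rows = salted_join(big_rows, small_map)
print(len(rows))  # every big-side row still finds its match → 7
```

The cost is visible in the sketch: the small side is materialized n_salts times, which is exactly the memory tradeoff the answer describes.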

Q5

Explain how AQE handles skew joins. What are the limitations?

L6 · AQE · skew · partitioning
A strong answer includes:

AQE detects skew after the shuffle exchange by comparing actual partition sizes. When a partition exceeds both 5x the median size and 256MB (spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes), AQE splits it into subpartitions and replicates the matching partition from the other side. Limitations: it only applies to shuffle-based joins (sort-merge, plus shuffled-hash in recent versions), never broadcast. Replication increases memory pressure on the non-skewed side. Cannot detect skew before the shuffle because the data must materialize first. The threshold is static, not workload-adaptive. Does not help with aggregation skew, only join skew. If your skew is in a groupBy, you still need manual salting.
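The two-condition test is easy to state precisely. A minimal sketch of the detection rule using the defaults quoted above (the function itself is illustrative, not Spark code):

```python
from statistics import median

def is_skewed(sizes_bytes, factor=5.0, threshold=256 * 1024**2):
    """Mimics AQE's skewed-partition test: a partition counts as skewed
    only if it exceeds BOTH factor x median AND the absolute byte
    threshold (defaults mirror spark.sql.adaptive.skewJoin.*)."""
    med = median(sizes_bytes)
    return [s > factor * med and s > threshold for s in sizes_bytes]

mb = 1024**2
sizes = [50 * mb, 60 * mb, 55 * mb, 400 * mb]
print(is_skewed(sizes))  # only the 400MB partition trips both tests
```

Note why both conditions exist: the 5x factor alone would flag tiny partitions in a uniformly small shuffle, and the byte threshold alone would flag everything in a uniformly large one.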

Q6

Design the executor memory layout for a job that caches 100GB of reference data and runs a 500GB sort-merge join.

L6 · memory · caching · executor sizing
A strong answer includes:

Executor memory splits into a unified pool controlled by spark.memory.fraction (default 0.6). On a 20GB executor, that gives roughly 12GB to the unified pool (Spark takes a 300MB reserve off the heap before applying the fraction), divided between execution (shuffles, joins, sorts) and storage (cache). Execution can evict storage when it needs space. For this workload: use 20-30GB executors, 50-100 of them. The 100GB cache needs at least 8-10 executors to hold in memory. Use MEMORY_AND_DISK so evicted cache blocks spill to disk instead of recomputing. The 500GB sort-merge join needs execution memory for sort buffers across many executors. Monitor spark.executor.memoryOverhead for off-heap usage. The tradeoff: larger executors reduce shuffle overhead but increase GC pause times. Past 30GB per executor, GC becomes the bottleneck.
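The sizing arithmetic is worth doing exactly. A small sketch of the unified-pool formula (the helper is illustrative; the 300MB reserve and 0.6 fraction are Spark's documented defaults):

```python
def unified_pool_bytes(executor_mem_bytes,
                       memory_fraction=0.6,
                       reserved_bytes=300 * 1024**2):
    """Spark's unified memory pool: (heap - 300MB reserved) * fraction,
    shared between execution and storage."""
    return int((executor_mem_bytes - reserved_bytes) * memory_fraction)

gb = 1024**3
pool = unified_pool_bytes(20 * gb)
print(round(pool / gb, 2))  # just under 12GB once the reserve is taken
```

Running this for candidate executor sizes makes the cache math concrete: at ~11.8GB of unified pool per 20GB executor, a 100GB cache plausibly needs around 9-10 executors even before execution memory competes for the pool.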

Q7

When would you choose Kryo over Java serialization? What breaks?

L5 · serialization · Kryo · Tungsten
A strong answer includes:

Kryo is up to 10x faster than Java serialization and produces smaller serialized objects. Use it for shuffle-heavy jobs where serialization dominates wall time, for RDD operations with custom classes, and for large broadcast variables. The tradeoff: you should register classes with Kryo (and set spark.kryo.registrationRequired to enforce it). Unregistered classes fall back to writing the full class name per object, which can be slower than Java serialization. Kryo does not handle all Java types out of the box. Some collection types and third-party classes need custom serializers. DataFrame operations use Tungsten binary format internally, so Kryo only helps RDD code paths and UDFs. If your job is 100% DataFrames with no UDFs, switching to Kryo changes nothing.
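A hedged sketch of the configuration this answer implies, expressed as a plain dict so it stays self-contained (the config keys are real Spark properties; the class names are made-up application classes, and in a real job you would pass these to SparkConf or spark-submit):

```python
def kryo_conf(app_classes):
    """Kryo settings for a shuffle-heavy RDD job. Values are
    illustrative; registrationRequired makes unregistered classes
    fail fast instead of silently serializing full class names."""
    return {
        "spark.serializer":
            "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrationRequired": "true",
        "spark.kryo.classesToRegister": ",".join(app_classes),
    }

conf = kryo_conf(["com.example.ClickEvent", "com.example.SessionKey"])
print(conf["spark.serializer"])
```

Turning registrationRequired on in staging is the cheap way to discover which classes you forgot, before the fallback quietly erases Kryo's size advantage in production.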

Q8

A team wants to migrate from daily batch Spark jobs to Structured Streaming. What breaks?

L6 · streaming · state management · exactly-once
A strong answer includes:

State management is the biggest risk. Streaming jobs accumulate state for windows, dedup, and aggregations. Without size limits, state grows unbounded and causes OOM. Exactly-once delivery requires idempotent sinks or transactional writes (Delta Lake, Iceberg). Late-arriving data needs watermarks, and the watermark duration is a direct tradeoff between completeness and latency. Monitoring shifts from "did the batch succeed" to "is streaming lag under SLA." Schema evolution is harder because changing the schema means restarting the streaming query and potentially rebuilding state. Backfilling historical data still requires batch mode. Micro-batch (the default) processes with latencies of roughly 100ms and up. Continuous mode reaches millisecond-scale latency but only supports stateless map-like operations.
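The watermark tradeoff can be simulated in a few lines. A simplified sketch (this is the idea, not Spark's implementation; real watermarks advance per trigger and interact with window state):

```python
def apply_watermark(events, watermark_s):
    """Late-data handling: with a watermark of W seconds, an event is
    dropped once the high-water mark (max event time seen) has moved
    more than W past it. Larger W = more completeness, but more
    retained state and later results."""
    max_seen = float("-inf")
    kept, dropped = [], []
    for t, payload in events:  # (event_time_s, payload) in arrival order
        max_seen = max(max_seen, t)
        (kept if t >= max_seen - watermark_s else dropped).append(payload)
    return kept, dropped

stream = [(10, "a"), (11, "b"), (25, "c"), (12, "d"), (24, "e")]
kept, dropped = apply_watermark(stream, watermark_s=5)
print(dropped)  # "d" arrived 13s behind the 25s high-water mark
```

Rerunning with watermark_s=15 keeps "d" but means every window stays open 15 seconds longer, which is the completeness-versus-latency tradeoff in miniature.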

Q9

You see spark.speculation kill and re-launch a task 4 times. What is happening and should you disable speculation?

L5 · speculation · skew · debugging
A strong answer includes:

Speculation launches a duplicate task once spark.speculation.quantile (default 0.75) of a stage's tasks have finished and the original takes longer than spark.speculation.multiplier (default 1.5) times the median task duration. If the task keeps getting killed, the root cause is likely data skew, not a slow node. The speculative copy lands on the same skewed partition and also runs slow. Disabling speculation is the wrong fix. Instead, fix the skew: salt the key, increase parallelism, or enable AQE skew join handling. Speculation is useful for stragglers caused by hardware issues (degraded disks, noisy neighbors). It is harmful for stragglers caused by data skew because it doubles resource consumption without fixing the bottleneck.
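The trigger condition can be sketched directly. A simplified model (the function is illustrative; defaults mirror spark.speculation.multiplier and spark.speculation.quantile as described above):

```python
from statistics import median

def speculatable(finished_s, running_s, multiplier=1.5, quantile=0.75):
    """Which still-running tasks would get a speculative copy: only
    after `quantile` of the stage's tasks have finished, and only
    those slower than multiplier x median(finished durations)."""
    total = len(finished_s) + len(running_s)
    if len(finished_s) / total < quantile:
        return []  # not enough completed tasks to judge yet
    cutoff = multiplier * median(finished_s)
    return [t for t in running_s if t > cutoff]

finished = [10, 11, 12, 10, 11, 12]
still_running = [40, 13]
print(speculatable(finished, still_running))  # only the 40s task qualifies
```

The sketch also shows why skew defeats speculation: the duplicate of the 40-second task reads the same oversized partition, so it clears the cutoff just as slowly as the original.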

Q10

How would you design a Spark application that processes 100+ PB across a shared multi-tenant cluster?

L7 · system design · multi-tenant · resource management
A strong answer includes:

At 100+ PB scale (Netflix, Uber, Apple run workloads this size), the bottleneck shifts from compute to resource isolation. Use dynamic allocation with strict min/max executor bounds per tenant. Set spark.dynamicAllocation.maxExecutors per job to prevent one tenant from starving others. A production cluster at this scale runs 50-500 executors per job, 4-8 cores each. Use fair scheduler pools with weighted queues. Separate fast (< 10 min) and slow (> 1 hr) workloads into different pools. External shuffle service is mandatory so executors can be released while shuffle data persists. Monitor shuffle spill: if spill-to-disk exceeds 20% of shuffle volume, increase executor memory. The tradeoff: tighter resource limits improve fairness but increase queue wait times.
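The spill heuristic above is a one-liner worth encoding in monitoring. A minimal sketch (the helper name and the 20% limit are from this answer, not a Spark default; the inputs come from stage metrics in the Spark UI or event logs):

```python
def shuffle_health(shuffle_write_bytes, spill_disk_bytes, limit=0.20):
    """Flag a job whose disk spill exceeds ~20% of shuffle volume,
    a sign its executors need more memory (or the job needs fewer,
    larger partitions)."""
    ratio = spill_disk_bytes / shuffle_write_bytes
    return ratio, ratio > limit

gb = 1024**3
ratio, needs_more_memory = shuffle_health(100 * gb, 30 * gb)
print(round(ratio, 2), needs_more_memory)  # 0.3 True
```

Wired into per-tenant dashboards, a check like this turns the fairness-versus-wait-time tradeoff into data: you can see which pools are memory-starved before raising their executor caps.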

Frequently Asked Questions

How deep should senior engineers know Spark internals?
You should read a physical plan, diagnose shuffle bottlenecks from the Spark UI, explain the 60/40 memory split, and design for skew and fault tolerance. You do not need to know the scheduler source code. You do need to know why specific configurations exist (shuffle.partitions, autoBroadcastJoinThreshold, speculation.multiplier) and when to change them.
How are these different from general Spark interview questions?
General Spark questions test architecture: what is a DAG, how does lazy evaluation work, what is a shuffle. These questions assume you already know that. They test whether you can size executors, diagnose production incidents from Spark UI evidence, and make tradeoff decisions under constraints. The difference is the same as knowing what a join is vs knowing when to salt a skewed key.
What seniority level do these questions target?
L5 through L7. Each question is tagged. L5 questions test applied knowledge: can you diagnose a problem and pick the right fix. L6 questions test system design: can you reason about memory layouts and architectural tradeoffs. L7 questions test organizational impact: multi-tenant clusters, resource isolation, cross-team pipeline design.