Apache Spark Interview Questions for Data Engineers (2026)
TL;DR: Spark Interview Questions
Apache Spark is a distributed processing engine built on a driver/executor architecture that compiles transformations into a DAG and executes them lazily. Data engineering interviews focus on six areas: driver and executor architecture, DAG and stages, narrow vs wide transformations and shuffles, join strategies, Catalyst and Adaptive Query Execution, and memory tuning.
For most roles you need to walk through what happens when spark-submit runs, explain the difference between narrow and wide transformations, compare broadcast hash join vs sort-merge join, and diagnose a job that OOMs in production. Senior roles add Catalyst plan reading, AQE configuration, dynamic resource allocation, and integration with Delta Lake or Iceberg.
Spark Join Strategies: Decision Matrix
Knowing when Spark picks each join strategy is one of the most common Spark interview questions. Memorize this table.
| Join strategy | When Spark picks it | Shuffle? | Best for |
|---|---|---|---|
| Broadcast hash join | One side < spark.sql.autoBroadcastJoinThreshold (default 10MB) | None | Joining a large fact with a small dim table |
| Sort-merge join | Default for large-to-large equi-joins; one or both sides too big to broadcast | Both sides | Two large tables on the same key |
| Shuffle hash join | One side fits in memory after partitioning; AQE may pick this when it would be cheaper than sort-merge | Both sides | Medium-large + medium-small joins |
| Broadcast nested loop | Non-equi joins (range/inequality) where one side is small | None | BETWEEN, LIKE, custom predicate joins |
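The table can be condensed into a small decision function. This is a simplified sketch, not Spark's planner: `pick_join_strategy` is a hypothetical helper, it assumes accurate size estimates, and it leaves out shuffle hash join, hints, join type, and AQE, all of which influence the real choice.

```python
# Default of spark.sql.autoBroadcastJoinThreshold (10 MB), as quoted in the table.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def pick_join_strategy(left_bytes: int, right_bytes: int, is_equi_join: bool,
                       threshold: int = AUTO_BROADCAST_THRESHOLD) -> str:
    """Hypothetical helper: a simplified reading of the decision matrix."""
    smaller = min(left_bytes, right_bytes)
    if is_equi_join:
        # Equi-join: broadcast the small side if it fits, otherwise shuffle and sort both.
        if smaller < threshold:
            return "broadcast hash join"
        return "sort-merge join"
    # Non-equi join: hashing on the key is not possible.
    if smaller < threshold:
        return "broadcast nested loop join"
    return "expensive fallback (no side small enough to broadcast)"


GIB = 1024 ** 3
print(pick_join_strategy(50 * GIB, 5 * 1024 ** 2, is_equi_join=True))
print(pick_join_strategy(50 * GIB, 40 * GIB, is_equi_join=True))
print(pick_join_strategy(50 * GIB, 1024, is_equi_join=False))
```

In an interview, walking through the branches in this order (equi or not, then size against the threshold) is usually clearer than reciting the table.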
What Spark Interviewers Expect
Spark interview questions test three layers of understanding. The first is whether you know the architecture: drivers, executors, stages, and tasks. The second is whether you can write efficient transformations. The third is whether you can diagnose and fix production performance problems.
Junior candidates
Mid-level candidates
Senior candidates
Spark Core Concepts Interviewers Test
Six concepts surface in nearly every Spark interview. If you can explain each one in your own words and connect it to a real performance problem, most questions become straightforward.
- Driver and Executor Architecture: the driver coordinates execution, parses your code, builds the DAG, and schedules tasks. Executors run those tasks and store data. Interviewers want you to explain what happens when the driver runs out of memory (usually a collect() or broadcast gone wrong) versus when an executor OOMs (usually a skewed partition or too much cached data).
- DAG and Stage Execution: Spark builds a Directed Acyclic Graph of transformations. Shuffle boundaries create stage breaks. Within a stage, tasks run in parallel on partitions. You need this to read the Spark UI and diagnose why a job is slow.
- Narrow vs Wide Transformations: narrow transformations (map, filter, union) process each partition independently. Wide transformations (groupBy, join, repartition) require data movement across the cluster. Every wide transformation creates a shuffle, which writes intermediate data to disk and transfers it over the network.
- Catalyst Optimizer: Catalyst converts logical plans into optimized physical plans. It applies 100+ optimization rules including predicate pushdown, column pruning, constant folding, and join reordering. Understanding Catalyst explains why DataFrame operations outperform equivalent RDD code and why UDFs block optimization.
- Adaptive Query Execution (AQE): AQE re-optimizes the query plan at runtime based on actual data statistics. It coalesces small partitions after shuffles, switches join strategies when one side is smaller than expected, and optimizes skew joins. Available in Spark 3.0+ and enabled by default in 3.2+.
- Memory Management: roughly 60% of executor heap goes to the unified pool (spark.memory.fraction = 0.6, applied after a fixed 300 MB reserve), which is shared between execution (shuffles, joins, sorts) and storage (cache). The unified model lets one borrow from the other. Interviewers test whether you can tune spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction to solve specific problems.
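The memory split in the last bullet is easy to work through numerically. The sketch below assumes the documented defaults (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5) and the fixed 300 MB reserve; `unified_memory_breakdown` is a hypothetical helper, and the heap size the JVM actually reports is usually a little below the configured value, so treat the numbers as estimates.

```python
RESERVED_MB = 300  # fixed amount Spark sets aside before applying spark.memory.fraction


def unified_memory_breakdown(executor_heap_mb: float,
                             memory_fraction: float = 0.6,
                             storage_fraction: float = 0.5) -> dict:
    """Hypothetical helper: estimate the on-heap regions of one executor."""
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction      # shared by execution and storage
    storage = unified * storage_fraction    # portion of cache protected from eviction
    return {
        "unified_mb": unified,
        "storage_protected_mb": storage,
        "execution_initial_mb": unified - storage,
        "user_mb": usable - unified,        # user data structures, UDF objects
    }


breakdown = unified_memory_breakdown(4096)  # e.g. spark.executor.memory=4g
for region, size_mb in breakdown.items():
    print(f"{region}: {size_mb:.1f} MB")
```

A 4 GB executor therefore has well under 4 GB available for shuffles and caching, which is why "just double executor memory" is often the first fix tried for an executor OOM.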
Spark Interview Questions with Guidance
Ten questions that come up in real Spark loops, each with what a strong answer covers.
Walk through what happens when you submit a Spark application. Cover every component from spark-submit to task completion.
spark-submit sends the application to the cluster manager (YARN, K8s, or standalone). In cluster mode, the cluster manager allocates a container for the driver; in client mode, the driver runs on the submitting machine. The driver initializes SparkContext, negotiates executor containers, and sends the application JAR to executors. When an action triggers execution, the driver builds a DAG, splits it into stages at shuffle boundaries, and creates tasks (one per partition per stage). The scheduler sends tasks to executors. Executors run the tasks, read/write shuffle data, and report results back to the driver.
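The "split into stages at shuffle boundaries, one task per partition per stage" step can be modeled in a few lines. This is a toy model for a linear pipeline only (a real join has two parent stages), and `WIDE_OPS`, `split_into_stages`, and `total_tasks` are hypothetical names, not Spark APIs.

```python
# Operations treated as shuffle boundaries in this toy model.
WIDE_OPS = {"groupBy", "join", "repartition", "reduceByKey", "groupByKey"}


def split_into_stages(ops):
    """Cut a linear list of operations into stages; each wide op starts a new stage."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)  # the shuffle write closes the upstream stage
            current = []
        current.append(op)          # the wide op is the shuffle-read side of the next stage
    if current:
        stages.append(current)
    return stages


def total_tasks(partitions_per_stage):
    """One task per partition per stage."""
    return sum(partitions_per_stage)


pipeline = ["read", "filter", "map", "groupBy", "map", "repartition", "write"]
stages = split_into_stages(pipeline)
print(stages)
print(total_tasks([400, 200, 50]))  # assumed partition counts for the three stages
```

The same reasoning is what you apply when reading the Stages tab in the Spark UI: count the shuffles, then look at how many tasks each stage ran.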
Explain the difference between narrow and wide dependencies with concrete examples. Why does this distinction matter?
Narrow dependencies mean each parent partition is used by at most one child partition: map(), filter(), union(). Wide dependencies mean multiple child partitions depend on a single parent: groupByKey(), reduceByKey(), join(). This matters because wide dependencies create shuffle boundaries, and shuffles are among the most expensive operations in Spark: they force data serialization, disk writes, and network transfer. A strong answer connects this to stage boundaries in the DAG and explains how minimizing shuffles improves performance.
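A small in-memory model makes the dependency difference visible. The sketch below uses plain Python lists as stand-ins for partitions; `narrow_map`, `wide_reduce_by_key`, and the `ord`-based partitioner are illustrative assumptions, not Spark code.

```python
from collections import defaultdict

# Three "partitions" of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def narrow_map(parts, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(record) for record in part] for part in parts]


def wide_reduce_by_key(parts, num_out):
    """Wide: records are regrouped by key, so a child may read from many parents."""
    out = [defaultdict(int) for _ in range(num_out)]
    parents = [set() for _ in range(num_out)]
    for src_idx, part in enumerate(parts):
        for key, value in part:
            dst = ord(key[0]) % num_out  # toy partitioner standing in for a hash
            out[dst][key] += value
            parents[dst].add(src_idx)    # record which parent fed this child
    return [dict(d) for d in out], parents


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
reduced, parents = wide_reduce_by_key(partitions, num_out=2)
print(doubled)
print(reduced)
print(parents)
```

The `parents` sets are the point: after the map, child partition i depends only on parent i, while after the reduce a child can depend on every parent, which is what forces data to move.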
What are the different join strategies Spark can use? How does it decide which one to pick?
Broadcast hash join: sends the small table to all executors, no shuffle needed. Sort-merge join: both sides are shuffled and sorted by the join key, then merged. Shuffle hash join: both sides are shuffled by join key, the smaller side builds a hash table per partition. Broadcast nested loop join: fallback for non-equi joins with a small table. Spark decides based on table sizes, join type, and hints. autoBroadcastJoinThreshold controls the broadcast cutoff. AQE can switch strategies at runtime if actual sizes differ from estimates.
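The two most common strategies differ in algorithm, not in result. The following single-machine sketch shows the core of each on lists of (key, value) tuples; the function names and sample data are assumptions for illustration, and the distributed parts (broadcasting, shuffling by key) are left out.

```python
def broadcast_hash_join(large, small):
    """Build a hash table from the small side, then stream the large side past it."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, sv) for key, lv in large for sv in lookup.get(key, [])]


def sort_merge_join(left, right):
    """Sort both sides by key, then advance two cursors and join matching runs."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


orders = [(1, "o1"), (2, "o2"), (1, "o3"), (3, "o4")]
customers = [(1, "alice"), (2, "bob"), (4, "dan")]
bhj = broadcast_hash_join(orders, customers)
smj = sort_merge_join(orders, customers)
print(sorted(bhj) == sorted(smj))
```

The hash join needs the whole small side in memory but never sorts; the merge join needs no large in-memory structure but pays for a sort on both sides, which mirrors the trade-off Spark makes at cluster scale.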
A Spark job runs fine on small data but OOMs on production data. Walk through your debugging approach.
First, check the Spark UI to identify which stage fails. If the driver OOMs, look for collect(), broadcast of large tables, or large accumulator values. If an executor OOMs, check for skewed partitions (one task processing far more data), insufficient executor memory, or excessive caching. Fixes include increasing executor memory, repartitioning skewed data, salting join keys, replacing collect() with write operations, or reducing cache usage. A strong answer mentions checking GC logs and the difference between on-heap and off-heap memory.
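Salting is the least intuitive of these fixes, so a toy model helps. The sketch below assumes a deliberately simple deterministic partitioner (byte sum modulo partition count) in place of Spark's hash partitioner, and all names are hypothetical. It only shows how salting spreads one hot key; in a real salted join the other side must also be replicated once per salt value so matches are preserved.

```python
from collections import Counter

NUM_PARTITIONS = 8
SALT_BUCKETS = 8


def toy_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic stand-in for a hash partitioner (not Spark's actual hash)."""
    return sum(key.encode("utf-8")) % num_partitions


def partition_loads(keys):
    """Count how many records land on each partition."""
    return Counter(toy_partition(k) for k in keys)


def salt_keys(keys, buckets: int = SALT_BUCKETS):
    """Append a round-robin salt so one logical key becomes `buckets` physical keys."""
    return [f"{key}_{i % buckets}" for i, key in enumerate(keys)]


keys = ["hot"] * 1000 + ["a", "b", "c"]  # one heavily skewed key
before = partition_loads(keys)
after = partition_loads(salt_keys(keys))
print("largest partition before salting:", max(before.values()))
print("largest partition after salting: ", max(after.values()))
```

The largest partition is what determines the slowest task and the peak executor memory, so shrinking it is the goal; the cost is the extra replication on the other side of the join.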
Explain Spark SQL and how it relates to the DataFrame API. Can they be used together?
Spark SQL lets you run SQL queries on registered temporary views. The DataFrame API provides programmatic access to the same operations. Both compile to the same logical plan and go through the Catalyst optimizer. You can freely mix them: create a temp view from a DataFrame, query it with SQL, and convert the result back to a DataFrame. A strong answer notes that Spark SQL supports ANSI SQL, window functions, CTEs, and subqueries, and that performance is identical between SQL and DataFrame approaches.
What is speculative execution in Spark? When does it help and when does it hurt?
Speculative execution launches duplicate copies of slow tasks on other executors. If the duplicate finishes first, Spark kills the original. It helps when slowness is caused by a bad node (disk failure, network congestion). It hurts when slowness is caused by data skew, because the duplicate task gets the same skewed partition and is just as slow. It also wastes resources by running redundant tasks. A strong answer mentions spark.speculation.quantile and spark.speculation.multiplier as tuning knobs.
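The two tuning knobs combine into a simple rule: once a given fraction of tasks in a stage has finished, any running task slower than a multiple of the median finished duration becomes a candidate. The sketch below is a simplified model of that rule with a hypothetical function name; it omits details such as the minimum runtime Spark requires before speculating, and the quantile and multiplier values passed in are illustrative rather than claimed defaults.

```python
from statistics import median


def speculation_candidates(finished, running, quantile, multiplier):
    """Return ids of running tasks that a simplified speculation rule would duplicate.

    finished: list of durations (seconds) of completed tasks in the stage
    running:  dict of task id -> elapsed seconds so far
    """
    total = len(finished) + len(running)
    if total == 0 or len(finished) < quantile * total:
        return []  # not enough of the stage has completed to judge stragglers
    threshold = multiplier * median(finished)
    return [task_id for task_id, elapsed in running.items() if elapsed > threshold]


# 9 of 10 tasks done in about 10s; the last one has been running for 40s.
print(speculation_candidates([10] * 9, {"t9": 40}, quantile=0.75, multiplier=1.5))

# Only 2 of 5 tasks done: too early to speculate, whatever the elapsed times.
print(speculation_candidates([10, 10], {"a": 40, "b": 5, "c": 50},
                             quantile=0.75, multiplier=1.5))
```

Note that the rule looks only at duration. It cannot tell a bad node from a skewed partition, which is why speculation wastes resources on skewed jobs.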
How does Spark handle fault tolerance? What happens when an executor dies mid-job?
Spark tracks the lineage (DAG) of every RDD/DataFrame. When an executor dies, the driver reassigns its tasks to surviving executors. Those executors recompute the lost partitions by replaying the lineage from the most recent shuffle boundary. Shuffle data written to disk by earlier stages is preserved as long as it is still reachable, so only the lost tasks of the current stage re-execute. If shuffle files are also lost (the node died), Spark re-runs the upstream stages that produced them. A strong answer mentions that caching provides a shortcut in the lineage but cached data on the dead executor is lost.
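Lineage-based recovery amounts to "keep the recipe, not the result". A minimal sketch, assuming partitions are plain lists and the lineage is a list of deterministic functions (all names here are hypothetical, and shuffles are not modeled):

```python
def compute(source_rows, lineage):
    """Replay every transformation in the lineage over one source partition."""
    rows = list(source_rows)
    for step in lineage:
        rows = step(rows)
    return rows


# The lineage: deterministic, side-effect-free transformations.
lineage = [
    lambda rows: [r * 2 for r in rows],            # map
    lambda rows: [r for r in rows if r % 3 != 0],  # filter
]

# Durable source data, keyed by partition id (stand-in for files in storage).
source = {0: [1, 2, 3], 1: [4, 5, 6]}

# Results held in executor memory.
executor_results = {pid: compute(rows, lineage) for pid, rows in source.items()}
expected = dict(executor_results)

# Simulate losing the executor that held partition 1, then recover it from lineage.
del executor_results[1]
executor_results[1] = compute(source[1], lineage)
print(executor_results == expected)
```

Recovery only works because the transformations are deterministic and the source is durable; a UDF with side effects or a non-replayable source breaks that assumption.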
What is dynamic resource allocation and when should you enable it?
Dynamic allocation lets Spark add and release executors based on workload. When tasks are queued, Spark requests more executors. When executors are idle, Spark releases them. This is useful for long-running jobs with variable load (ETL pipelines that read many small tables then join one large table). It saves cluster resources by not holding executors you do not need. A strong answer mentions the external shuffle service requirement: without it, Spark cannot safely release executors that hold shuffle data. Since Spark 3.0, spark.dynamicAllocation.shuffleTracking.enabled is an alternative where an external shuffle service is not available.
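The settings involved can be written down as a small configuration map. The keys below are real Spark configuration names; the values are illustrative assumptions, not recommendations, and `to_submit_flags` is a hypothetical helper that renders them as spark-submit arguments.

```python
# Illustrative values only; tune per cluster and workload.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Lets executors be released without losing the shuffle files they wrote.
    "spark.shuffle.service.enabled": "true",
}


def to_submit_flags(conf):
    """Render a config dict as a flat list of `--conf key=value` arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(to_submit_flags(dynamic_allocation_conf)))
```

Setting a maximum matters as much as enabling the feature: without one, a single job with a large backlog of tasks can request most of a shared cluster.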
Explain the difference between client mode and cluster mode in Spark. When would you choose each?
In client mode, the driver runs on the machine that submitted the job. In cluster mode, the driver runs inside the cluster on a worker node. Client mode is useful for interactive development (notebooks, spark-shell) because you can see stdout directly. Cluster mode is better for production because the driver benefits from cluster resources and network proximity to executors. A strong answer notes that in client mode, the driver's machine becomes a single point of failure and a network bottleneck.
How would you optimize a Spark job that reads from a data lake with thousands of small files?
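Small files create excessive overhead: every file costs a listing call, an open, and scheduling work, and with the RDD file APIs each file typically becomes at least one task. DataFrame file sources pack small files into shared partitions up to spark.sql.files.maxPartitionBytes (default 128 MB), charging spark.sql.files.openCostInBytes (default 4 MB) per file, so those two settings control how many tasks the scan produces. Other solutions: parallelize file listing (spark.sql.sources.parallelPartitionDiscovery.threshold), compact small files into larger ones as a preprocessing step, or use Delta Lake / Iceberg, which provide compaction operations that you run or schedule. A strong answer mentions that the driver also suffers because it must list and plan thousands of files, which can cause driver OOM or long planning times.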
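The effect of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes on task count can be estimated with a simplified packing model. This is a sketch under stated assumptions: `pack_files` is a hypothetical function, it uses the documented defaults (128 MB and 4 MB), and it ignores the way Spark also caps the split size based on available parallelism, so real partition counts can differ.

```python
MIB = 1024 * 1024


def pack_files(file_sizes, max_partition_bytes=128 * MIB, open_cost_bytes=4 * MIB):
    """Greedily pack files into partitions, charging a fixed open cost per file."""
    partitions = []
    current, current_size = [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would push it past the limit.
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


small = [1 * MIB] * 1000        # 1,000 files of 1 MiB each
compacted = [125 * MIB] * 8     # the same data after compaction
print("files:", len(small), "-> scan tasks:", len(pack_files(small)))
print("files:", len(compacted), "-> scan tasks:", len(pack_files(compacted)))
```

Because the open cost is charged per file, 1 MiB files fill a partition long before they reach 128 MB of actual data, which is why compaction reduces task count more than raising maxPartitionBytes alone.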
Practice Spark in Your Browser
Memorizing answers does not build the muscle memory interviewers test for. DataDriven runs a Spark-compatible execution engine in your browser that handles both PySpark and Scala syntax. Write joins, aggregations, and window functions against real datasets, then run mock interviews where an AI interviewer probes your understanding of the execution plan.
Worked Example: Reading a Spark Explain Plan
Interviewers often show you a physical plan and ask what it means. Here is a simple join plan and how to read it.
Inner join, two scans, one shuffle per side
Common Spark Interview Mistakes
Patterns that flag a candidate as someone who has read about Spark but not run it under pressure.
Spark Interview Questions FAQ
How deeply should I understand Spark internals for a data engineering interview?
Do companies still ask about RDDs?
Should I study Spark with Scala or Python for interviews?
What is the difference between Spark interview questions for data engineers versus data scientists?
What are the core concepts of Apache Spark interviewers always ask about?
What is lazy evaluation in Spark and why does it matter?
What are the main Spark transformations interviewers ask about?
What is the difference between Apache Spark and Apache Flink?
Practice Spark Interview Questions
Build the distributed systems intuition that interviewers test for. Practice with real execution environments and immediate feedback.