Apache Spark Interview Questions for Data Engineers (2026)

Spark processes over 100 PB daily at companies like Netflix, Uber, and Apple. Interview questions focus on architecture, execution model, and performance tuning. Knowing how to write a query is table stakes. Knowing why it runs slowly is what gets you the offer.
Updated April 2026 · By The DataDriven Team

TL;DR: Spark Interview Questions

Apache Spark is a distributed processing engine built on a driver/executor architecture that compiles transformations into a DAG and executes them lazily. Data engineering interviews focus on six areas: driver and executor architecture, DAG and stages, narrow vs wide transformations and shuffles, join strategies, Catalyst and Adaptive Query Execution, and memory tuning.

For most roles you need to walk through what happens when spark-submit runs, explain the difference between narrow and wide transformations, compare broadcast hash join vs sort-merge join, and diagnose a job that OOMs in production. Senior roles add Catalyst plan reading, AQE configuration, dynamic resource allocation, and integration with Delta Lake or Iceberg.

Coverage
Covers Spark 3.x features including Adaptive Query Execution, dynamic partition pruning, and the unified memory model.

Spark Join Strategies: Decision Matrix

Knowing when Spark picks each join strategy is one of the most common Spark interview questions. Memorize this table.

Join strategy | When Spark picks it | Shuffle? | Best for
Broadcast hash join | One side is smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB) | None | Joining a large fact table with a small dimension table
Sort-merge join | Default for large-to-large equi-joins; neither side is small enough to broadcast | Both sides | Two large tables joined on the same key
Shuffle hash join | One side fits in memory after partitioning; AQE may pick this when it would be cheaper than sort-merge | Both sides | Medium-large + medium-small joins
Broadcast nested loop | Non-equi joins (range/inequality) where one side is small | None | BETWEEN, LIKE, custom predicate joins
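
The matrix can be restated as a small decision function. This is a toy sketch in plain Python, not Spark's planner: `pick_join_strategy` and its parameters are hypothetical names, and it deliberately leaves out shuffle hash join, hints, and AQE, since those depend on settings such as spark.sql.join.preferSortMergeJoin and on runtime statistics.

```python
# Default of spark.sql.autoBroadcastJoinThreshold quoted in the table (10 MB).
AUTO_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join_strategy(left_bytes, right_bytes, is_equi_join,
                       threshold=AUTO_BROADCAST_THRESHOLD_BYTES):
    """Toy restatement of the decision matrix; NOT Spark's planner logic."""
    smaller_side = min(left_bytes, right_bytes)
    if smaller_side < threshold:
        # A small side can be copied to every executor, so no shuffle is needed.
        return "broadcast hash join" if is_equi_join else "broadcast nested loop join"
    if is_equi_join:
        # Neither side is broadcastable: shuffle and sort both sides.
        return "sort-merge join"
    return "not covered by the matrix"


GIB = 1024 ** 3
MIB = 1024 ** 2
print(pick_join_strategy(50 * GIB, 5 * MIB, is_equi_join=True))
print(pick_join_strategy(50 * GIB, 40 * GIB, is_equi_join=True))
print(pick_join_strategy(50 * GIB, 5 * MIB, is_equi_join=False))
```

In an interview, the useful part is the order of the checks: size against the broadcast threshold first, then whether the join condition is an equality.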

What Spark Interviewers Expect

Spark interview questions test three layers of understanding. The first is whether you know the architecture: drivers, executors, stages, and tasks. The second is whether you can write efficient transformations. The third is whether you can diagnose and fix production performance problems.

Junior candidates should explain the execution model clearly: how a DAG is built, what triggers execution, and why shuffles are expensive.

Mid-level candidates need to compare join strategies, explain AQE, and discuss partitioning choices. You should be comfortable reading a Spark physical plan.

Senior candidates are expected to design Spark applications for reliability and efficiency at scale. This includes memory tuning, dynamic allocation, fault tolerance tradeoffs, and integration with storage layers like Delta Lake or Iceberg.

Spark Core Concepts Interviewers Test

Six concepts surface in nearly every Spark interview. If you can explain each one in your own words and connect it to a real performance problem, most questions become straightforward.

Master these six and most Spark interviews unlock
  • Driver and Executor Architecture: the driver coordinates execution, parses your code, builds the DAG, and schedules tasks. Executors run those tasks and store data. Interviewers want you to explain what happens when the driver runs out of memory (usually a collect() or broadcast gone wrong) versus when an executor OOMs (usually a skewed partition or too much cached data).
  • DAG and Stage Execution: Spark builds a Directed Acyclic Graph of transformations. Shuffle boundaries create stage breaks. Within a stage, tasks run in parallel on partitions. You need this to read the Spark UI and diagnose why a job is slow.
  • Narrow vs Wide Transformations: narrow transformations (map, filter, union) process each partition independently. Wide transformations (groupBy, join, repartition) require data movement across the cluster. Every wide transformation creates a shuffle, which writes intermediate data to disk and transfers it over the network.
  • Catalyst Optimizer: Catalyst converts logical plans into optimized physical plans. It applies 100+ optimization rules including predicate pushdown, column pruning, constant folding, and join reordering. Understanding Catalyst explains why DataFrame operations outperform equivalent RDD code and why UDFs block optimization.
  • Adaptive Query Execution (AQE): AQE re-optimizes the query plan at runtime based on actual data statistics. It coalesces small partitions after shuffles, switches join strategies when one side is smaller than expected, and optimizes skew joins. Available in Spark 3.0+ and enabled by default in 3.2+.
  • Memory Management: after a fixed reserve of about 300 MB, roughly 60% of the executor heap goes to the unified pool (spark.memory.fraction = 0.6), which is shared between execution (shuffles, joins, sorts) and storage (cache). The unified model lets one borrow from the other. Interviewers test whether you can tune spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction to solve specific problems.
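
The memory split in the last bullet is easier to reason about with numbers. The sketch below is plain arithmetic rather than Spark code. It assumes a fixed 300 MiB reserve on the heap and a spark.memory.storageFraction of 0.5; the function name `unified_memory_mib` is hypothetical.

```python
RESERVED_MIB = 300  # assumed fixed reserve taken off the heap before the split


def unified_memory_mib(executor_heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate sizes of the unified pool and its storage share, in MiB."""
    unified = (executor_heap_mib - RESERVED_MIB) * memory_fraction
    storage = unified * storage_fraction  # portion of cache protected from eviction
    return {
        "unified": round(unified, 1),
        "storage": round(storage, 1),
        "execution": round(unified - storage, 1),
    }


# An 8 GiB executor heap leaves roughly 4.6 GiB for execution plus storage.
sizes = unified_memory_mib(8 * 1024)
print(sizes)
```

The takeaway for tuning questions: raising spark.executor.memory grows every pool, while spark.memory.fraction and spark.memory.storageFraction only move the boundaries inside the same heap.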

Spark Interview Questions with Guidance

Ten questions that come up in real Spark loops, each with what a strong answer covers.

Q01

Walk through what happens when you submit a Spark application. Cover every component from spark-submit to task completion.

A strong answer includes:

spark-submit sends the application to the cluster manager (YARN, K8s, or standalone). In cluster mode, the cluster manager allocates a container for the driver; in client mode, the driver runs in the submitting process. The driver initializes SparkContext, negotiates executor containers, and sends the application JAR to executors. When an action triggers execution, the driver builds a DAG, splits it into stages at shuffle boundaries, and creates tasks (one per partition per stage). The scheduler sends tasks to executors. Executors run the tasks, read/write shuffle data, and report results back to the driver.
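
The "split into stages at shuffle boundaries, one task per partition per stage" step can be sketched as follows. This is a toy model of a linear pipeline, not Spark's DAGScheduler: `split_into_stages` and `tasks_per_stage` are hypothetical helpers, the partition counts are made up, and AQE partition coalescing is ignored.

```python
def split_into_stages(ops):
    """ops is a list of (name, needs_shuffle); a shuffle starts a new stage."""
    stages = [[]]
    for name, needs_shuffle in ops:
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


def tasks_per_stage(stages, input_partitions, shuffle_partitions):
    """First stage runs one task per input partition, later ones per shuffle partition."""
    return [input_partitions] + [shuffle_partitions] * (len(stages) - 1)


pipeline = [
    ("scan", False),
    ("filter", False),
    ("groupBy", True),      # wide: needs a shuffle
    ("withColumn", False),
    ("distinct", True),     # wide: needs a shuffle
]
stages = split_into_stages(pipeline)
counts = tasks_per_stage(stages, input_partitions=40, shuffle_partitions=200)
print(stages)
print(counts)
```

Two wide operations give three stages, which is the same count you would look for in the Spark UI for a job shaped like this.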

Q02

Explain the difference between narrow and wide dependencies with concrete examples. Why does this distinction matter?

A strong answer includes:

Narrow dependencies mean each parent partition is used by at most one child partition: map(), filter(), union(). Wide dependencies mean a single parent partition can feed multiple child partitions: groupByKey(), reduceByKey(), and join() on inputs that are not already co-partitioned. This matters because wide dependencies create shuffle boundaries, and a shuffle is among the most expensive operations in Spark: it forces data serialization, disk writes, and network transfer. A strong answer connects this to stage boundaries in the DAG and explains how minimizing shuffles improves performance.
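
A few lines of plain Python make the distinction concrete. This is a simulation under stated assumptions, not Spark code: partitions are lists, and `zlib.crc32` stands in for Spark's hash partitioner purely so the output is deterministic.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    # crc32 is a stand-in for the real partitioner (an assumption of this sketch).
    return zlib.crc32(key.encode()) % num_partitions


input_partitions = [
    [("us", 1), ("de", 2)],
    [("us", 3), ("fr", 4)],
    [("de", 5), ("us", 6)],
]

# Narrow: each output partition is computed from exactly one input partition.
mapped = [[(k, v * 10) for k, v in part] for part in input_partitions]


def shuffle_group_by_key(partitions, num_out):
    """Wide: every record is routed by key, so any input can feed any output."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[target_partition(key, num_out)][key].append(value)
    return [dict(p) for p in out]


grouped = shuffle_group_by_key(mapped, num_out=2)
print(grouped)
```

The "us" records start in three different input partitions and end up together in one output partition. That movement is the shuffle.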

Q03

What are the different join strategies Spark can use? How does it decide which one to pick?

A strong answer includes:

Broadcast hash join: sends the small table to all executors, no shuffle needed. Sort-merge join: both sides are shuffled and sorted by the join key, then merged. Shuffle hash join: both sides are shuffled by join key, the smaller side builds a hash table per partition. Broadcast nested loop join: fallback for non-equi joins with a small table. Spark decides based on table sizes, join type, and hints. autoBroadcastJoinThreshold controls the broadcast cutoff. AQE can switch strategies at runtime if actual sizes differ from estimates.

Q04

A Spark job runs fine on small data but OOMs on production data. Walk through your debugging approach.

A strong answer includes:

First, check the Spark UI to identify which stage fails. If the driver OOMs, look for collect(), broadcast of large tables, or large accumulator values. If an executor OOMs, check for skewed partitions (one task processing far more data), insufficient executor memory, or excessive caching. Fixes include increasing executor memory, repartitioning skewed data, salting join keys, replacing collect() with write operations, or reducing cache usage. A strong answer mentions checking GC logs and the difference between on-heap and off-heap memory.
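
Salting is the fix candidates most often name without being able to explain. The sketch below shows the idea in plain Python with made-up data: `NUM_SALTS` and the key names are hypothetical, and round-robin salts are used only to keep the example deterministic (real jobs typically draw a random salt).

```python
from collections import Counter

NUM_SALTS = 8  # hypothetical; chosen to match the degree of skew

# Skewed fact side: one hot key dominates.
fact_keys = ["hot"] * 1000 + ["a"] * 10 + ["b"] * 10

# Salt the fact side so the hot key becomes NUM_SALTS distinct join keys.
salted_fact = [(key, i % NUM_SALTS) for i, key in enumerate(fact_keys)]

# Explode the small side so every (key, salt) pair still finds its match.
dim = {"hot": "Hot Co", "a": "A Co", "b": "B Co"}
salted_dim = {(key, salt): name
              for key, name in dim.items()
              for salt in range(NUM_SALTS)}

joined = [(key, salted_dim[(key, salt)]) for key, salt in salted_fact]

largest_group_before = max(Counter(fact_keys).values())
largest_group_after = max(Counter(salted_fact).values())
print(largest_group_before, largest_group_after)
```

The join result is unchanged, but the largest group a single task must process shrinks by a factor of NUM_SALTS. The cost is that the small side grows by the same factor.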

Q05

Explain Spark SQL and how it relates to the DataFrame API. Can they be used together?

A strong answer includes:

Spark SQL lets you run SQL queries on registered temporary views. The DataFrame API provides programmatic access to the same operations. Both compile to the same logical plan and go through the Catalyst optimizer. You can freely mix them: create a temp view from a DataFrame, query it with SQL, and convert the result back to a DataFrame. A strong answer notes that Spark SQL supports ANSI SQL, window functions, CTEs, and subqueries, and that performance is identical between SQL and DataFrame approaches.

Q06

What is speculative execution in Spark? When does it help and when does it hurt?

A strong answer includes:

Speculative execution launches duplicate copies of slow tasks on other executors. If the duplicate finishes first, Spark kills the original. It helps when slowness is caused by a bad node (disk failure, network congestion). It hurts when slowness is caused by data skew, because the duplicate task gets the same skewed partition and is just as slow. It also wastes resources by running redundant tasks. A strong answer mentions spark.speculation.quantile and spark.speculation.multiplier as tuning knobs.
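
How the two knobs interact can be sketched as a simplified model: once a `quantile` fraction of a stage's tasks has finished, any running task slower than `multiplier` times the median finished duration becomes a candidate. The function below is hypothetical and omits details of Spark's scheduler; the values passed in are illustrative, not claimed defaults.

```python
import statistics


def speculation_candidates(finished_durations, running_elapsed, total_tasks,
                           quantile, multiplier):
    """Return ids of running tasks that this simplified model would re-launch."""
    if len(finished_durations) < quantile * total_tasks:
        return []  # too few tasks finished to judge what "slow" means
    threshold = multiplier * statistics.median(finished_durations)
    return [task_id for task_id, elapsed in running_elapsed.items()
            if elapsed > threshold]


finished = [10, 11, 9, 10, 12, 10, 11, 10]   # seconds, 8 of 10 tasks done
running = {"task-8": 12, "task-9": 40}       # seconds elapsed so far

slow = speculation_candidates(finished, running, total_tasks=10,
                              quantile=0.75, multiplier=1.5)
print(slow)
```

If task-9 is slow because of a skewed partition rather than a bad node, its duplicate will read the same data and be just as slow, which is the failure mode described above.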

Q07

How does Spark handle fault tolerance? What happens when an executor dies mid-job?

A strong answer includes:

Spark tracks the lineage (DAG) of every RDD/DataFrame. When an executor dies, the driver reschedules its unfinished tasks on surviving executors, which recompute the lost partitions by replaying the lineage from the most recent shuffle output or cached data. Shuffle data written to disk by earlier stages is preserved, so only the lost tasks of the current stage re-execute. If shuffle files are also lost (the node died), Spark re-runs the upstream stages that produced them. A strong answer mentions that caching provides a shortcut in the lineage but cached data on the dead executor is lost.
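
Lineage-based recovery amounts to "keep the recipe, not the result". The toy below illustrates that with plain Python lists; the source data and the two transformations are invented for the example, and it covers only narrow transformations within a single stage.

```python
# Durable source data, keyed by partition id (stands in for files on storage).
source = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage: an ordered record of transformations, not their output.
lineage = [lambda x: x * 2, lambda x: x + 1]


def compute_partition(partition_id):
    """Rebuild one partition from the source by replaying the lineage."""
    rows = source[partition_id]
    for transform in lineage:
        rows = [transform(row) for row in rows]
    return rows


materialized = {pid: compute_partition(pid) for pid in source}

# Simulate losing the executor that held partition 1.
del materialized[1]

# Recovery: recompute only the lost partition; partition 0 is untouched.
materialized[1] = compute_partition(1)
print(materialized)
```

Only the missing partition is recomputed, which is why a single executor failure usually costs a few task retries rather than a full job restart.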

Q08

What is dynamic resource allocation and when should you enable it?

A strong answer includes:

Dynamic allocation lets Spark add and release executors based on workload. When tasks are queued, Spark requests more executors. When executors are idle, Spark releases them. This is useful for long-running jobs with variable load (ETL pipelines that read many small tables then join one large table). It saves cluster resources by not holding executors you do not need. A strong answer mentions the external shuffle service requirement: without it (or, in Spark 3.0+, without spark.dynamicAllocation.shuffleTracking.enabled), Spark cannot safely release executors that hold shuffle data.
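
A minimal configuration sketch follows, written as a Python dict so the shuffle-data requirement can be checked in code. The property names are real Spark settings; the values are placeholders to adapt, and `shuffle_data_is_safe` and `to_submit_args` are hypothetical helpers rather than part of any Spark API.

```python
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",          # placeholder value
    "spark.dynamicAllocation.maxExecutors": "50",         # placeholder value
    "spark.dynamicAllocation.executorIdleTimeout": "60s", # placeholder value
    "spark.shuffle.service.enabled": "true",
}


def shuffle_data_is_safe(conf):
    """True if some mechanism keeps shuffle data usable when executors go away."""
    return (conf.get("spark.shuffle.service.enabled") == "true"
            or conf.get("spark.dynamicAllocation.shuffleTracking.enabled") == "true")


def to_submit_args(conf):
    """Render the dict as --conf arguments for a spark-submit command line."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(shuffle_data_is_safe(dynamic_allocation_conf))
print(" ".join(to_submit_args(dynamic_allocation_conf)))
```

A check like `shuffle_data_is_safe` is the kind of guard worth adding to a job launcher, since enabling dynamic allocation without either setting is a common misconfiguration.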

Q09

Explain the difference between client mode and cluster mode in Spark. When would you choose each?

A strong answer includes:

In client mode, the driver runs on the machine that submitted the job. In cluster mode, the driver runs inside the cluster on a worker node. Client mode is useful for interactive development (notebooks, spark-shell) because you can see stdout directly. Cluster mode is better for production because the driver benefits from cluster resources and network proximity to executors. A strong answer notes that in client mode, the driver's machine becomes a single point of failure and a network bottleneck.

Q10

How would you optimize a Spark job that reads from a data lake with thousands of small files?

A strong answer includes:

Small files create excessive overhead because every file must be listed, opened, and planned, and in the worst case each one becomes its own task. Solutions: tune how files are packed into input partitions (spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes), adjust file listing parallelism (spark.sql.sources.parallelPartitionDiscovery.threshold), compact small files into larger ones as a preprocessing step, or use Delta Lake / Iceberg, which provide compaction operations for this purpose. A strong answer mentions that the driver also suffers because it must list and plan thousands of files, which can cause driver OOM or long planning times.
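
For the compaction step, the first question is how many output files to aim for. The arithmetic below is a sketch: the 128 MiB target is an assumption chosen to line up with a typical input partition size, and `target_file_count` is a hypothetical helper.

```python
import math

MIB = 1024 * 1024


def target_file_count(total_bytes, target_file_bytes=128 * MIB):
    """Number of output files needed to hit roughly the target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Example input: 10,000 files of 1 MiB each.
small_file_sizes = [1 * MIB] * 10_000
output_files = target_file_count(sum(small_file_sizes))
print(len(small_file_sizes), "->", output_files)

# In a PySpark compaction job, this count would typically be passed to
# df.repartition(output_files) before writing the compacted copy.
```

Going from 10,000 files to 79 reduces both the driver's listing and planning work and the per-file open cost on the executors.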

Practice Spark in Your Browser

Memorizing answers does not build the muscle memory interviewers test for. DataDriven runs a Spark-compatible execution engine in your browser that handles both PySpark and Scala syntax. Write joins, aggregations, and window functions against real datasets, then run mock interviews where an AI interviewer probes your understanding of the execution plan.

Worked Example: Reading a Spark Explain Plan

Interviewers often show you a physical plan and ask what it means. Here is a simple join plan and how to read it.

Physical plan

Inner join, two scans, one shuffle per side

== Physical Plan ==
*(3) SortMergeJoin [customer_id], [customer_id], Inner
:- *(1) Sort [customer_id ASC], false, 0
:  +- Exchange hashpartitioning(customer_id, 200)
:     +- *(1) Filter isnotnull(customer_id)
:        +- *(1) Scan parquet orders [customer_id, amount]
+- *(2) Sort [customer_id ASC], false, 0
   +- Exchange hashpartitioning(customer_id, 200)
      +- *(2) Filter isnotnull(customer_id)
         +- *(2) Scan parquet customers [customer_id, name]

# Read bottom-up:
# 1. Both tables are scanned with column pruning
# 2. Null keys are filtered (cannot match in inner join)
# 3. Exchange = shuffle by customer_id into 200 partitions
# 4. Both sides are sorted within each partition
# 5. SortMergeJoin merges the sorted streams
The key insight: two Exchange nodes mean two shuffles. If the customers table is small enough, a broadcast join would eliminate both shuffles. Mentioning this optimization in an interview shows you can read plans and reason about alternatives.
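
Counting shuffles in a plan can be automated with simple text matching, which is handy when comparing plans before and after a change. The helper below is a hypothetical sketch that works on the plan text as a string. It assumes shuffles appear as `Exchange` lines and that `BroadcastExchange` lines (which appear in broadcast join plans) should not be counted.

```python
plan_text = """\
== Physical Plan ==
*(3) SortMergeJoin [customer_id], [customer_id], Inner
:- *(1) Sort [customer_id ASC], false, 0
:  +- Exchange hashpartitioning(customer_id, 200)
:     +- *(1) Filter isnotnull(customer_id)
:        +- *(1) Scan parquet orders [customer_id, amount]
+- *(2) Sort [customer_id ASC], false, 0
   +- Exchange hashpartitioning(customer_id, 200)
      +- *(2) Filter isnotnull(customer_id)
         +- *(2) Scan parquet customers [customer_id, name]
"""


def count_shuffles(plan):
    """Count shuffle Exchange nodes, ignoring BroadcastExchange nodes."""
    return sum(
        1 for line in plan.splitlines()
        if "Exchange" in line and "BroadcastExchange" not in line
    )


shuffle_count = count_shuffles(plan_text)
uses_sort_merge = "SortMergeJoin" in plan_text
print(shuffle_count, uses_sort_merge)
```

If a broadcast hint or a larger threshold takes effect, the same check on the new plan should report zero shuffles for this join.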

Common Spark Interview Mistakes

Patterns that flag a candidate as someone who has read about Spark but not run it under pressure.

  • Explaining what Spark does without understanding why: knowing that groupByKey is bad without being able to explain the memory implications
  • Confusing shuffle partitions (spark.sql.shuffle.partitions, default 200) with input partitions (the number of files or blocks read)
  • Claiming Spark is always faster than single-node processing, ignoring the overhead of serialization and network transfer for small datasets
  • Not mentioning AQE when discussing join optimization, which is a standard feature in Spark 3.x
  • Treating RDDs and DataFrames as interchangeable without discussing Catalyst optimization and Tungsten execution
  • Ignoring executor memory overhead (spark.executor.memoryOverhead), which causes container kills on YARN/K8s
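
The memory overhead pitfall comes down to arithmetic: the container must hold the executor heap plus the overhead, and the cluster manager kills the container when the process exceeds that total. The sketch below assumes the commonly cited rule for JVM workloads (overhead is the larger of 10% of executor memory and 384 MiB) and ignores off-heap and PySpark memory settings; `container_request_mib` is a hypothetical helper.

```python
def container_request_mib(executor_memory_mib, overhead_factor=0.10,
                          min_overhead_mib=384):
    """Approximate container size: heap plus memory overhead, in MiB."""
    # Assumed rule: overhead = max(factor * executor memory, minimum).
    overhead = max(int(executor_memory_mib * overhead_factor), min_overhead_mib)
    return executor_memory_mib + overhead


large = container_request_mib(8192)   # 8 GiB heap
small = container_request_mib(2048)   # 2 GiB heap, minimum overhead applies
print(large, small)
```

When a container is killed for exceeding memory limits while the heap looks healthy, raising spark.executor.memoryOverhead is usually a better first move than raising spark.executor.memory.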

Spark Interview Questions FAQ

How deeply should I understand Spark internals for a data engineering interview?
Know the driver/executor model, how shuffles work, what stages and tasks are, and how to read the Spark UI. You do not need to understand the scheduler implementation or network protocol details. Focus on being able to diagnose performance problems and explain your reasoning.
Do companies still ask about RDDs?
Rarely as a primary topic, but RDD knowledge demonstrates depth. You should understand that DataFrames are built on top of RDDs, that RDD lineage provides fault tolerance, and that RDDs bypass Catalyst optimization. Some companies ask about RDD operations to test fundamental understanding.
Should I study Spark with Scala or Python for interviews?
PySpark accounts for roughly 70% of Spark API usage based on GitHub activity, with Scala at about 25%. Study whichever the job description uses. If neither is specified, PySpark is the safer choice because it is more common in data engineering roles. The concepts (DAGs, shuffles, partitioning) are language-agnostic.
What is the difference between Spark interview questions for data engineers versus data scientists?
Data engineering interviews focus on performance tuning, architecture, fault tolerance, and production reliability. Data science interviews focus on MLlib, feature engineering, and model serving. If you are interviewing for a DE role, prioritize shuffle optimization, join strategies, and memory management over ML pipelines.
What are the core concepts of Apache Spark interviewers always ask about?
Driver and executor architecture, the DAG (directed acyclic graph) and stage execution model, narrow vs wide transformations, the shuffle, lazy evaluation, the Catalyst optimizer, Adaptive Query Execution, and unified memory management. Master these eight and most Spark interview questions become straightforward.
What is lazy evaluation in Spark and why does it matter?
Spark builds a DAG of transformations but does not execute anything until an action (count, show, collect, write) is called. This lets the Catalyst optimizer rearrange and combine operations across the entire query before execution. Interviewers test this because it explains why a seemingly trivial action such as show() or count() can take minutes: the action triggers every transformation upstream of it.
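
Lazy evaluation can be demonstrated without Spark. The class below is a toy sketch, not Spark's implementation: `LazyDataset` is a hypothetical name, transformations only record a step in a plan, and nothing runs until the `collect()` action.

```python
class LazyDataset:
    """Records transformations and executes them only when an action is called."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps, not yet executed

    def map(self, fn):
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def collect(self):
        # The action: replay every recorded step over the source rows.
        rows = list(self._rows)
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * 2


dataset = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: only the plan exists
result = dataset.collect()
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Because the whole plan is known before anything runs, an optimizer has the chance to reorder or fuse steps, which is the role Catalyst plays in Spark.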
What are the main Spark transformations interviewers ask about?
Narrow: map, filter, flatMap, union, sample. Wide: groupByKey, reduceByKey, join, cogroup, repartition, distinct. Wide transformations create shuffle boundaries and stage breaks. Strong candidates can predict whether a transformation creates a shuffle just by looking at the code.
What is the difference between Apache Spark and Apache Flink?
Spark is batch-first with streaming layered on top via micro-batching (Structured Streaming). Flink is streaming-first with batch as a special case. Flink offers lower latency for true event-time streaming, while Spark is more mature for batch and ML workloads. Most data engineering interviews focus on Spark; Flink comes up for low-latency streaming roles.
