Apache Spark Interview Questions for Data Engineers (2026)

Q: How deeply should I understand Spark internals for a data engineering interview?

Know the driver/executor model, how shuffles work, what stages and tasks are, and how to read the Spark UI. You do not need to understand the scheduler implementation or network protocol details. Focus on being able to diagnose performance problems and explain your reasoning.

Q: Do companies still ask about RDDs?

Rarely as a primary topic, but RDD knowledge demonstrates depth. You should understand that DataFrames are built on top of RDDs, that RDD lineage provides fault tolerance, and that RDDs bypass Catalyst optimization. Some companies ask about RDD operations to test fundamental understanding.

Q: Should I study Spark with Scala or Python for interviews?

PySpark accounts for roughly 70% of Spark API usage based on GitHub activity, with Scala at about 25%. Study whichever the job description uses. If neither is specified, PySpark is the safer choice because it is more common in data engineering roles. The concepts (DAGs, shuffles, partitioning) are language-agnostic.

Q: What is the difference between Spark interview questions for data engineers versus data scientists?

Data engineering interviews focus on performance tuning, architecture, fault tolerance, and production reliability. Data science interviews focus on MLlib, feature engineering, and model serving. If you are interviewing for a DE role, prioritize shuffle optimization, join strategies, and memory management over ML pipelines.

Q: What are the core concepts of Apache Spark interviewers always ask about?

Driver and executor architecture, the DAG (directed acyclic graph) and stage execution model, narrow vs wide transformations, the shuffle, lazy evaluation, the Catalyst optimizer, Adaptive Query Execution, and unified memory management. Master these eight and most Spark interview questions become straightforward.

Q: What is lazy evaluation in Spark and why does it matter?

Spark builds a DAG of transformations but does not execute anything until an action (count, show, collect, write) is called. This lets the Catalyst optimizer rearrange and combine operations across the entire query before execution. Interviewers test this because it explains why a long chain of transformations returns instantly while the first action, even a single show(), pays for whatever upstream work it needs.
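
The deferred-execution idea can be sketched without Spark at all. The toy class below is a hypothetical stand-in (LazyDataset is not a Spark API): transformations only record a plan, and the work happens when the action collect() is called.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: records transformations, runs them on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) pairs, not yet executed

    def map(self, fn):
        # Transformation: returns a new dataset with a longer plan, does no work.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._rows, self._plan + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []

def double(x):
    calls.append(x)  # side effect lets us observe when work actually happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = ds.collect()             # the action triggers execution
```

Because the whole plan is known before anything runs, an optimizer has the chance to reorder or fuse steps, which is the role Catalyst plays in Spark.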

Q: What are the main Spark transformations interviewers ask about?

Narrow: map, filter, flatMap, union, sample. Wide: groupByKey, reduceByKey, join, cogroup, repartition, distinct. Wide transformations create shuffle boundaries and stage breaks. Strong candidates can predict whether a transformation creates a shuffle just by looking at the code.

Q: What is the difference between Apache Spark and Apache Flink?

Spark is batch-first with streaming layered on top via micro-batching (Structured Streaming). Flink is streaming-first with batch as a special case. Flink offers lower latency for true event-time streaming, while Spark is more mature for batch and ML workloads. Most data engineering interviews focus on Spark; Flink comes up for low-latency streaming roles.

Spark processes over 100 PB daily at companies like Netflix, Uber, and Apple. Interview questions focus on architecture, execution model, and performance tuning. Knowing how to write a query is table stakes. Knowing why it runs slowly is what gets you the offer.

TL;DR: Spark Interview Questions

Apache Spark is a distributed processing engine built on a driver/executor architecture that compiles transformations into a DAG and executes them lazily. Data engineering interviews focus on six areas: driver and executor architecture, DAG and stages, narrow vs wide transformations and shuffles, join strategies, Catalyst and Adaptive Query Execution, and memory tuning.

For most roles you need to walk through what happens when spark-submit runs, explain the difference between narrow and wide transformations, compare broadcast hash join vs sort-merge join, and diagnose a job that OOMs in production. Senior roles add Catalyst plan reading, AQE configuration, dynamic resource allocation, and integration with Delta Lake or Iceberg.

Spark Join Strategies: Decision Matrix

Knowing when Spark picks each join strategy is one of the most common Spark interview questions. Memorize this table.

Join strategy	When Spark picks it	Shuffle?	Best for
Broadcast hash join	One side < spark.sql.autoBroadcastJoinThreshold (default 10MB)	None	Joining a large fact with a small dim table
Sort-merge join	Default for large-to-large equi-joins; one or both sides too big to broadcast	Both sides	Two large tables on the same key
Shuffle hash join	One side fits in memory after partitioning; AQE may pick this when it would be cheaper than sort-merge	Both sides	Medium-large + medium-small joins
Broadcast nested loop	Non-equi joins (range/inequality) where one side is small	None	BETWEEN, LIKE, custom predicate joins
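
As a memory aid, the table can be reduced to a small decision function. This is a deliberately simplified sketch, not Spark's planner: pick_join_strategy is a hypothetical helper, it ignores hints and runtime statistics, and it leaves out shuffle hash join because that choice depends on per-partition sizes.

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold (10 MB)

def pick_join_strategy(left_bytes, right_bytes, equi_join=True,
                       threshold=BROADCAST_THRESHOLD):
    """Simplified reading of the decision matrix; assumes size estimates are accurate."""
    if not equi_join:
        # The matrix only covers the non-equi case where one side is small.
        return "broadcast nested loop"
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast hash join"
    return "sort-merge join"

GB = 1024 ** 3
MB = 1024 ** 2
fact_to_dim = pick_join_strategy(50 * GB, 5 * MB)
large_to_large = pick_join_strategy(50 * GB, 40 * GB)
range_join = pick_join_strategy(50 * GB, 5 * MB, equi_join=False)
```

In an interview, the useful follow-up is what happens when the size estimate is wrong, which is where AQE's runtime re-planning comes in.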

What Spark Interviewers Expect

Spark interview questions test three layers of understanding. The first is whether you know the architecture: drivers, executors, stages, and tasks. The second is whether you can write efficient transformations. The third is whether you can diagnose and fix production performance problems.

Junior candidates should explain the execution model clearly: how a DAG is built, what triggers execution, and why shuffles are expensive.

Mid-level candidates need to compare join strategies, explain AQE, and discuss partitioning choices. You should be comfortable reading a Spark physical plan.

Senior candidates are expected to design Spark applications for reliability and efficiency at scale. This includes memory tuning, dynamic allocation, fault tolerance tradeoffs, and integration with storage layers like Delta Lake or Iceberg.

Spark Core Concepts Interviewers Test

Six concepts surface in nearly every Spark interview. If you can explain each one in your own words and connect it to a real performance problem, most questions become straightforward.

Driver and Executor Architecture: the driver coordinates execution, parses your code, builds the DAG, and schedules tasks. Executors run those tasks and store data. Interviewers want you to explain what happens when the driver runs out of memory (usually a collect() or broadcast gone wrong) versus when an executor OOMs (usually a skewed partition or too much cached data).
DAG and Stage Execution: Spark builds a Directed Acyclic Graph of transformations. Shuffle boundaries create stage breaks. Within a stage, tasks run in parallel on partitions. You need this to read the Spark UI and diagnose why a job is slow.
Narrow vs Wide Transformations: narrow transformations (map, filter, union) process each partition independently. Wide transformations (groupBy, join, repartition) require data movement across the cluster. A wide transformation normally creates a shuffle, which writes intermediate data to disk and transfers it over the network; a join that Spark executes as a broadcast join is the main exception.
Catalyst Optimizer: Catalyst converts logical plans into optimized physical plans. It applies 100+ optimization rules including predicate pushdown, column pruning, constant folding, and join reordering. Understanding Catalyst explains why DataFrame operations outperform equivalent RDD code and why UDFs block optimization.
Adaptive Query Execution (AQE): AQE re-optimizes the query plan at runtime based on actual data statistics. It coalesces small partitions after shuffles, switches join strategies when one side is smaller than expected, and optimizes skew joins. Available in Spark 3.0+ and enabled by default in 3.2+.
Memory Management: about 60% of the executor heap, after a fixed 300 MB reserve, goes to the unified pool (spark.memory.fraction = 0.6), which is shared between execution (shuffles, joins, sorts) and storage (cache). The unified model lets one borrow from the other, and spark.memory.storageFraction (default 0.5) sets the share of the pool in which cached data is protected from eviction. Interviewers test whether you can tune spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction to solve specific problems.
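
The memory split in the last item is easy to check with arithmetic. The sketch below assumes the default settings (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); unified_memory_mib is a hypothetical helper name, and the 4 GiB heap is only an example.

```python
RESERVED_MIB = 300  # fixed amount Spark sets aside from the heap before the split

def unified_memory_mib(executor_heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate on-heap pool sizes for one executor, in MiB."""
    usable = executor_heap_mib - RESERVED_MIB
    unified = usable * memory_fraction        # execution + storage, shared
    storage = unified * storage_fraction      # cached data protected from eviction
    return {
        "unified": unified,
        "storage_protected": storage,
        "user_memory": usable - unified,      # user data structures, UDF objects
    }

pools = unified_memory_mib(4096)  # spark.executor.memory = 4g
```

For a 4 GiB heap this gives roughly 2.2 GiB of unified memory, which is why an executor can spill or fail well before its nominal memory setting is reached.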

Spark Interview Questions with Guidance

Ten questions that come up in real Spark loops, each with what a strong answer covers.

Walk through what happens when you submit a Spark application. Cover every component from spark-submit to task completion.

spark-submit sends the application to the cluster manager (YARN, K8s, or standalone). In cluster mode the cluster manager allocates a container for the driver; in client mode the driver runs in the submitting process. The driver initializes SparkContext, negotiates executor containers, and sends the application JAR to executors. When an action triggers execution, the driver builds a DAG, splits it into stages at shuffle boundaries, and creates tasks (one per partition per stage). The scheduler sends tasks to executors. Executors run the tasks, read/write shuffle data, and report results back to the driver.
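
The "split into stages at shuffle boundaries, one task per partition per stage" step can be sketched in a few lines. This is a toy model under loud assumptions: the job is a linear chain (real DAGs branch, for example at joins), the operation names and partition counts are made up, and the wide operation is attributed to the stage that reads the shuffle.

```python
WIDE_OPS = {"groupBy", "join", "repartition", "distinct"}

def split_into_stages(ops):
    """Cut a linear chain of operations into stages; a wide op starts a new stage."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

def total_tasks(partitions_per_stage):
    # One task per partition per stage.
    return sum(partitions_per_stage)

stages = split_into_stages(["read", "filter", "groupBy", "map", "join", "write"])
tasks = total_tasks([400, 200, 200])  # 400 input splits, then 200 shuffle partitions twice
```

Two wide operations produce three stages, which matches what the Spark UI would show for a job of this shape.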

Explain the difference between narrow and wide dependencies with concrete examples. Why does this distinction matter?

Narrow dependencies mean each parent partition is used by at most one child partition: map(), filter(), union(). Wide dependencies mean multiple child partitions depend on a single parent partition: groupByKey(), reduceByKey(), join(). This matters because wide dependencies create shuffles, which are among the most expensive operations in Spark: they force data serialization, disk writes, and network transfer. A strong answer connects this to stage boundaries in the DAG and explains how minimizing shuffles improves performance.
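
The difference shows up clearly when partitions are modeled as plain lists. In this sketch (hypothetical helpers, integer keys so the partitioning is deterministic), the narrow operation touches one input partition per output partition, while the grouping has to pull records from several input partitions.

```python
from collections import defaultdict

def narrow_map(partitions, fn):
    # Each output partition is computed from exactly one input partition.
    return [[fn(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    # Shuffle: any input partition may contribute to any output partition.
    out = [defaultdict(list) for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]  # which inputs fed each output
    for idx, part in enumerate(partitions):
        for key, value in part:
            target = key % num_out
            out[target][key].append(value)
            sources[target].add(idx)
    return [dict(d) for d in out], sources

parts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")], [(2, "e"), (4, "f")]]
grouped, sources = wide_group_by_key(parts, num_out=2)
mapped = narrow_map([[1, 2], [3]], lambda x: x + 1)
```

Here input partition 0 feeds both output partitions, and each output partition reads from two inputs. That cross-partition traffic is the shuffle.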

What are the different join strategies Spark can use? How does it decide which one to pick?

Broadcast hash join: sends the small table to all executors, no shuffle needed. Sort-merge join: both sides are shuffled and sorted by the join key, then merged. Shuffle hash join: both sides are shuffled by join key, the smaller side builds a hash table per partition. Broadcast nested loop join: fallback for non-equi joins with a small table. Spark decides based on estimated table sizes, join type, and hints (for example BROADCAST or MERGE). spark.sql.autoBroadcastJoinThreshold controls the broadcast cutoff. AQE can switch strategies at runtime if actual sizes differ from estimates.

A Spark job runs fine on small data but OOMs on production data. Walk through your debugging approach.

First, check the Spark UI to identify which stage fails. If the driver OOMs, look for collect(), broadcast of large tables, or large accumulator values. If an executor OOMs, check for skewed partitions (one task processing far more data), insufficient executor memory, or excessive caching. Fixes include increasing executor memory, repartitioning skewed data, salting join keys, replacing collect() with write operations, or reducing cache usage. A strong answer mentions checking GC logs and the difference between on-heap and off-heap memory.
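
Salting is the fix candidates most often name but cannot explain, so a small simulation helps. The sketch below is a toy partitioner, not Spark code: it adds a rotating salt to the key hash so the spread is deterministic, and the record counts are invented. In a real join the salt is added as a column on the skewed side and the other side is replicated once per salt value.

```python
import zlib
from collections import Counter

def partition_sizes(records, num_partitions, num_salts=1):
    """Count records per partition. With num_salts > 1, a rotating salt is
    added to the key hash so one hot key spreads over several partitions."""
    sizes = Counter()
    for i, key in enumerate(records):
        salt = i % num_salts
        sizes[(zlib.crc32(key.encode()) + salt) % num_partitions] += 1
    return sizes

# One hot key with 1,000 records plus ten ordinary keys with 10 records each.
records = ["hot"] * 1000 + [f"k{n}" for n in range(10) for _ in range(10)]

unsalted = partition_sizes(records, num_partitions=8)
salted = partition_sizes(records, num_partitions=8, num_salts=8)
skew_before = max(unsalted.values())  # the largest task before salting
skew_after = max(salted.values())     # the largest task after salting
```

Without salting one partition holds every "hot" record. With eight salts the same records are spread evenly, so the largest task shrinks by several times.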

Explain Spark SQL and how it relates to the DataFrame API. Can they be used together?

Spark SQL lets you run SQL queries on registered temporary views. The DataFrame API provides programmatic access to the same operations. Both compile to the same logical plan and go through the Catalyst optimizer. You can freely mix them: create a temp view from a DataFrame, query it with SQL, and convert the result back to a DataFrame. A strong answer notes that Spark SQL supports a large ANSI-compliant subset of SQL, including window functions, CTEs, and subqueries, and that equivalent SQL and DataFrame queries perform the same because they produce the same plan.

What is speculative execution in Spark? When does it help and when does it hurt?

Speculative execution launches duplicate copies of slow tasks on other executors. If the duplicate finishes first, Spark kills the original. It helps when slowness is caused by a bad node (disk failure, network congestion). It hurts when slowness is caused by data skew, because the duplicate task gets the same skewed partition and is just as slow. It also wastes resources by running redundant tasks. A strong answer mentions spark.speculation.quantile and spark.speculation.multiplier as tuning knobs.
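
How the two knobs interact can be sketched as a small function. This is a simplified model of the rule, not Spark's scheduler code: speculation_candidates is a hypothetical helper, and the quantile, multiplier, and task durations below are illustrative values rather than claimed defaults.

```python
from statistics import median

def speculation_candidates(finished_secs, running_secs, total_tasks,
                           quantile, multiplier):
    """Return indices of running tasks that would get a speculative copy."""
    # Nothing is speculated until `quantile` of the stage's tasks have finished.
    if len(finished_secs) < quantile * total_tasks:
        return []
    # A running task is "slow" if it exceeds multiplier x the median finished time.
    threshold = multiplier * median(finished_secs)
    return [i for i, elapsed in enumerate(running_secs) if elapsed > threshold]

slow = speculation_candidates(
    finished_secs=[10, 11, 12, 10, 11, 12, 10, 11],
    running_secs=[14, 45],
    total_tasks=10,
    quantile=0.75,
    multiplier=1.5,
)
too_early = speculation_candidates([10, 11, 12], [45], 10, 0.75, 1.5)
```

The task at 45 seconds is flagged and the one at 14 seconds is not. If that slow task is slow because of a skewed partition, the speculative copy reads the same data and gains nothing.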

How does Spark handle fault tolerance? What happens when an executor dies mid-job?

Spark tracks the lineage (DAG) of every RDD/DataFrame. When an executor dies, the driver reassigns its tasks to surviving executors. Those executors recompute the lost partitions by replaying the lineage from the most recent shuffle output, cached data, or checkpoint. Shuffle files written by earlier stages that are still reachable are reused, so only the lost tasks of the current stage re-run. If the shuffle files are lost too (they lived on the dead executor with no external shuffle service to serve them, or the whole node died), Spark re-runs the upstream tasks that produced them. A strong answer mentions that caching provides a shortcut in the lineage but cached data on the dead executor is lost.

What is dynamic resource allocation and when should you enable it?

Dynamic allocation lets Spark add and release executors based on workload. When tasks are queued, Spark requests more executors. When executors are idle, Spark releases them. This is useful for long-running jobs with variable load (ETL pipelines that read many small tables then join one large table). It saves cluster resources by not holding executors you do not need. A strong answer mentions the shuffle-data requirement: executors that hold shuffle files can only be released safely if an external shuffle service serves those files or, in Spark 3.0+, if shuffle tracking (spark.dynamicAllocation.shuffleTracking.enabled) is turned on.
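
A configuration for this might look like the following sketch. The property names are real Spark settings; the executor counts are illustrative, and to_submit_args is a hypothetical helper that only formats the dict as spark-submit arguments.

```python
def to_submit_args(conf):
    """Flatten a config dict into spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",    # illustrative value
    "spark.dynamicAllocation.maxExecutors": "50",   # illustrative value
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Keeps shuffle files reachable after an executor is released.
    "spark.shuffle.service.enabled": "true",
    # Alternative where no external shuffle service is available:
    # "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}

args = to_submit_args(dynamic_allocation_conf)
```

The min and max bounds matter in practice: without a ceiling, one job with a large backlog can claim most of a shared cluster.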

Explain the difference between client mode and cluster mode in Spark. When would you choose each?

In client mode, the driver runs on the machine that submitted the job. In cluster mode, the driver runs inside the cluster on a worker node. Client mode is useful for interactive development (notebooks, spark-shell) because you can see stdout directly. Cluster mode is better for production because the driver benefits from cluster resources and network proximity to executors. A strong answer notes that in client mode, the driver's machine becomes a single point of failure and a network bottleneck.

How would you optimize a Spark job that reads from a data lake with thousands of small files?

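Small files hurt in two places. Each file adds open and scheduling overhead: sc.textFile creates at least one partition per file, while the DataFrame reader packs small files together up to spark.sql.files.maxPartitionBytes, charging each file spark.sql.files.openCostInBytes. The driver also has to list and plan every file. Solutions: tune spark.sql.files.maxPartitionBytes so more files are packed into each task, parallelize file listing (spark.sql.sources.parallelPartitionDiscovery.threshold), compact small files into larger ones as a preprocessing step, or use Delta Lake / Iceberg, which provide compaction operations (OPTIMIZE in Delta Lake, rewrite_data_files in Iceberg). A strong answer mentions the driver side explicitly: listing and planning thousands of files can cause driver OOM or long planning times.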
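
The compaction step can be planned with a simple greedy grouping. This sketch only decides which files would be rewritten together; plan_compaction is a hypothetical helper, the 128 MB target and the file sizes are assumptions, and the actual rewrite would be done by a Spark job or a table-format maintenance command.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group consecutive small files into bins of at most target_mb."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            bins.append(current)          # close the bin before it overflows
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [4] * 1000  # 1,000 files of 4 MB each (made-up sizes)
bins = plan_compaction(small_files)
files_after = len(bins)
```

A thousand 4 MB files collapse into a few dozen output files, which cuts both the task count and the amount of file metadata the driver has to plan.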

Worked Example: Reading a Spark Explain Plan

== Physical Plan ==
*(5) SortMergeJoin [customer_id], [customer_id], Inner
:- *(2) Sort [customer_id ASC], false, 0
:  +- Exchange hashpartitioning(customer_id, 200)
:     +- *(1) Filter isnotnull(customer_id)
:        +- *(1) Scan parquet orders [customer_id, amount]
+- *(4) Sort [customer_id ASC], false, 0
   +- Exchange hashpartitioning(customer_id, 200)
      +- *(3) Filter isnotnull(customer_id)
         +- *(3) Scan parquet customers [customer_id, name]

# Read bottom-up:
# 1. Both tables are scanned with column pruning
# 2. Null keys are filtered (cannot match in inner join)
# 3. Exchange = shuffle by customer_id into 200 partitions
# 4. Both sides are sorted within each partition
# 5. SortMergeJoin merges the sorted streams
# The *(n) prefix is a whole-stage codegen id; an Exchange always separates two codegen stages

The key insight: two Exchange nodes mean two shuffles. If the customers table is small enough, a broadcast join would eliminate both shuffles. Mentioning this optimization in an interview shows you can read plans and reason about alternatives.
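
Counting shuffles in a plan is mechanical enough to script. The sketch below parses an abbreviated plan of the same shape (the Filter lines are dropped for brevity) with a hypothetical operators helper. It relies only on the tree-drawing characters and the *(n) prefix, so treat it as an illustration rather than a general plan parser.

```python
PLAN = """\
*(5) SortMergeJoin [customer_id], [customer_id], Inner
:- *(2) Sort [customer_id ASC], false, 0
:  +- Exchange hashpartitioning(customer_id, 200)
:     +- *(1) Scan parquet orders [customer_id, amount]
+- *(4) Sort [customer_id ASC], false, 0
   +- Exchange hashpartitioning(customer_id, 200)
      +- *(3) Scan parquet customers [customer_id, name]
"""

def operators(plan_text):
    """Extract the operator name from each line of an explain() plan."""
    names = []
    for line in plan_text.splitlines():
        stripped = line.lstrip(" :+-")             # drop tree-drawing characters
        if stripped.startswith("*("):
            stripped = stripped.split(") ", 1)[1]  # drop the codegen id prefix
        if stripped:
            names.append(stripped.split()[0])
    return names

ops = operators(PLAN)
shuffle_count = ops.count("Exchange")  # exact match, so BroadcastExchange is not counted
has_broadcast = any(op.startswith("Broadcast") for op in ops)
```

Two Exchange operators and no broadcast operator confirm the reading above: both sides are shuffled, so a broadcast of the smaller side is the first optimization to consider.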

Common Spark Interview Mistakes

Patterns that flag a candidate as someone who has read about Spark but not run it under load.

Explaining what Spark does without understanding why: knowing that groupByKey is bad without being able to explain the memory implications
Confusing shuffle partitions (spark.sql.shuffle.partitions, default 200) with input partitions (the number of files or blocks read)
Claiming Spark is always faster than single-node processing, ignoring the overhead of serialization and network transfer for small datasets
Not mentioning AQE when discussing join optimization, which is a standard feature in Spark 3.x
Treating RDDs and DataFrames as interchangeable without discussing Catalyst optimization and Tungsten execution
Ignoring executor memory overhead (spark.executor.memoryOverhead), which causes container kills on YARN/K8s
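
The last mistake comes down to arithmetic: the container the cluster manager enforces is larger than spark.executor.memory. The sketch below assumes the default overhead rule (10% of executor memory, with a 384 MiB floor) and ignores off-heap and PySpark-specific memory settings; container_request_mib is a hypothetical helper.

```python
MIN_OVERHEAD_MIB = 384   # floor for the default overhead
OVERHEAD_FACTOR = 0.10   # default share of executor memory added as overhead

def container_request_mib(executor_memory_mib, overhead_mib=None):
    """Approximate container size requested per executor on YARN/K8s, in MiB."""
    if overhead_mib is None:
        # Used when spark.executor.memoryOverhead is not set explicitly.
        overhead_mib = max(int(executor_memory_mib * OVERHEAD_FACTOR), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead_mib

large = container_request_mib(8192)          # spark.executor.memory = 8g
small = container_request_mib(2048)          # the 384 MiB floor applies
explicit = container_request_mib(8192, 2048) # overhead set by hand
```

When native memory use (Python workers, Arrow buffers, network buffers) exceeds the overhead, the container is killed even though the JVM heap never filled, which is why raising spark.executor.memory alone often does not fix it.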
