Data Engineering Interview Prep

Apache Spark Interview Questions for Data Engineers (2026)

Apache Spark is the dominant engine for large-scale batch data processing. Interview questions focus on architecture, the execution model, and performance tuning. Knowing how to write a query is table stakes; knowing why it runs slowly is what gets you the offer.

Covers Spark 3.x features including Adaptive Query Execution, dynamic partition pruning, and the unified memory model.

What Interviewers Expect

Spark interview questions test three layers of understanding. The first is whether you know the architecture: drivers, executors, stages, and tasks. The second is whether you can write efficient transformations. The third is whether you can diagnose and fix production performance problems.

Junior candidates should explain the execution model clearly: how a DAG is built, what triggers execution, and why shuffles are expensive.

Mid-level candidates need to compare join strategies, explain AQE, and discuss partitioning choices. You should be comfortable reading a Spark physical plan.

Senior candidates are expected to design Spark applications for reliability and efficiency at scale. This includes memory tuning, dynamic allocation, fault tolerance tradeoffs, and integration with storage layers like Delta Lake or Iceberg.

Core Concepts Interviewers Test

Driver and Executor Architecture

The driver coordinates execution: it parses your code, builds the DAG, and schedules tasks. Executors run those tasks and store data. Interviewers want you to explain what happens when the driver runs out of memory (usually a collect() or broadcast gone wrong) versus when an executor OOMs (usually a skewed partition or too much cached data).

DAG and Stage Execution

Spark builds a Directed Acyclic Graph of transformations. Shuffle boundaries create stage breaks. Within a stage, tasks run in parallel on partitions. You need this to read the Spark UI and diagnose why a job is slow.

Narrow vs Wide Transformations

Narrow transformations (map, filter, union) process each partition independently. Wide transformations (groupBy, join, repartition) require data movement across the cluster. Every wide transformation creates a shuffle, which writes intermediate data to disk and transfers it over the network.
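The distinction can be modeled in a few lines of plain Python (an illustrative sketch, not Spark code): a narrow transformation processes each partition in isolation, while a wide one must regroup rows by key across every partition.

```python
# Illustrative model of narrow vs wide transformations (plain Python, not Spark).
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Narrow: map() touches each partition independently -- no data movement.
mapped = [[x * 10 for x in part] for part in partitions]

# Wide: a groupBy-style shuffle regroups rows by key across ALL partitions.
def shuffle_by_key(parts, key, num_out):
    out = [[] for _ in range(num_out)]
    for part in parts:
        for row in part:
            # In real Spark this step means serialization, disk, and network.
            out[hash(key(row)) % num_out].append(row)
    return out

shuffled = shuffle_by_key(partitions, key=lambda x: x % 2, num_out=2)
# Every input partition contributed rows to multiple output partitions.
```

Notice that every input partition feeds every output partition; that all-to-all movement is exactly what makes a shuffle expensive.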

Catalyst Optimizer

Catalyst converts logical plans into optimized physical plans. It applies predicate pushdown, column pruning, constant folding, and join reordering. Understanding Catalyst explains why DataFrame operations outperform equivalent RDD code and why UDFs block optimization.

Adaptive Query Execution (AQE)

AQE re-optimizes the query plan at runtime based on actual data statistics. It coalesces small partitions after shuffles, switches join strategies when one side is smaller than expected, and optimizes skew joins. Available in Spark 3.0+ and enabled by default in 3.2+.
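In configuration terms, the relevant AQE knobs look like this (spark-defaults.conf style; the advisory size is illustrative, not a recommendation):

```
spark.sql.adaptive.enabled                        true
spark.sql.adaptive.coalescePartitions.enabled     true
spark.sql.adaptive.skewJoin.enabled               true
spark.sql.adaptive.advisoryPartitionSizeInBytes   64m
```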

Memory Management

Spark divides executor memory into execution memory (shuffles, joins, sorts) and storage memory (cache). The unified memory model lets one borrow from the other. Interviewers test whether you can tune spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction to solve specific problems.
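The arithmetic behind the unified model is worth knowing cold. A minimal sketch in plain Python, using the documented defaults (300 MB reserved, spark.memory.fraction 0.6, spark.memory.storageFraction 0.5):

```python
# Sketch of Spark's unified memory arithmetic (defaults as of Spark 3.x).
RESERVED_MB = 300  # fixed reserved memory off the top of the heap

def unified_memory(executor_memory_mb, fraction=0.6, storage_fraction=0.5):
    usable = (executor_memory_mb - RESERVED_MB) * fraction
    storage = usable * storage_fraction  # soft cap: execution can borrow from it
    execution = usable - storage
    return {"unified_mb": usable, "storage_mb": storage, "execution_mb": execution}

# A 4 GB executor leaves roughly 2.2 GB of unified memory, split evenly by default.
print(unified_memory(4096))
```

The storage/execution boundary is soft: execution can evict cached blocks to borrow storage memory, but cached blocks cannot evict running execution memory.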

Spark Interview Questions with Guidance

Q1

Walk through what happens when you submit a Spark application. Cover every component from spark-submit to task completion.

A strong answer includes:

spark-submit sends the application to the cluster manager (YARN, K8s, or standalone). The cluster manager allocates a container for the driver. The driver initializes SparkContext, negotiates executor containers, and sends the application JAR to executors. When an action triggers execution, the driver builds a DAG, splits it into stages at shuffle boundaries, and creates tasks (one per partition per stage). The scheduler sends tasks to executors. Executors run the tasks, read/write shuffle data, and report results back to the driver.

Q2

Explain the difference between narrow and wide dependencies with concrete examples. Why does this distinction matter?

A strong answer includes:

Narrow dependencies mean each parent partition is used by at most one child partition: map(), filter(), union(). Wide dependencies mean multiple child partitions depend on a single parent: groupByKey(), reduceByKey(), join(). This matters because wide dependencies create shuffle boundaries, which are the most expensive operation in Spark. They force data serialization, disk writes, and network transfer. A strong answer connects this to stage boundaries in the DAG and explains how minimizing shuffles improves performance.

Q3

What are the different join strategies Spark can use? How does it decide which one to pick?

A strong answer includes:

Broadcast hash join: sends the small table to all executors, no shuffle needed. Sort-merge join: both sides are shuffled and sorted by the join key, then merged. Shuffle hash join: both sides are shuffled by join key, the smaller side builds a hash table per partition. Broadcast nested loop join: fallback for non-equi joins with a small table. Spark decides based on table sizes, join type, and hints. autoBroadcastJoinThreshold controls the broadcast cutoff. AQE can switch strategies at runtime if actual sizes differ from estimates.
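The broadcast hash join idea can be sketched in plain Python (illustrative only, not Spark internals): ship the small side to every task as a hash table, then probe it locally with no shuffle of the large side.

```python
# Illustrative broadcast hash join: the small side becomes a local hash table.
customers = {1: "Ada", 2: "Grace"}        # small table, "broadcast" to every task
order_partitions = [                      # large table stays partitioned in place
    [(1, 99.0), (2, 15.5)],
    [(1, 42.0), (3, 7.0)],                # customer 3 has no match (inner join)
]

def broadcast_join(partition, small):
    # Each task probes the broadcast table locally -- no shuffle of either side.
    return [(cid, amt, small[cid]) for cid, amt in partition if cid in small]

joined = [row for part in order_partitions
          for row in broadcast_join(part, customers)]
```

A sort-merge join, by contrast, would have to shuffle and sort both sides before merging, which is why Spark prefers broadcasting whenever one side fits under the threshold.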

Q4

A Spark job runs fine on small data but OOMs on production data. Walk through your debugging approach.

A strong answer includes:

First, check the Spark UI to identify which stage fails. If the driver OOMs, look for collect(), broadcast of large tables, or large accumulator values. If an executor OOMs, check for skewed partitions (one task processing far more data), insufficient executor memory, or excessive caching. Fixes include increasing executor memory, repartitioning skewed data, salting join keys, replacing collect() with write operations, or reducing cache usage. A strong answer mentions checking GC logs and the difference between on-heap and off-heap memory.
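Salting is the fix interviewers most want spelled out. A hedged sketch in plain Python (the bucket count and helper names are illustrative, not a library API):

```python
import random

# Sketch of key salting: spread one hot join key across SALTS buckets.
SALTS = 4

def salt_key(key):
    # Large (skewed) side: append a random salt so rows for one hot key
    # land in SALTS different shuffle partitions instead of one.
    return (key, random.randrange(SALTS))

def explode_small_side(rows):
    # Small side: replicate each row once per salt so every bucket still matches.
    return [((key, s), value) for key, value in rows for s in range(SALTS)]

small = explode_small_side([("hot_key", "dim_value")])
# One dimension row became SALTS join candidates, one per salted partition.
```

The cost is replicating the small side SALTS times; the payoff is that no single task carries the entire hot key.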

Q5

Explain Spark SQL and how it relates to the DataFrame API. Can they be used together?

A strong answer includes:

Spark SQL lets you run SQL queries on registered temporary views. The DataFrame API provides programmatic access to the same operations. Both compile to the same logical plan and go through the Catalyst optimizer. You can freely mix them: create a temp view from a DataFrame, query it with SQL, and convert the result back to a DataFrame. A strong answer notes that Spark SQL supports ANSI SQL, window functions, CTEs, and subqueries, and that performance is identical between SQL and DataFrame approaches.

Q6

What is speculative execution in Spark? When does it help and when does it hurt?

A strong answer includes:

Speculative execution launches duplicate copies of slow tasks on other executors. If the duplicate finishes first, Spark kills the original. It helps when slowness is caused by a bad node (disk failure, network congestion). It hurts when slowness is caused by data skew, because the duplicate task gets the same skewed partition and is just as slow. It also wastes resources by running redundant tasks. A strong answer mentions spark.speculation.quantile and spark.speculation.multiplier as tuning knobs.
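In configuration terms (spark-defaults.conf style; quantile and multiplier are shown at their documented defaults):

```
spark.speculation              true
spark.speculation.quantile     0.75   # fraction of tasks that must finish first
spark.speculation.multiplier   1.5    # how much slower than the median counts as slow
```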

Q7

How does Spark handle fault tolerance? What happens when an executor dies mid-job?

A strong answer includes:

Spark tracks the lineage (DAG) of every RDD/DataFrame. When an executor dies, the driver reassigns its tasks to surviving executors. Those executors recompute the lost partitions by replaying the lineage from the last shuffle checkpoint. Shuffle data written to disk by earlier stages is preserved; only the current stage re-executes. If shuffle files are also lost (the node died), Spark re-runs the upstream stages. A strong answer mentions that caching provides a shortcut in the lineage but cached data on the dead executor is lost.

Q8

What is dynamic resource allocation and when should you enable it?

A strong answer includes:

Dynamic allocation lets Spark add and release executors based on workload. When tasks are queued, Spark requests more executors. When executors are idle, Spark releases them. This is useful for long-running jobs with variable load (ETL pipelines that read many small tables then join one large table). It saves cluster resources by not holding executors you do not need. A strong answer mentions the external shuffle service requirement: without it, Spark cannot release executors that hold shuffle data.
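A typical configuration sketch (spark-defaults.conf style; the executor counts and timeout are illustrative):

```
spark.dynamicAllocation.enabled                true
spark.dynamicAllocation.minExecutors           2
spark.dynamicAllocation.maxExecutors           50
spark.dynamicAllocation.executorIdleTimeout    60s
spark.shuffle.service.enabled                  true   # external shuffle service
```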

Q9

Explain the difference between client mode and cluster mode in Spark. When would you choose each?

A strong answer includes:

In client mode, the driver runs on the machine that submitted the job. In cluster mode, the driver runs inside the cluster on a worker node. Client mode is useful for interactive development (notebooks, spark-shell) because you can see stdout directly. Cluster mode is better for production because the driver benefits from cluster resources and network proximity to executors. A strong answer notes that in client mode, the driver's machine becomes a single point of failure and a network bottleneck.

Q10

How would you optimize a Spark job that reads from a data lake with thousands of small files?

A strong answer includes:

Small files create excessive task overhead because Spark creates one task per file by default. Solutions: raise spark.sql.files.maxPartitionBytes so Spark packs more small files into each input partition, enable parallel file listing via spark.sql.sources.parallelPartitionDiscovery.threshold, compact small files into larger ones as a preprocessing step, or use Delta Lake / Iceberg, which handle compaction automatically. A strong answer mentions that the driver also suffers because it must list and plan thousands of files, which can cause driver OOM or long planning times.
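Spark's bin-packing of small files into input partitions can be sketched in plain Python (a simplified model of the real heuristic, not Spark source):

```python
# Simplified model of how Spark packs files into input partitions.
MAX_PARTITION_BYTES = 128 * 1024 * 1024   # spark.sql.files.maxPartitionBytes
OPEN_COST_BYTES = 4 * 1024 * 1024         # spark.sql.files.openCostInBytes

def pack_files(file_sizes):
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + OPEN_COST_BYTES     # each file pays a fixed "open" penalty
        if current and current_bytes + cost > MAX_PARTITION_BYTES:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

# 1000 files of 1 MB pack into far fewer input partitions than 1000.
parts = pack_files([1024 * 1024] * 1000)
```

The open-cost penalty is why thousands of tiny files still produce many partitions even when their total size is small; compaction attacks the problem at the source.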

Worked Example: Reading a Spark Explain Plan

Interviewers often show you a physical plan and ask what it means. Here is a simple join plan and how to read it.

== Physical Plan ==
*(3) SortMergeJoin [customer_id], [customer_id], Inner
:- *(1) Sort [customer_id ASC], false, 0
:  +- Exchange hashpartitioning(customer_id, 200)
:     +- *(1) Filter isnotnull(customer_id)
:        +- *(1) Scan parquet orders [customer_id, amount]
+- *(2) Sort [customer_id ASC], false, 0
   +- Exchange hashpartitioning(customer_id, 200)
      +- *(2) Filter isnotnull(customer_id)
         +- *(2) Scan parquet customers [customer_id, name]

-- Read bottom-up:
-- 1. Both tables are scanned with column pruning
-- 2. Null keys are filtered (cannot match in inner join)
-- 3. Exchange = shuffle by customer_id into 200 partitions
-- 4. Both sides are sorted within each partition
-- 5. SortMergeJoin merges the sorted streams

The key insight: two Exchange nodes mean two shuffles. If the customers table is small enough, a broadcast join would eliminate both shuffles. Mentioning this optimization in an interview shows you can read plans and reason about alternatives.

Common Mistakes in Spark Interviews

Explaining what Spark does without understanding why: knowing that groupByKey is bad without being able to explain the memory implications

Confusing shuffle partitions (spark.sql.shuffle.partitions) with input partitions (the number of files or blocks read)

Claiming Spark is always faster than single-node processing, ignoring the overhead of serialization and network transfer for small datasets

Not mentioning AQE when discussing join optimization, which is a standard feature in Spark 3.x

Treating RDDs and DataFrames as interchangeable without discussing Catalyst optimization and Tungsten execution

Ignoring executor memory overhead (spark.executor.memoryOverhead), which causes container kills on YARN/K8s

Spark Interview Questions FAQ

How deeply should I understand Spark internals for a data engineering interview?
Know the driver/executor model, how shuffles work, what stages and tasks are, and how to read the Spark UI. You do not need to understand the scheduler implementation or network protocol details. Focus on being able to diagnose performance problems and explain your reasoning.
Do companies still ask about RDDs?
Rarely as a primary topic, but RDD knowledge demonstrates depth. You should understand that DataFrames are built on top of RDDs, that RDD lineage provides fault tolerance, and that RDDs bypass Catalyst optimization. Some companies ask about RDD operations to test fundamental understanding.
Should I study Spark with Scala or Python for interviews?
Study whichever the job description uses. If neither is specified, PySpark is the safer choice because it is more common in data engineering roles. The concepts (DAGs, shuffles, partitioning) are language-agnostic, so studying one prepares you for questions about the other.
What is the difference between Spark interview questions for data engineers versus data scientists?
Data engineering interviews focus on performance tuning, architecture, fault tolerance, and production reliability. Data science interviews focus on MLlib, feature engineering, and model serving. If you are interviewing for a DE role, prioritize shuffle optimization, join strategies, and memory management over ML pipelines.

Practice Spark Interview Questions

Build the distributed systems intuition that interviewers test for. Practice with real execution environments and immediate feedback.