Distributed Compute: Beginner

'Explain how Spark works.' This is the single most-asked question in data engineering interviews, with 183 questions across 70+ companies. The interviewer wants you to explain drivers and executors, why shuffles are expensive, and what causes a Spark job to be slow. Here is exactly how to answer.

What you will be able to do

Explain Spark's driver-executor architecture in one clear paragraph

Answer 'What is a shuffle and why is it expensive?' without hesitating

Name the top 3 reasons a Spark job is slow

Spark Execution Model

Explain Spark architecture clearly and confidently

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Explain how Spark works."
▸"What is the difference between a driver and an executor?"
▸"What happens when you submit a Spark job?"

What They Want to Hear

'Spark splits work across a cluster. The driver is the coordinator: it plans the work, divides it into tasks, and sends those tasks to executors. Executors are the workers: each one processes partitions of the data in parallel. The key insight is that Spark is lazy. It builds a plan (the DAG) but does not execute anything until you call an action like .count() or a write such as .write.parquet().' That is the answer. Driver plans, executors execute, nothing happens until an action triggers it.
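To make the "plan first, run on action" idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the class `LazyDataset`, its methods, and the helper `is_even` are hypothetical names invented for illustration, and the inner lists stand in for partitions.

```python
from typing import Any, Callable, List


class LazyDataset:
    """Toy stand-in for a Spark dataset: records a plan, runs it only on an action."""

    def __init__(self, partitions: List[List[Any]], plan: tuple = ()):
        self.partitions = partitions  # each inner list plays the role of one partition
        self.plan = plan              # recorded transformations, not yet executed

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: only extends the plan.
        return LazyDataset(self.partitions, self.plan + (("filter", predicate),))

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: only extends the plan.
        return LazyDataset(self.partitions, self.plan + (("map", fn),))

    def _run_task(self, partition: List[Any]) -> List[Any]:
        # One "task": push a single partition through every step of the plan.
        rows = partition
        for kind, fn in self.plan:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def count(self) -> int:
        # Action: run one task per partition and combine the results.
        return sum(len(self._run_task(p)) for p in self.partitions)


calls: List[int] = []


def is_even(x: int) -> bool:
    calls.append(x)  # record every time the predicate really runs
    return x % 2 == 0


ds = LazyDataset([[1, 2, 3], [4, 5, 6]])          # two partitions
evens = ds.filter(is_even).map(lambda x: x * 10)  # builds the plan only
calls_before_action = len(calls)                  # 0: nothing has executed yet
result = evens.count()                            # the action triggers execution
```

In this sketch the predicate is not called while the plan is being built; it runs once per row only when `count()` is invoked, which mirrors the behavior the answer describes.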

What to Whiteboard

[Diagram: the Driver fans out to Executor 1, Executor 2, and Executor 3.]

The Vocabulary to Use

Term	What It Is	One-Liner for Interviews
Driver	The coordinator process	Plans the DAG and assigns tasks to executors
Executor	A worker process on a cluster node	Processes one or more partitions in parallel
Partition	A chunk of data	Each partition is processed by one task on one executor
Task	A unit of work	One task processes one partition through one stage
Stage	A group of tasks with no shuffle between them	Stage boundaries are created by shuffle operations
DAG	Directed acyclic graph of operations	Spark's execution plan; built lazily, executed on action

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What triggers Spark to actually run?" An action. Transformations like .filter() and .join() are lazy and just build the DAG. Actions like .count(), .collect(), and a write such as .write.parquet() trigger execution.
▸"What happens if an executor fails?" The driver re-assigns the failed tasks to other executors. If the data partition was lost, Spark re-computes it from the source using the DAG lineage.
▸"What is the DAG?" A directed acyclic graph of all transformations. Spark uses it to optimize the execution plan before running anything. Think of it as a recipe that Spark reads before cooking.

KEY TAKEAWAYS

Say: 'Driver plans, executors execute, each partition is processed by one task in parallel.'

Spark is lazy: transformations build the DAG, actions trigger execution

Know the vocabulary: driver, executor, partition, task, stage, DAG

Distributed Primitives

Explain transformations, actions, and why narrow vs wide matters

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What is a transformation vs an action?"
▸"What does lazy evaluation mean in Spark?"
▸"What is a narrow vs wide transformation?"

What They Want to Hear

'A transformation defines a new dataset from an existing one without executing anything. An action triggers execution and returns a result. Narrow transformations like filter and map process each partition independently. Wide transformations like groupBy and join require data to move between executors, which creates a shuffle.' That is the answer. Narrow = no data movement. Wide = shuffle. This distinction is the foundation of Spark performance.

•Narrow Transformations

filter(), map(), select(), withColumn()
Each partition processed independently
No data moves between executors
Fast: no network I/O

•Wide Transformations

groupBy(), join(), repartition(), distinct()
Data must move between executors
Creates a shuffle (network + disk I/O)
Slow: the #1 performance bottleneck
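The link between wide transformations and stage boundaries can be sketched as a small function. This is a simplified model under stated assumptions: the plan is treated as a linear list of operation names (real Spark plans are graphs, and joins have two inputs), and `WIDE_OPS` and `split_into_stages` are hypothetical names, not part of Spark.

```python
from typing import List

# Assumption: a hand-picked set of wide operations, taken from the list above.
WIDE_OPS = {"groupBy", "join", "repartition", "distinct"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear list of operations into stages, cutting at each wide op."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS:
            # A wide op needs shuffled input, so it starts a new stage.
            stages.append([op])
        else:
            # A narrow op stays in the same stage as the previous step.
            stages[-1].append(op)
    return [s for s in stages if s]  # drop an empty leading stage, if any


plan = ["filter", "withColumn", "groupBy", "select", "join", "map"]
stages = split_into_stages(plan)
# stages == [["filter", "withColumn"], ["groupBy", "select"], ["join", "map"]]
```

Two wide operations in the plan produce three stages here, which is the pattern to look for when counting shuffles in a job.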

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Why does it matter if a transformation is narrow or wide?" Wide transformations create stage boundaries. Data must be serialized, sent over the network, and deserialized. This is the most expensive operation in Spark.
▸"What is the Catalyst optimizer?" Spark's query planner. It rearranges transformations for efficiency: pushes filters before joins, chooses join strategies, and eliminates unnecessary columns. You write the logic; Catalyst optimizes the execution.
▸"Name an action." .count(), .collect(), .show(), or a write such as .write.parquet(). These trigger the DAG to execute. Until an action is called, Spark has not processed any data.

KEY TAKEAWAYS

Say: 'Narrow transformations need no data movement. Wide transformations cause shuffles. Minimize wide operations.'

Transformations are lazy (build the DAG). Actions trigger execution.

The Catalyst optimizer rearranges your code for performance. Trust it, but know what it does.

Shuffle Operations

Explain shuffles and how to avoid them with broadcast joins

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What is a shuffle in Spark?"
▸"Why is this join slow?"
▸"How do you avoid shuffles?"

What They Want to Hear

'A shuffle redistributes data across executors. It happens when Spark needs to group or join data by a key, and the matching rows are spread across different partitions. Shuffles are expensive because every executor must write its data to disk, send it over the network, and every receiving executor must read and merge it. The number one way to avoid unnecessary shuffles is broadcast joins: if one side of the join is small enough to fit in memory, broadcast it to all executors so no shuffle is needed.' That is the answer. Shuffle = redistribute = expensive. Broadcast = avoid the shuffle.
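A shuffle can be illustrated with a toy hash-partitioning step in plain Python. This is a sketch, not Spark's implementation: `shuffle_by_key` is a hypothetical helper, the partitioner (sum of character codes modulo the partition count) is an assumption chosen to be deterministic, and "moved" simply counts rows whose partition index changed as a stand-in for network transfer.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (key, value)


def shuffle_by_key(partitions: List[List[Row]], num_out: int) -> Tuple[List[List[Row]], int]:
    """Redistribute rows so that equal keys share one output partition.

    Returns the new partitions and the number of rows whose partition
    index changed (a stand-in for data sent over the network).
    """
    out: List[List[Row]] = [[] for _ in range(num_out)]
    moved = 0
    for src_index, part in enumerate(partitions):
        for key, value in part:
            # Deterministic toy partitioner: sum of character codes mod num_out.
            dest = sum(map(ord, key)) % num_out
            out[dest].append((key, value))
            if dest != src_index:
                moved += 1
    return out, moved


# Before the shuffle, both keys are spread across both partitions.
before = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]
shuffled, moved = shuffle_by_key(before, num_out=2)

# After the shuffle, a per-key aggregation needs only local data.
totals: Dict[str, int] = {}
for part in shuffled:
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
```

Half of the rows change partition in this small example. On a real cluster each of those moves also involves serialization, disk writes, and network transfer, which is where the cost comes from.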

What to Whiteboard

[Diagram: Executor 1 and Executor 2 feed a Shuffle step, which redistributes the data to Executor 1 and Executor 2 for the next stage.]

The Curveball Follow-ups

After your initial answer, expect these probes

▸"When is a broadcast join not possible?" When both sides of the join are too large to fit in executor memory. The default automatic broadcast threshold in Spark is 10MB (spark.sql.autoBroadcastJoinThreshold). You can increase it, but broadcasting a 1GB table puts a full copy in memory on every executor.
▸"How do you know if a shuffle is happening?" Check the Spark UI. Stage boundaries in the DAG visualization indicate shuffles. The 'Shuffle Read' and 'Shuffle Write' metrics tell you how much data moved.
▸"What causes the most shuffles?" groupBy, join, distinct, repartition, and window functions with PARTITION BY. Anytime Spark needs all rows with the same key on the same executor.
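The broadcast join idea discussed above can be sketched in plain Python: the large side stays in its partitions and every partition looks up matches in a full copy of the small side. The function `broadcast_join` and the sample `orders` and `countries` data are hypothetical, made up for illustration only.

```python
from typing import Dict, List, Tuple


def broadcast_join(
    large_partitions: List[List[Tuple[int, str]]],
    small_table: Dict[int, str],
) -> List[List[Tuple[int, str, str]]]:
    """Inner-join each partition of the large side against a copy of the small side."""
    joined = []
    for part in large_partitions:
        # In this toy model every "executor" reads the same dict: the broadcast copy.
        lookup = small_table
        # Rows of the large side are joined in place; none change partition.
        joined.append([(k, v, lookup[k]) for k, v in part if k in lookup])
    return joined


# Hypothetical data: a large "orders" side in two partitions, a small lookup table.
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
countries = {1: "US", 2: "DE"}

joined = broadcast_join(orders, countries)
```

The output has the same number of partitions as the large input and no row of the large side is moved, which is why this strategy avoids the shuffle. The cost is one copy of the small table per executor.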

KEY TAKEAWAYS

Say: 'Shuffles redistribute data across executors. They are the most expensive operation in Spark.'

Broadcast join: send the small table to all executors. No shuffle needed.

Check the Spark UI for shuffle read/write metrics to find bottlenecks

Memory Management

Diagnose memory problems in Spark with the right vocabulary

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Your Spark job is running out of memory. What do you check?"
▸"How does Spark use memory?"
▸"What does spill-to-disk mean?"

What They Want to Hear

'Each executor gets a fixed amount of memory, split between storage (caching data) and execution (shuffles, joins, sorts). When execution memory runs out, Spark spills data to disk, which is much slower. When a task needs more memory than the executor has and the data cannot be spilled, the job fails with an out-of-memory error. The fix depends on the cause: too few partitions means each one is too large, so repartition to create smaller chunks. Too much data cached means storage is crowding out execution, so unpersist unused caches.' That is the answer. Memory splits into storage and execution. Spill to disk is the warning sign.

Symptom	Likely Cause	Fix
Spill to disk	Partitions too large	Increase partition count with repartition()
OOM on executor	Single partition too large (data skew)	Salting or isolating the large key
OOM on driver	Calling .collect() on large dataset	Never collect() large data. Write the result out with .write instead.
Slow joins	Not enough memory for hash table	Broadcast the smaller table or increase executor memory
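The "salting" fix for data skew in the table above can be sketched with a toy partitioner. Everything here is an assumption for illustration: `partition_of` and `partition_sizes` are hypothetical helpers, the partitioner is a simple character-code sum, and the salt is assigned round-robin so the result is deterministic (real jobs usually use a random salt, and a salted join also has to replicate the other side once per salt value).

```python
from typing import List


def partition_of(key: str, salt: int, num_partitions: int) -> int:
    # Toy partitioner: character-code sum plus the salt, mod the partition count.
    return (sum(map(ord, key)) + salt) % num_partitions


def partition_sizes(keys: List[str], num_partitions: int, num_salts: int = 1) -> List[int]:
    """Count rows per partition; num_salts=1 means no salting (salt is always 0)."""
    counts = [0] * num_partitions
    for i, key in enumerate(keys):
        salt = i % num_salts  # round-robin salt keeps the sketch deterministic
        counts[partition_of(key, salt, num_partitions)] += 1
    return counts


# One hot key dominates the data set.
keys = ["hot"] * 8 + ["a", "b"]

unsalted = partition_sizes(keys, num_partitions=4)             # [0, 1, 1, 8]
salted = partition_sizes(keys, num_partitions=4, num_salts=4)  # [2, 3, 2, 3]
```

Without the salt, one partition holds 8 of the 10 rows. With four salt values, the hot key is spread over all four partitions and the largest partition holds 3 rows.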

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you decide how much memory to give each executor?" Start with the cluster default (usually 4-8GB). If you see spill-to-disk, increase memory or increase the number of partitions to make each one smaller. A common rule of thumb is about 5 cores per executor, with each core running one task (one partition) at a time.
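▸"What is the difference between persist() and cache()?" cache() is persist() with the default storage level: memory-only for RDDs, memory-and-disk for DataFrames. persist() lets you choose the storage level: memory-only, memory-and-disk, or disk-only. Use persist(StorageLevel.MEMORY_AND_DISK) when you want partitions that do not fit in memory to go to disk instead of being recomputed.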
▸"When should you cache data?" Only when the same dataset is used in multiple actions. Caching a dataset used once wastes memory. Caching a dataset used in 5 different joins saves 4 re-computations.
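The "5 uses, 4 re-computations saved" arithmetic can be checked with a toy counter. This is a sketch in plain Python, not Spark: the `Dataset` class and `run_five_actions` helper are hypothetical, and `_compute` stands in for re-running the full lineage of a dataset.

```python
from typing import List, Optional


class Dataset:
    """Toy dataset that counts how many times its lineage is recomputed."""

    def __init__(self, rows: List[int], cached: bool = False):
        self.rows = rows
        self.cached = cached
        self.computations = 0
        self._materialized: Optional[List[int]] = None

    def _compute(self) -> List[int]:
        # With caching on, reuse the stored result after the first computation.
        if self.cached and self._materialized is not None:
            return self._materialized
        self.computations += 1  # stands in for re-running every upstream step
        result = [r * 2 for r in self.rows]
        if self.cached:
            self._materialized = result
        return result

    def count(self) -> int:
        # Action: needs the computed rows.
        return len(self._compute())


def run_five_actions(ds: Dataset) -> int:
    for _ in range(5):
        ds.count()
    return ds.computations


uncached_runs = run_five_actions(Dataset([1, 2, 3]))             # 5
cached_runs = run_five_actions(Dataset([1, 2, 3], cached=True))  # 1
saved = uncached_runs - cached_runs                              # 4
```

If the dataset were used in only one action, both variants would compute it once and the cached copy would only occupy memory.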

KEY TAKEAWAYS

Say: 'Executor memory splits between storage and execution. Spill to disk means partitions are too large.'

OOM on driver = you called .collect() on too much data. Never collect large datasets.

Cache only when the same data is reused. Caching once-used data wastes memory.

Small File Problem

Explain the small file problem and fix it with coalesce or compaction

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Your data lake has millions of small files. What is the impact?"
▸"What is the optimal file size for Parquet?"
▸"coalesce vs repartition: what is the difference?"

What They Want to Hear

'Too many small files kill read performance. Each file requires a separate metadata lookup, a separate file open, and a separate read request. Roughly 130,000 1KB files are far slower to read than one 128MB file with the same data. The target file size is 128MB to 256MB. To fix small files, I use coalesce() to reduce the number of output partitions before writing, or run a compaction job that rewrites small files into larger ones.' That is the answer. Target size, the problem, and two fixes.
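The file-count arithmetic behind the target size is easy to verify. The sketch below uses binary units (1MB = 1024KB) and a hypothetical helper, `files_needed`; the 10GB dataset is an invented example, not a figure from the lesson.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def files_needed(total_bytes: int, file_bytes: int) -> int:
    """Number of files of size file_bytes needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // file_bytes)  # integer ceiling division


# How many 1KB files hold the same data as a single 128MB file?
small_file_count = files_needed(128 * MB, 1 * KB)  # 131072

# How many output files should a hypothetical 10GB dataset produce at 128MB each?
output_files = files_needed(10 * GB, 128 * MB)     # 80
```

The second number is the kind of value you would pass as the target partition count before writing, so that each output file lands near the 128MB target.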

•coalesce()

Reduces partitions without a shuffle
Merges partitions on the same executor
Can only reduce, never increase
Use when: reducing output files after filter

•repartition()

Redistributes data evenly with a shuffle
Can increase or decrease partition count
More expensive due to the shuffle
Use when: fixing skewed partitions or increasing parallelism
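The difference between the two operations can be sketched with lists standing in for partitions. This is a toy model under stated assumptions: the functions below only imitate the behavior described above, the contiguous grouping in `coalesce` is a simplification (Spark picks which partitions to merge based on where they live), and the round-robin dealing in `repartition` is one simple way to model an even redistribution.

```python
from typing import List


def coalesce(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Merge whole partitions into n groups; rows stay with their partition-mates."""
    if n >= len(partitions):
        return partitions  # can only reduce, never increase
    out: List[List[int]] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Contiguous grouping: neighbouring partitions are merged together.
        out[i * n // len(partitions)].extend(part)
    return out


def repartition(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Deal every row round-robin into n partitions (models a full shuffle)."""
    out: List[List[int]] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]  # one large partition, three tiny ones

merged = coalesce(skewed, 2)       # sizes [7, 2]: fewer partitions, still skewed
balanced = repartition(skewed, 2)  # sizes [5, 4]: even, but every row was re-dealt
```

The coalesced result keeps the skew because whole partitions are merged as they are, while the repartitioned result is balanced at the cost of touching every row.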

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What causes small files?" Over-partitioning (too many partition columns), streaming jobs writing one file per micro-batch, or a Spark job with too many output partitions.
▸"How do you fix small files that already exist?" Run a compaction job: read the small files, coalesce, and write back. Delta Lake has a built-in OPTIMIZE command and Iceberg has a rewrite_data_files procedure that do this for you.
▸"Is there a too-large file problem?" Yes. Files over 1GB can be hard to process in parallel, especially in non-splittable formats such as gzip-compressed text, where each file is processed by one task. Aim for 128-256MB so each task finishes quickly.

KEY TAKEAWAYS

Say: 'Target 128-256MB per file. Use coalesce() to reduce output files. Run compaction for existing small files.'

coalesce() is cheap (no shuffle). repartition() is expensive (full shuffle). Pick accordingly.

Delta Lake OPTIMIZE and Iceberg compaction rewrite small files into larger ones for you

Answer the Spark architecture question that appears in every technical screen

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Spark Execution Model, Distributed Primitives, Shuffle Operations, Memory Management, Small File Problem
