Distributed Compute: Beginner

'Explain how Spark works.' This is the single most-asked question in data engineering interviews, with 183 questions across 70+ companies. The interviewer wants you to explain drivers and executors, why shuffles are expensive, and what causes a Spark job to be slow. Here is exactly how to answer.

Spark Execution Model

Explain Spark architecture clearly and confidently

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Explain how Spark works."
  • "What is the difference between a driver and an executor?"
  • "What happens when you submit a Spark job?"

What They Want to Hear

'Spark splits work across a cluster. The driver is the coordinator: it plans the work, divides it into tasks, and sends those tasks to executors. Executors are the workers: each one processes a partition of the data in parallel. The key insight is that Spark is lazy. It builds a plan (the DAG) but does not execute anything until you call an action like .count() or .write().' That is the answer. Driver plans, executors execute, nothing happens until an action triggers it.
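The lazy behavior described above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's API: the `LazyDataset` class and its methods are hypothetical names invented for illustration. Transformations only record steps in a plan, and the action runs one task per partition.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset (hypothetical, not the Spark API)."""

    def __init__(self, partitions, plan=()):
        self.partitions = partitions  # list of lists, one list per partition
        self.plan = tuple(plan)       # recorded transformations (the "DAG")

    def filter(self, predicate):
        # Transformation: record the step, execute nothing.
        return LazyDataset(self.partitions, self.plan + (("filter", predicate),))

    def map(self, fn):
        # Transformation: record the step, execute nothing.
        return LazyDataset(self.partitions, self.plan + (("map", fn),))

    def _run_task(self, partition):
        # One task pushes one partition through every recorded step.
        rows = partition
        for kind, fn in self.plan:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def count(self):
        # Action: run one task per partition, then combine the results.
        return sum(len(self._run_task(p)) for p in self.partitions)


calls = []

def is_even(x):
    calls.append(x)  # lets us observe when the predicate actually runs
    return x % 2 == 0

data = LazyDataset([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
evens = data.filter(is_even).map(lambda x: x * 10)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = evens.count()            # the action triggers all three tasks
```

In this model the predicate is never called until `count()` runs, which mirrors the "nothing happens until an action triggers it" point.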

What to Whiteboard

Driver: plans the DAG, assigns tasks
  -> tasks -> Executor 1: processes partitions 1-3
  -> tasks -> Executor 2: processes partitions 4-6
  -> tasks -> Executor 3: processes partitions 7-9

The Vocabulary to Use

Term | What It Is | One-Liner for Interviews
Driver | The coordinator process | Plans the DAG and assigns tasks to executors
Executor | A worker process on a cluster node | Processes one or more partitions in parallel
Partition | A chunk of data | Each partition is processed by one task on one executor
Task | A unit of work | One task processes one partition through one stage
Stage | A group of tasks with no shuffle between them | Stage boundaries are created by shuffle operations
DAG | Directed acyclic graph of operations | Spark's execution plan; built lazily, executed on action

The Curveball Follow-ups

After your initial answer, expect these probes

  • "What triggers Spark to actually run?" An action. Transformations like .filter() and .join() are lazy and just build the DAG. Actions like .count(), .collect(), and .write() trigger execution.
  • "What happens if an executor fails?" The driver re-assigns the failed tasks to other executors. If the data partition was lost, Spark re-computes it from the source using the DAG lineage.
  • "What is the DAG?" A directed acyclic graph of all transformations. Spark uses it to optimize the execution plan before running anything. Think of it as a recipe that Spark reads before cooking.
KEY TAKEAWAYS
Say: 'Driver plans, executors execute, each partition is processed by one task in parallel.'
Spark is lazy: transformations build the DAG, actions trigger execution
Know the vocabulary: driver, executor, partition, task, stage, DAG

Distributed Primitives

Explain transformations, actions, and why narrow vs wide matters

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is a transformation vs an action?"
  • "What does lazy evaluation mean in Spark?"
  • "What is a narrow vs wide transformation?"

What They Want to Hear

'A transformation defines a new dataset from an existing one without executing anything. An action triggers execution and returns a result. Narrow transformations like filter and map process each partition independently. Wide transformations like groupBy and join require data to move between executors, which creates a shuffle.' That is the answer. Narrow = no data movement. Wide = shuffle. This distinction is the foundation of Spark performance.
Narrow Transformations
  • filter(), map(), select(), withColumn()
  • Each partition processed independently
  • No data moves between executors
  • Fast: no network I/O
Wide Transformations
  • groupBy(), join(), repartition(), distinct()
  • Data must move between executors
  • Creates a shuffle (network + disk I/O)
  • Slow: the #1 performance bottleneck
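As a rough illustration of how wide transformations cut a job into stages, the sketch below walks a linear list of operation names and starts a new stage at each wide one. It is a simplification under stated assumptions: `split_into_stages` is a hypothetical helper, the plan is treated as a straight line (a real join has two inputs), and the narrow/wide classification is just the lists above.

```python
# Wide operations taken from the list above; everything else is treated as narrow.
WIDE = {"groupBy", "join", "repartition", "distinct"}

def split_into_stages(operations):
    """Group operation names into stages, starting a new stage at each wide op."""
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])  # a shuffle must finish before this op can run
        stages[-1].append(op)
    return stages

plan = ["filter", "select", "groupBy", "withColumn", "join", "select"]
stages = split_into_stages(plan)
shuffle_count = len(stages) - 1  # one shuffle per stage boundary
```

Here the six operations fall into three stages separated by two shuffles, while the narrow operations ride along in whichever stage they land in.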
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Why does it matter if a transformation is narrow or wide?" Wide transformations create stage boundaries. Data must be serialized, sent over the network, and deserialized. This is the most expensive operation in Spark.
  • "What is the Catalyst optimizer?" Spark's query planner. It rearranges transformations for efficiency: pushes filters before joins, chooses join strategies, and eliminates unnecessary columns. You write the logic; Catalyst optimizes the execution.
  • "Name an action." .count(), .collect(), .show(), .write(). These trigger the DAG to execute. Until an action is called, Spark has done nothing.
KEY TAKEAWAYS
Say: 'Narrow transformations need no data movement. Wide transformations cause shuffles. Minimize wide operations.'
Transformations are lazy (build the DAG). Actions trigger execution.
The Catalyst optimizer rearranges your code for performance. Trust it, but know what it does.

Shuffle Operations

Explain shuffles and how to avoid them with broadcast joins

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is a shuffle in Spark?"
  • "Why is this join slow?"
  • "How do you avoid shuffles?"

What They Want to Hear

'A shuffle redistributes data across executors. It happens when Spark needs to group or join data by a key, and the matching rows are spread across different partitions. Shuffles are expensive because every executor must write its data to disk, send it over the network, and every receiving executor must read and merge it. The number one way to avoid unnecessary shuffles is broadcast joins: if one side of the join is small enough to fit in memory, broadcast it to all executors so no shuffle is needed.' That is the answer. Shuffle = redistribute = expensive. Broadcast = avoid the shuffle.
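The redistribution step can be sketched as hash partitioning in plain Python. This is a toy model, not Spark's shuffle implementation: `partition_for` and `shuffle` are hypothetical helpers, and CRC32 is used only because it gives a stable hash across runs.

```python
import zlib

def partition_for(key, num_partitions):
    # Stable hash: Python's built-in hash() of a str changes between processes.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle(executors, num_partitions):
    """Send every (key, value) record to the partition that owns its key.

    Returns the new partitions and how many records changed location.
    """
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, records in enumerate(executors):
        for key, value in records:
            target = partition_for(key, num_partitions)
            out[target].append((key, value))
            if target != source_index:
                moved += 1  # in a real cluster this record crosses the network
    return out, moved

before = [
    [("A", 1), ("B", 2), ("C", 3)],  # executor 1
    [("A", 4), ("D", 5), ("E", 6)],  # executor 2
]
after, moved = shuffle(before, num_partitions=2)
```

After the call, every key lives in exactly one partition, which is what a groupBy or join needs. The `moved` counter stands in for the network traffic that makes the real operation expensive.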

What to Whiteboard

Before the shuffle (each executor writes its shuffle output to disk):
  Executor 1: has keys A, B, C
  Executor 2: has keys A, D, E

Shuffle: redistribute by key

After the shuffle (each executor reads and merges its keys):
  Executor 1: now has all A, B keys
  Executor 2: now has all C, D, E keys

The Curveball Follow-ups

After your initial answer, expect these probes

  • "When is a broadcast join not possible?" When both sides of the join are too large to fit in executor memory. The default broadcast threshold in Spark is 10MB. You can increase it, but broadcasting a 1GB table wastes memory on every executor.
  • "How do you know if a shuffle is happening?" Check the Spark UI. Stage boundaries in the DAG visualization indicate shuffles. The 'Shuffle Read' and 'Shuffle Write' metrics tell you how much data moved.
  • "What causes the most shuffles?" groupBy, join, distinct, repartition, and window functions with PARTITION BY: any operation where Spark needs all rows with the same key on the same executor.
KEY TAKEAWAYS
Say: 'Shuffles redistribute data across executors. They are the most expensive operation in Spark.'
Broadcast join: send the small table to all executors. No shuffle needed.
Check the Spark UI for shuffle read/write metrics to find bottlenecks
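The broadcast join takeaway can be modeled in a few lines of plain Python. This is a sketch with made-up data and a hypothetical `broadcast_join` helper, not Spark code: the small table becomes a lookup that every partition can consult, so the large side is joined where it already sits.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of the large side against a copied lookup."""
    lookup = dict(small_table)  # the "broadcast" copy every executor would hold
    joined = []
    for partition in large_partitions:
        # Each partition is joined locally; no row leaves its partition.
        joined.append([(k, v, lookup[k]) for k, v in partition if k in lookup])
    return joined

# Hypothetical example data: orders keyed by country code.
orders = [
    [("US", 10), ("DE", 20)],  # partition on executor 1
    [("US", 30), ("FR", 40)],  # partition on executor 2
]
countries = [("US", "United States"), ("DE", "Germany")]

result = broadcast_join(orders, countries)
```

The output keeps the same two partitions as the input, which is the whole point: the large table is never redistributed. In PySpark the equivalent hint is `pyspark.sql.functions.broadcast`, and automatic broadcasting is governed by `spark.sql.autoBroadcastJoinThreshold`.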

Memory Management

Diagnose memory problems in Spark with the right vocabulary

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Your Spark job is running out of memory. What do you check?"
  • "How does Spark use memory?"
  • "What does spill-to-disk mean?"

What They Want to Hear

'Each executor gets a fixed amount of memory, split between storage (caching data) and execution (shuffles, joins, sorts). When execution memory runs out, Spark spills data to disk, which is much slower. When a task needs more memory than it can get even after spilling, for example because a single partition is far too large, the executor fails with an out-of-memory error. The fix depends on the cause: too few partitions means each one is too large, so repartition to create smaller chunks. Too much data cached means storage is crowding out execution, so unpersist unused caches.' That is the answer. Memory splits into storage and execution. Spill to disk is the warning sign.
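To put rough numbers on the storage/execution split, the sketch below applies the defaults documented for Spark's unified memory manager: about 300MB reserved, `spark.memory.fraction` of 0.6, and `spark.memory.storageFraction` of 0.5. The `memory_regions` helper is hypothetical, and it ignores the fact that the two regions can borrow from each other at runtime.

```python
RESERVED_MB = 300  # memory Spark sets aside before dividing the heap

def memory_regions(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the storage and execution regions for one executor heap."""
    unified = (executor_heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    execution = unified - storage
    return {
        "unified_mb": round(unified, 1),
        "storage_mb": round(storage, 1),
        "execution_mb": round(execution, 1),
    }

regions = memory_regions(4096)  # a 4GB executor heap
```

For a 4GB heap this leaves a little over 1.1GB for execution before spilling starts, which helps explain why a single oversized partition can hurt even on a seemingly generous executor.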

Symptom | Likely Cause | Fix
Spill to disk | Partitions too large | Increase partition count with repartition()
OOM on executor | Single partition too large (data skew) | Salting or isolating the large key
OOM on driver | Calling .collect() on large dataset | Never collect() large data. Use .write() instead.
Slow joins | Not enough memory for hash table | Broadcast the smaller table or increase executor memory
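
The salting fix for a skewed key can be sketched in plain Python. This is a toy model with hypothetical names and made-up data: each row's key gets a round-robin suffix, so one hot key becomes several smaller groups that separate tasks can process.

```python
from collections import Counter

def largest_group(keys, salt_buckets=1):
    """Size of the biggest key group after appending a round-robin salt."""
    groups = Counter(f"{key}#{i % salt_buckets}" for i, key in enumerate(keys))
    return max(groups.values())

# Hypothetical skewed dataset: one hot key with 900 rows, 100 keys with 1 row each.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]

before = largest_group(keys)                 # all 900 hot rows land in one group
after = largest_group(keys, salt_buckets=8)  # hot rows split across 8 salted keys
```

In a real aggregation the salted partial results need a second, much smaller aggregation to merge them back under the original key. That extra step is the price of spreading the hot key out.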
The Curveball Follow-ups

After your initial answer, expect these probes

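  • "How do you decide how much memory to give each executor?" Start with the cluster default (often 4-8GB on managed platforms; open-source Spark's own default for spark.executor.memory is 1g). If you see spill-to-disk, increase memory or increase the number of partitions to make each one smaller. A common rule of thumb is about 5 cores per executor; each core runs one task, and so one partition, at a time.
  • "What is the difference between persist() and cache()?" cache() is persist() with the default storage level: MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames and Datasets. persist() lets you choose the storage level explicitly: memory-only, memory-and-disk, or disk-only. Use persist(StorageLevel.MEMORY_AND_DISK) when the data may not fit in memory.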
  • "When should you cache data?" Only when the same dataset is used in multiple actions. Caching a dataset used once wastes memory. Caching a dataset used in 5 different joins saves 4 re-computations.
KEY TAKEAWAYS
Say: 'Executor memory splits between storage and execution. Spill to disk means partitions are too large.'
OOM on driver = you called .collect() on too much data. Never collect large datasets.
Cache only when the same data is reused. Caching once-used data wastes memory.
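
The "5 uses saves 4 re-computations" arithmetic can be checked with a small counter. This is a sketch in plain Python with hypothetical names, where `compute()` stands in for re-running a dataset's whole lineage.

```python
class CountingSource:
    """Stand-in for an expensive dataset; counts how often it is recomputed."""

    def __init__(self):
        self.computations = 0

    def compute(self):
        self.computations += 1
        return [1, 2, 3]

def run_actions(source, n_actions, use_cache):
    """Run n_actions over the dataset and return how many times it was computed."""
    cached = None
    for _ in range(n_actions):
        if use_cache:
            if cached is None:
                cached = source.compute()  # first use pays the full cost
            data = cached
        else:
            data = source.compute()        # every action replays the lineage
        sum(data)                          # the "action" itself
    return source.computations

without_cache = run_actions(CountingSource(), 5, use_cache=False)
with_cache = run_actions(CountingSource(), 5, use_cache=True)
saved = without_cache - with_cache
```

With a single action the two modes compute the dataset the same number of times, so caching only costs memory. The saving appears from the second use onward.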

Small File Problem

Explain the small file problem and fix it with coalesce or compaction

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Your data lake has millions of small files. What is the impact?"
  • "What is the optimal file size for Parquet?"
  • "coalesce vs repartition: what is the difference?"

What They Want to Hear

'Too many small files kill read performance. Each file requires a separate metadata lookup, a separate file open, and a separate read request. Over a hundred thousand 1KB files are far slower to read than one 128MB file with the same data. The target file size is 128MB to 256MB. To fix small files, I use coalesce() to reduce the number of output partitions before writing, or run a compaction job that rewrites small files into larger ones.' That is the answer. Target size, the problem, and two fixes.
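The target-size rule turns into simple arithmetic when choosing how many output partitions to write. The helper below is a hypothetical sketch: it assumes the total output size is already known and uses binary megabytes (MiB).

```python
import math

MIB = 1024 * 1024

def output_file_count(total_bytes, target_file_bytes=128 * MIB):
    """Number of output files needed to land near the target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

files_for_10_gib = output_file_count(10 * 1024 * MIB)  # 10 GiB of output
one_kb_files_in_128_mib = (128 * MIB) // 1024          # small-file equivalent
```

The second figure shows the scale of the problem: the same 128MiB stored as 1KB files means over a hundred thousand separate opens and metadata lookups.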
coalesce()
  • Reduces partitions without a shuffle
  • Merges partitions on the same executor
  • Can only reduce, never increase
  • Use when: reducing output files after filter
repartition()
  • Redistributes data evenly with a shuffle
  • Can increase or decrease partition count
  • More expensive due to the shuffle
  • Use when: fixing skewed partitions or increasing parallelism
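
The difference between the two can be modeled in plain Python. This is a toy sketch with hypothetical functions that share the Spark names but not the implementation: `coalesce` merges whole existing partitions, while `repartition` deals every row out again.

```python
def coalesce(partitions, n):
    """Merge whole neighboring partitions into at most n groups (no row-level shuffle)."""
    n = min(n, len(partitions))  # coalesce can only reduce the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Deal every row round-robin across n new partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

# Hypothetical skewed input: partition sizes 90, 4, 3, 3.
skewed = [list(range(90)), list(range(90, 94)), list(range(94, 97)), list(range(97, 100))]

coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]
```

Coalescing keeps the skew because the big partition is merged as a unit, while repartitioning evens the sizes out at the cost of moving every row. That is why repartition() is the one to reach for when fixing skewed partitions.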
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What causes small files?" Over-partitioning (too many partition columns), streaming jobs writing one file per micro-batch, or a Spark job with too many output partitions.
  • "How do you fix small files that already exist?" Run a compaction job: read the small files, coalesce, and write back. Delta Lake has a built-in OPTIMIZE command and Iceberg has a rewrite_data_files procedure that do this for you.
  • "Is there a too-large file problem?" Yes. Very large files can limit parallelism: a non-splittable file (for example, a gzip-compressed CSV) must be read by a single task, and even splittable formats like Parquet are divided only at row-group boundaries. Aim for 128-256MB so each task finishes quickly.
KEY TAKEAWAYS
Say: 'Target 128-256MB per file. Use coalesce() to reduce output files. Run compaction for existing small files.'
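coalesce() is cheap (no shuffle), though it can lower the parallelism of the stage that feeds it. repartition() is expensive (full shuffle). Pick accordingly.
Delta Lake OPTIMIZE and Iceberg compaction (rewrite_data_files) rewrite small files into larger ones without a hand-written job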

Answer the Spark architecture question that appears in every technical screen

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Spark Execution Model, Distributed Primitives, Shuffle Operations, Memory Management, Small File Problem
