Distributed Compute: Beginner
'Explain how Spark works.' This is the single most-asked question in data engineering interviews, with 183 questions across 70+ companies. The interviewer wants you to explain drivers and executors, why shuffles are expensive, and what causes a Spark job to be slow. Here is exactly how to answer.
Spark Execution Model
Explain Spark architecture clearly and confidently
When you hear these in an interview, this is the concept being tested
- ▸"Explain how Spark works."
- ▸"What is the difference between a driver and an executor?"
- ▸"What happens when you submit a Spark job?"
What They Want to Hear
The Vocabulary to Use
| Term | What It Is | One-Liner for Interviews |
|---|---|---|
| Driver | The coordinator process | Plans the DAG and assigns tasks to executors |
| Executor | A worker process on a cluster node | Processes one or more partitions in parallel |
| Partition | A chunk of data | Each partition is processed by one task on one executor |
| Task | A unit of work | One task processes one partition through one stage |
| Stage | A group of tasks with no shuffle between them | Stage boundaries are created by shuffle operations |
| DAG | Directed acyclic graph of operations | Spark's execution plan; built lazily, executed on action |
After your initial answer, expect these probes
- ▸"What triggers Spark to actually run?" An action. Transformations like .filter() and .join() are lazy and only extend the DAG. Actions like .count() and .collect(), or writing output through .write (for example .write.parquet(...)), trigger execution.
- ▸"What happens if an executor fails?" The driver re-assigns the failed tasks to other executors. If the data partition was lost, Spark re-computes it from the source using the DAG lineage.
- ▸"What is the DAG?" A directed acyclic graph of all transformations. Spark uses it to optimize the execution plan before running anything. Think of it as a recipe that Spark reads before cooking.
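To make lazy evaluation concrete, here is a minimal pure-Python sketch. `LazyDataset` is a hypothetical toy class, not Spark's API: transformations only record steps in a plan, and the plan runs when an action such as `collect()` or `count()` is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, plan=(), log=None):
        self._source = list(source)
        self._plan = tuple(plan)                     # recorded transformations, in order
        self.log = log if log is not None else []    # one entry per real execution

    def filter(self, predicate):
        # Transformation: extend the plan, run nothing.
        step = lambda rows: (r for r in rows if predicate(r))
        return LazyDataset(self._source, self._plan + (step,), self.log)

    def map(self, fn):
        # Transformation: extend the plan, run nothing.
        step = lambda rows: (fn(r) for r in rows)
        return LazyDataset(self._source, self._plan + (step,), self.log)

    def _run(self):
        # Only actions reach this point.
        self.log.append(f"executed {len(self._plan)} step(s)")
        rows = iter(self._source)
        for step in self._plan:
            rows = step(rows)
        return rows

    def collect(self):
        # Action: run the whole plan and return the rows.
        return list(self._run())

    def count(self):
        # Action: run the whole plan and count the rows.
        return sum(1 for _ in self._run())


ds = LazyDataset(range(10))
evens_squared = ds.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
runs_before_action = len(evens_squared.log)    # nothing has executed yet

result = evens_squared.collect()               # first action runs the plan
total = evens_squared.count()                  # second action runs it again
runs_after_two_actions = len(evens_squared.log)
```

In this toy model each action re-runs the full plan, which is the same reason caching pays off when one dataset feeds several actions.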
Distributed Primitives
Explain transformations, actions, and why narrow vs wide matters
When you hear these in an interview, this is the concept being tested
- ▸"What is a transformation vs an action?"
- ▸"What does lazy evaluation mean in Spark?"
- ▸"What is a narrow vs wide transformation?"
What They Want to Hear
Narrow transformations
- filter(), map(), select(), withColumn()
- Each partition processed independently
- No data moves between executors
- Fast: no network I/O
Wide transformations
- groupBy(), join(), repartition(), distinct()
- Data must move between executors
- Creates a shuffle (network + disk I/O)
- Slow: the #1 performance bottleneck
After your initial answer, expect these probes
- ▸"Why does it matter if a transformation is narrow or wide?" Wide transformations create stage boundaries. Data must be serialized, sent over the network, and deserialized. This is the most expensive operation in Spark.
- ▸"What is the Catalyst optimizer?" Spark's query planner. It rearranges transformations for efficiency: pushes filters before joins, chooses join strategies, and eliminates unnecessary columns. You write the logic; Catalyst optimizes the execution.
- ▸"Name an action." .count(), .collect(), .show(), .write(). These trigger the DAG to execute. Until an action is called, Spark has done nothing.
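The narrow vs wide distinction can be sketched with plain Python lists standing in for partitions. This is a toy model under stated assumptions: partition i is imagined to live on executor i, and the partitioner (`ord(key) % num_out`, single-character keys only) is made up for the example and is not Spark's hash partitioner.

```python
from collections import defaultdict

# Three partitions of (key, value) rows; each inner list stands for one partition.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def narrow_map(parts, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(row) for row in part] for part in parts]


def wide_group_sum(parts, num_out):
    """Wide: rows are routed by key, so one output partition may need rows
    from every input partition. Returns the grouped sums and the number of
    rows whose destination differs from their source partition."""
    out = [defaultdict(int) for _ in range(num_out)]
    moved = 0
    for src_idx, part in enumerate(parts):
        for key, value in part:
            dst_idx = ord(key) % num_out      # toy partitioner (assumption)
            if dst_idx != src_idx:
                moved += 1                    # this row would cross the network
            out[dst_idx][key] += value
    return [dict(d) for d in out], moved


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
grouped, rows_moved = wide_group_sum(partitions, num_out=3)
```

The narrow step leaves every row in its original partition. The grouped sum has to relocate 4 of the 6 rows before it can add anything up, which is the shuffle.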
Shuffle Operations
Explain shuffles and how to avoid them with broadcast joins
When you hear these in an interview, this is the concept being tested
- ▸"What is a shuffle in Spark?"
- ▸"Why is this join slow?"
- ▸"How do you avoid shuffles?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"When is a broadcast join not possible?" When both sides of the join are too large to fit in executor memory. The default broadcast threshold in Spark (spark.sql.autoBroadcastJoinThreshold) is 10MB. You can increase it, but broadcasting a 1GB table wastes memory on every executor.
- ▸"How do you know if a shuffle is happening?" Check the Spark UI. Stage boundaries in the DAG visualization indicate shuffles. The 'Shuffle Read' and 'Shuffle Write' metrics tell you how much data moved.
- ▸"What causes the most shuffles?" groupBy, join, distinct, repartition, and window functions with PARTITION BY. Anytime Spark needs all rows with the same key on the same executor.
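A broadcast join can be pictured as copying the small table to every partition of the large one and joining locally. The sketch below is a pure-Python illustration, not Spark code; the table contents and function names are hypothetical, and the size check is a simplification of how the 10MB threshold mentioned above is applied.

```python
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # the 10MB default cited above


def fits_broadcast_threshold(table_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Simplified check: is the small side small enough to broadcast?"""
    return table_bytes <= threshold


def broadcast_join(large_parts, small_table):
    """Join each partition of the large side against a local copy of the
    small side. Rows of the large side never leave their partition."""
    joined = []
    for part in large_parts:
        local_copy = dict(small_table)        # each executor gets its own copy
        joined.append(
            [(order_id, code, local_copy[code])
             for order_id, code in part
             if code in local_copy]
        )
    return joined


# Hypothetical data: a partitioned "orders" table and a tiny lookup table.
orders_partitions = [
    [(1, "US"), (2, "DE")],
    [(3, "US"), (4, "FR")],
]
country_names = {"US": "United States", "DE": "Germany", "FR": "France"}

joined = broadcast_join(orders_partitions, country_names)
small_ok = fits_broadcast_threshold(5 * 1024 * 1024)      # 5MB table
large_ok = fits_broadcast_threshold(1024 * 1024 * 1024)   # 1GB table
```

In PySpark the corresponding hint is `pyspark.sql.functions.broadcast(df)` applied to the small side of the join.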
Memory Management
Diagnose memory problems in Spark with the right vocabulary
When you hear these in an interview, this is the concept being tested
- ▸"Your Spark job is running out of memory. What do you check?"
- ▸"How does Spark use memory?"
- ▸"What does spill-to-disk mean?"
What They Want to Hear
| Symptom | Likely Cause | Fix |
|---|---|---|
| Spill to disk | Partitions too large | Increase partition count with repartition() |
| OOM on executor | Single partition too large (data skew) | Salting or isolating the large key |
| OOM on driver | Calling .collect() on large dataset | Never collect() large data. Use .write() instead. |
| Slow joins | Not enough memory for hash table | Broadcast the smaller table or increase executor memory |
After your initial answer, expect these probes
- ▸"How do you decide how much memory to give each executor?" Start with the cluster default (usually 4-8GB). If you see spill-to-disk, increase memory or increase the number of partitions to make each one smaller. A common rule of thumb is about 5 cores per executor, with each core running one task (one partition) at a time.
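- ▸"What is the difference between persist() and cache()?" cache() is shorthand for persist() with the default storage level: memory-only for RDDs, memory-and-disk for DataFrames. persist() lets you choose the storage level explicitly: memory-only, memory-and-disk, or disk-only. Use persist(StorageLevel.MEMORY_AND_DISK) for safety, so partitions that do not fit in memory go to disk instead of being recomputed.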
- ▸"When should you cache data?" Only when the same dataset is used in multiple actions. Caching a dataset used once wastes memory. Caching a dataset used in 5 different joins saves 4 re-computations.
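The "salting" fix for data skew from the table above can be sketched in plain Python. Everything here is an assumption for illustration: the partitioner is a toy (sum of character codes modulo the partition count), the data is made up, and salts are assigned round-robin to keep the example deterministic, whereas real jobs usually pick a random salt.

```python
from collections import Counter

NUM_PARTITIONS = 8


def toy_partition(key):
    """Toy partitioner (assumption): sum of character codes mod partition count."""
    return sum(ord(ch) for ch in key) % NUM_PARTITIONS


def max_partition_size(keys):
    """Size of the largest partition after routing every key."""
    sizes = Counter(toy_partition(k) for k in keys)
    return max(sizes.values())


def salt_hot_key(keys, hot_key, num_salts):
    """Append a salt suffix to the hot key so its rows spread over several partitions."""
    salted = []
    seen = 0
    for k in keys:
        if k == hot_key:
            salted.append(f"{k}_{seen % num_salts}")   # round-robin salt
            seen += 1
        else:
            salted.append(k)
    return salted


# Hypothetical skewed data: one key holds 80% of the rows.
keys = ["hot"] * 800 + ["a"] * 50 + ["b"] * 50 + ["c"] * 50 + ["d"] * 50

largest_before = max_partition_size(keys)
largest_after = max_partition_size(salt_hot_key(keys, "hot", num_salts=8))
```

Under these assumptions the largest partition drops from 850 rows to 150. The trade-off is a second aggregation step to merge the partial results for the salted key back together.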
Small File Problem
Explain the small file problem and fix it with coalesce or compaction
When you hear these in an interview, this is the concept being tested
- ▸"Your data lake has millions of small files. What is the impact?"
- ▸"What is the optimal file size for Parquet?"
- ▸"coalesce vs repartition: what is the difference?"
What They Want to Hear
coalesce()
- Reduces partitions without a shuffle
- Merges partitions on the same executor
- Can only reduce, never increase
- Use when: reducing output files after filter
repartition()
- Redistributes data evenly with a shuffle
- Can increase or decrease partition count
- More expensive due to the shuffle
- Use when: fixing skewed partitions or increasing parallelism
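The trade-off between the two can be sketched with lists standing in for partitions. This is a toy model, not Spark's implementation: `coalesce` here merges neighbouring partitions as they are, and `repartition` deals every row out round-robin, which is a stand-in for the full shuffle.

```python
def coalesce(parts, n):
    """Merge neighbouring partitions without redistributing rows (no shuffle).
    Can only reduce: asking for more partitions leaves the count unchanged."""
    if n >= len(parts):
        return [list(p) for p in parts]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged


def repartition(parts, n):
    """Deal every row out round-robin to n new partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in parts for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def sizes(parts):
    return [len(p) for p in parts]


# Hypothetical skewed layout, e.g. after a selective filter.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

coalesced_sizes = sizes(coalesce(skewed, 2))         # still uneven
repartitioned_sizes = sizes(repartition(skewed, 2))  # evened out
coalesce_up_sizes = sizes(coalesce(skewed, 10))      # cannot increase
repartition_up_sizes = sizes(repartition(skewed, 3))
```

In this model coalesce yields partitions of 7 and 2 rows while repartition yields 5 and 4: coalesce is cheaper but inherits the skew, and repartition pays for a shuffle to remove it.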
After your initial answer, expect these probes
- ▸"What causes small files?" Over-partitioning (too many partition columns), streaming jobs writing one file per micro-batch, or a Spark job with too many output partitions.
- ▸"How do you fix small files that already exist?" Run a compaction job: read the small files, coalesce, and write back. Table formats have this built in: Delta Lake has an OPTIMIZE command, and Iceberg has a rewrite_data_files procedure.
- ▸"Is there a too-large file problem?" Yes. Files over 1GB can be hard to process in parallel, especially in non-splittable formats (for example gzip-compressed text), where one task has to read the whole file. Aim for 128-256MB so each task finishes quickly.
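The target file size translates directly into a partition count to pass to coalesce() or repartition() before writing. A minimal arithmetic sketch, assuming the 128MB target from above and hypothetical dataset sizes:

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024   # 128MB target from the guidance above


def output_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Number of output files (partitions) needed to hit the target file size."""
    return max(1, math.ceil(total_bytes / target_bytes))


def small_file_count(total_bytes, file_bytes):
    """How many files the same data occupies at a given (small) file size."""
    return math.ceil(total_bytes / file_bytes)


ten_gib = 10 * 1024 ** 3

files_at_target = output_file_count(ten_gib)       # 10GiB at 128MiB per file
files_at_1kib = small_file_count(ten_gib, 1024)    # same data as 1KiB files
tiny_dataset_files = output_file_count(1024)       # never fewer than one file
```

For a hypothetical 10GiB dataset this gives 80 files at the target size, versus more than ten million files at 1KiB each.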
Answer the Spark architecture question that appears in every technical screen
- Category: Pipeline Architecture
- Difficulty: beginner
- Duration: 20 minutes
- Challenges: 0 hands-on challenges
Topics covered: Spark Execution Model, Distributed Primitives, Shuffle Operations, Memory Management, Small File Problem
Lesson Sections
- Spark Execution Model (concepts: paSparkExecutionModel)
What They Want to Hear 'Spark splits work across a cluster. The driver is the coordinator: it plans the work, divides it into tasks, and sends those tasks to executors. Executors are the workers: each one processes a partition of the data in parallel. The key insight is that Spark is lazy. It builds a plan (the DAG) but does not execute anything until you call an action like .count() or .write().' That is the answer. Driver plans, executors execute, nothing happens until an action triggers it. T
- Distributed Primitives (concepts: paDistributedPrimitives)
What They Want to Hear 'A transformation defines a new dataset from an existing one without executing anything. An action triggers execution and returns a result. Narrow transformations like filter and map process each partition independently. Wide transformations like groupBy and join require data to move between executors, which creates a shuffle.' That is the answer. Narrow = no data movement. Wide = shuffle. This distinction is the foundation of Spark performance.
- Shuffle Operations (concepts: paShuffleOptimization)
What They Want to Hear 'A shuffle redistributes data across executors. It happens when Spark needs to group or join data by a key, and the matching rows are spread across different partitions. Shuffles are expensive because every executor must write its data to disk, send it over the network, and every receiving executor must read and merge it. The number one way to avoid unnecessary shuffles is broadcast joins: if one side of the join is small enough to fit in memory, broadcast it to all execut
- Memory Management (concepts: paMemoryManagement)
What They Want to Hear 'Each executor gets a fixed amount of memory, split between storage (caching data) and execution (shuffles, joins, sorts). When execution memory runs out, Spark spills data to disk, which is much slower. When data cannot be spilled, for example because a single oversized partition does not fit in the executor's memory, the job fails with an out-of-memory error. The fix depends on the cause: too few partitions means each one is too large, so repartition to create smaller chunks. Too much data cached means storage is crowding out execution, so unpersist unused c
- Small File Problem (concepts: paSmallFiles)
What They Want to Hear 'Too many small files kill read performance. Each file requires a separate metadata lookup, a separate file open, and a separate read request. Thousands of 1KB files are far slower to read than one 128MB file with the same data. The target file size is 128MB to 256MB. To fix small files, I use coalesce() to reduce the number of output partitions before writing, or run a compaction job that rewrites small files into larger ones.' That is the answer. Target size, the problem