The technical gauntlet every pipeline interview hits
Topics covered: Explain Spark Architecture, This Job Is Slow, Optimize This Join, How Do You Size the Cluster?, What About Data Skew?
"Walk me through how Spark executes a query." This is the opener. If your answer is vague, the interviewer downgrades your level. If it is precise, they trust your debugging answers later. The Driver-Executor Model Spark runs one driver process and N executor processes. The driver parses your code into a logical plan, optimizes it via Catalyst, converts it to a physical plan, and splits that plan into stages. Each stage contains tasks - one task per partition. The driver sends tasks to executo
"This Spark job used to take 20 minutes. Now it takes 3 hours. What do you do?" This is the most common Spark question across all companies. The interviewer is testing your debugging methodology, not a single trick. The Debugging Sequence Check shuffle metrics FIRST. 80% of slow Spark jobs are shuffle-bound. Open the Spark UI → Stages tab → sort by shuffle write. If one stage writes 500GB of shuffle data while others write 5GB, you found the problem. Do not start with code review - start with
"You're joining a 500GB fact table with a 2GB dimension table and it's slow. How do you fix it?" The answer they want: broadcast the dimension table. But the follow-ups go deeper. Join Strategies in Spark The broadcast threshold is controlled by spark.sql.autoBroadcastJoinThreshold, default 10MB. If one side of the join is below this threshold, Spark ships the entire table to every executor, eliminating the shuffle entirely. For dimension tables up to ~1-2GB, you can force a broadcast even above
"You need to process 2TB of data daily. How do you size your Spark cluster?" This tests whether you understand memory, cores, and executors as interacting constraints rather than independent knobs. The 5-Core Rule Use 5 cores per executor. This is the well-tested sweet spot. More than 5 cores causes excessive GC pressure and HDFS throughput bottlenecks (each core opens concurrent connections). Fewer than 5 underutilizes memory. On a node with 16 cores, run 3 executors (5 cores each, 1 core reser
"Your join is fast for most keys but one key takes 10x longer. What is happening and how do you fix it?" Data skew is the single most common root cause of slow Spark jobs in production. Every interviewer expects you to handle it. Detecting Skew In the Spark UI, skew shows up as one task running dramatically longer than others in the same stage. The Tasks tab shows the min, median, and max task duration. If max is 50x the median, one partition holds massively more data than the others. Check the