The technical gauntlet every pipeline interview hits
"Explain Spark architecture" at the advanced level means explaining where the standard model breaks down. Adaptive Query Execution, dynamic allocation, and speculative tasks are the three mechanisms that override the static plan, and each can go wrong.

Adaptive Query Execution (AQE)

AQE re-optimizes the query plan at shuffle boundaries using runtime statistics. After each stage completes, Spark collects actual partition sizes and row counts, then replans the remaining stages. Three optimizations run here: coalescing many small shuffle partitions into fewer right-sized ones, switching a sort-merge join to a broadcast join when one side turns out to be small, and splitting skewed partitions so a single oversized task cannot dominate the stage.
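The coalescing step is the easiest of the three to make concrete. Here is a minimal Python sketch (not Spark's implementation) of the idea: given post-shuffle partition sizes from runtime statistics, greedily merge adjacent partitions toward a target size. The 64 MB target is an illustrative value.

```python
def coalesce_partitions(sizes, target_bytes=64 * 1024 * 1024):
    """Greedily merge adjacent shuffle partitions until each group
    approaches the target size, mirroring AQE's coalescing rule.
    Returns a list of (start_index, end_index_exclusive) groups."""
    groups, start, acc = [], 0, 0
    for i, size in enumerate(sizes):
        acc += size
        if acc >= target_bytes:          # group is big enough: close it
            groups.append((start, i + 1))
            start, acc = i + 1, 0
    if start < len(sizes):               # flush the trailing partial group
        groups.append((start, len(sizes)))
    return groups

# 1000 tiny 1 MB shuffle partitions collapse into 16 tasks instead of 1000
mb = 1024 * 1024
print(len(coalesce_partitions([1 * mb] * 1000)))  # → 16
```

The payoff is fewer tasks with less per-task scheduling overhead, which is why AQE only makes this decision after the shuffle write, when the real sizes are known.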
The debugging question escalates from "use the Spark UI" to "walk me through your systematic methodology when the Spark UI isn't enough." Flame graphs, GC analysis, and spill forensics are the tools that separate debugging from guessing.

The Systematic Methodology

GC Analysis

Spark defaults to G1GC for executors. When the old generation fills, G1 triggers full GC pauses that halt all task execution. A healthy executor spends under 5% of its time in GC. Above 10%, you are memory-starved. Above 20%, the executor is effectively thrashing: it spends more time reclaiming memory than running tasks.
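A small triage helper makes those thresholds operational. The 5%/10%/20% cutoffs come from the text; the intermediate "watch" label for the 5-10% band is an illustrative assumption, and the inputs are the GC time and task time you would read off the Spark UI's Executors tab.

```python
def gc_health(gc_ms, task_ms):
    """Classify an executor by the fraction of task time spent in GC,
    using the 5% / 10% / 20% thresholds described above."""
    frac = gc_ms / task_ms
    if frac < 0.05:
        return "healthy"
    if frac < 0.10:
        return "watch"           # gray zone: trend it, don't page on it
    if frac < 0.20:
        return "memory-starved"  # raise executor memory or shrink partitions
    return "thrashing"           # executor mostly collecting garbage

print(gc_health(2_000, 100_000))   # 2% of time in GC → healthy
print(gc_health(25_000, 100_000))  # 25% of time in GC → thrashing
```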
The join question goes beyond "use a broadcast join." The interviewer wants you to reason about shuffle internals, Tungsten memory management, and when AQE's automatic strategy switching helps or hurts.

Sort-Merge Join Internals

Sort-merge join is Spark's default for two large tables. Both sides are shuffled by the join key, then each partition is sorted. The merge phase walks both sorted partitions with two pointers, producing matches in O(n + m). The cost: two full shuffles plus two sorts.
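The two-pointer merge phase can be sketched in plain Python. This is the O(n + m) core of the algorithm on already-shuffled, already-sorted partitions; runs of duplicate keys are handled by emitting the cross product of the two matching runs, which is where the extra "+ matches" cost hides.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, each sorted by key.
    Advances two pointers; for equal keys, emits the cross product
    of the matching runs. Runs in O(n + m + matches)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # find the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                j2 += 1
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out

print(merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")]))
# → [(2, 'b', 'x'), (2, 'c', 'x')]
```

Note that neither pointer ever moves backward, which is why the merge itself is linear; the expensive parts are the shuffles and sorts that set it up.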
Cluster sizing becomes cost modeling. The question shifts from "how many executors" to "how do you minimize $/query while meeting the SLA?" Dynamic allocation, spot instances, cluster pooling, and chargeback models are the dimensions.

Dynamic Allocation in Practice

Dynamic allocation requests executors when tasks are pending and releases them after an idle timeout (default 60s). The scaling is reactive, not predictive: there is a 30-60 second lag between demand and allocation because YARN or Kubernetes must provision containers and the new executors must register with the driver before they accept tasks.
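The $/query framing reduces to a small model. The node prices, runtimes, and query counts below are hypothetical, and a real sizing exercise also has to price in spot interruption risk and the allocation lag just described; the point is that the cheapest config per query is often not the fastest one that still meets the SLA.

```python
def cost_per_query(executors, price_per_exec_hour, runtime_s, queries):
    """Total executor spend for a window divided by queries served."""
    hours = runtime_s / 3600
    return executors * price_per_exec_hour * hours / queries

# Two hypothetical configs, both inside a 10-minute SLA:
#   big:   40 executors at $0.20/hr, finishing 100 queries in 300s
#   small: 10 executors at $0.20/hr, finishing 100 queries in 540s
big = cost_per_query(40, 0.20, 300, 100)
small = cost_per_query(10, 0.20, 540, 100)
print(f"big=${big:.4f}/query  small=${small:.4f}/query")
```

Here the smaller cluster is slower but cheaper per query and still meets the SLA, which is exactly the trade-off the interviewer wants you to surface.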
The skew question goes beyond salting. The interviewer wants to see custom partitioners, two-phase aggregation, pre-processing pipelines that eliminate skew before it reaches the join, and the judgment to know which approach fits which situation.

Adaptive Skew Handling (AQE)

AQE's skew join optimization detects skewed partitions after the shuffle write and automatically splits them. It compares each partition's size to the median: if a partition exceeds skewedPartitionFactor × median AND the skewedPartitionThresholdInBytes byte floor, it is split into smaller chunks, each joined against a full copy of the matching partition from the other side.
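Both detection conditions can be sketched directly. The factor of 5 and the 256 MB floor below mirror the defaults for the spark.sql.adaptive.skewJoin.* settings, and the 64 MB split target mirrors the advisory partition size; statistics.median stands in for Spark's internal median computation.

```python
import statistics

def find_skewed(sizes, factor=5, threshold=256 * 1024 * 1024):
    """Return indices of partitions AQE would treat as skewed:
    larger than factor x median AND larger than the byte floor."""
    med = statistics.median(sizes)
    return [i for i, s in enumerate(sizes)
            if s > factor * med and s > threshold]

def split_count(size, target=64 * 1024 * 1024):
    """How many sub-partitions a skewed partition is split into
    (ceiling division toward the target split size)."""
    return -(-size // target)

mb = 1024 * 1024
sizes = [50 * mb] * 9 + [2048 * mb]   # one 2 GB partition among 50 MB peers
skewed = find_skewed(sizes)
print(skewed, split_count(sizes[skewed[0]]))  # → [9] 32
```

The byte floor is what keeps AQE from "fixing" skew on a tiny dataset where one partition is 10× the median but still only a few megabytes: splitting there would add task overhead for no benefit.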