Spark Mock Interview
4 phases. Production incident debugging with Spark UI evidence. Write the fix, defend your approach, get scored across 5 dimensions. Calibrated from L3 to L7.
Four Interview Phases
Think
5 min. You get paged. Read the incident context: Spark UI task durations, shuffle sizes, GC overhead, executor memory, and the physical plan. Formulate your diagnosis before writing code.
Tests whether you read evidence before coding. Jumping straight to a fix is the most common L3 mistake.
Code
15 min. Write your fix in PySpark or Scala. The code is checked against fix-detection markers: must-contain patterns, optional improvements, and antipatterns like collect() or repartition(1).
Tests whether your fix targets the actual bottleneck. A correct diagnosis with a wrong fix still fails.
Discuss
10 min. The AI interviewer asks follow-ups one at a time: 'What happens when the table doubles?' 'Why not just add more executors?' Each question probes a different dimension.
Tests whether you can reason about tradeoffs and edge cases beyond the immediate fix.
Verdict
Review. See the optimal fix, the ground-truth diagnosis, and your scores across 5 dimensions. Calibrated from L3 (junior) to L7 (staff).
Identifies specific gaps: diagnosis speed, code correctness, tradeoff reasoning, failure mode awareness.
Spark Incident Scenarios
Each scenario is a production incident with unique Spark UI evidence. These are the 5 failure patterns that cause the most SLA breaches in production Spark clusters.
Data skew on power-law keys
One partition holds 320M rows while others hold 3-4M. Task 200 runs for 7,140 seconds.
Broadcast overflow
The table silently grew past the 10 MB autoBroadcastJoinThreshold default. OOM on the driver while collecting the table for broadcast.
Shuffle explosion
Repartition before join multiplied shuffle volume by 50x. Network saturated.
Executor OOM from cached data
100GB cached table competes with execution memory. Unified pool (60% of heap) cannot serve both.
Catalyst plan regression
CBO statistics went stale. Spark picked a sort-merge join instead of a broadcast join. Runtime went from 8 min to 2 hours.
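Several of these patterns have first-line mitigations in Spark 3.x session configuration. A minimal sketch, assuming Spark 3.x; the threshold value is illustrative, not a recommendation:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("incident-mitigation-sketch")
    # AQE re-plans at runtime and can split oversized shuffle partitions
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Raise the auto-broadcast cutoff past the 10 MB default
    # (or set -1 to disable auto-broadcast entirely)
    .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
    .getOrCreate()
)
```

None of these settings replaces a diagnosis: AQE skew handling helps shuffle-side skew, but a broadcast overflow or stale CBO statistics still needs an explicit fix.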
Scoring Dimensions
Scored on 5 dimensions. A junior and a senior giving the same answer receive different scores because expectations scale with level.
Problem Solving
Systematic diagnosis using Spark UI evidence. Did you identify the root cause before coding?
Technical Execution
Correct fix targeting the actual bottleneck: skew, shuffle strategy, join type, or memory configuration.
Communication
Clear articulation of why the fix works and what tradeoffs it introduces.
Verification
Consideration of failure modes. What if the broadcast table grows past 10MB? What if skew shifts to a different key?
Requirements Understanding
Fix meets SLA and operational constraints. Does not introduce new failure modes or break downstream consumers.
Sample: Skewed Viewing Events Pipeline
Pager Alert
viewing_engagement job runtime 135 min, SLA 60 min. Stage 2 SortMergeJoin stuck.
Spark UI Evidence
199 tasks: 14-22 seconds, 150-225 MB shuffle read. Task 200: 7,140 seconds, 320M rows, 15.8 GB shuffle read, 78% GC overhead.
Root Cause
Power-law skew. Top 1% of subscribers generate 40% of events. One partition holds 320M rows while others hold 3-4M.
Fix
from pyspark.sql import functions as F

# Before: SortMergeJoin; the skewed key stalls one task (135 min)
joined = events.join(users, "user_id")

# After: BroadcastHashJoin ships users to every executor; no shuffle of events (12 min)
joined = events.join(F.broadcast(users), "user_id")
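The broadcast fix assumes users fits comfortably in executor memory. When it does not, key salting is a common fallback: append a random suffix to the hot key so its rows spread across several shuffle partitions (the small side is then duplicated once per salt). A sketch of the mechanics in plain Python; NUM_SALTS and the simulated key counts are illustrative assumptions:

```python
import random
from collections import Counter

NUM_SALTS = 8  # assumption: split each hot key across 8 sub-keys

def salt_key(user_id: str) -> str:
    """Append a random salt so one hot key maps to NUM_SALTS join keys."""
    return f"{user_id}_{random.randrange(NUM_SALTS)}"

# Simulate the skewed stage: one power-law user dominates the event stream
events = ["hot_user"] * 800 + [f"user_{i}" for i in range(200)]
salted = Counter(salt_key(u) for u in events)

# The hot key's 800 rows now land in NUM_SALTS separate buckets
hot_buckets = [c for k, c in salted.items() if k.startswith("hot_user_")]
```

In Spark the same idea means adding a salt column to the large side and exploding the small side across all salt values before joining on (key, salt); it trades extra rows on the small side for even task sizes on the large one.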