PySpark Practice Problems
35+ problems across 6 categories, organized by what interviewers actually test. Each category lists the topic breakdown, difficulty, and a sample problem.
PySpark Problem Categories
DataFrame Transformations
Whether you can manipulate DataFrames without falling back to pandas or collect(). Interviewers use these as warm-up rounds, but mistakes here end interviews early.
Given a transactions DataFrame, calculate the 7-day rolling average revenue per store. Flag any day where revenue drops more than 30% below the prior week's average.
Window Functions
15.3% of PySpark interview questions involve window functions. Interviewers test whether you understand partitionBy vs orderBy, frame boundaries, and when rank() vs row_number() changes results.
For each customer, find the longest streak of consecutive days with at least one purchase. Return customer_id and streak_length.
Join Optimization
Whether you know when Spark broadcasts (table under 10MB), when it shuffles, and what to do when one side has a hot key holding 15GB of data. The syntax is easy. The performance reasoning is what separates candidates.
An 800M-row events table joined with a 2M-row users table takes 2 hours instead of 20 minutes. The Spark UI shows one task stuck at 15.8GB shuffle read. Diagnose and fix.
Data Skew and Partitioning
Production Spark jobs fail because of skew more often than because of wrong logic. These problems expose whether you can read partition metrics, identify power-law distributions, and fix the data layout without breaking downstream consumers.
Your nightly job writes 50,000 files to S3. Downstream Presto queries take 12 minutes to list the directory. Reduce output files without creating skewed partitions.
Production Incident Debugging
You get paged at 2am. A Spark job is breaching SLA. You see task durations, shuffle sizes, GC overhead, and the physical plan. These problems test whether you can work backward from Spark UI evidence to a root cause and a fix.
199 tasks finish in 14-22 seconds. Task 200 runs for 7,140 seconds with 78% GC overhead. The job ran fine last week. What changed?
Mock Interviews (AI-Scored)
Full interview simulation: read the pager context, write the fix, then defend your approach to an AI interviewer that asks follow-ups. Scored across 5 dimensions calibrated by seniority level.
Your fix uses broadcast. The AI asks: 'What happens when that table grows past 10MB?' Then: 'Why not salt instead?' You defend your choice.