PySpark Practice Problems
35+ problems across 6 categories, organized by what interviewers actually test. Each category lists the topic breakdown, difficulty, and a sample problem.
PySpark Problem Categories
DataFrame Transformations
Tests whether you can manipulate DataFrames without falling back to pandas or collect(). Interviewers use these as warm-up rounds, but mistakes here end interviews early.
Given a transactions DataFrame, calculate the 7-day rolling average revenue per store. Flag any day where revenue drops more than 30% from the prior week.
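Before writing any Spark code, it can help to pin down the window semantics in plain Python. The sketch below is a minimal illustration, not a reference solution. It assumes revenue is already aggregated to one row per store per day, treats "7-day" as the current day plus the six calendar days before it, and reads "prior week" as the same store's revenue exactly seven days earlier. All three are assumptions about an ambiguous prompt, and the function and field names are hypothetical. In PySpark, the equivalent frame is usually written with Window.partitionBy(...).orderBy(...).rangeBetween(-6, 0) over a numeric day column.

```python
from datetime import date, timedelta

def rolling_revenue(daily, window_days=7, drop_threshold=0.30):
    """daily: dict mapping (store_id, day) -> revenue for that day.

    Returns (store_id, day, rolling_avg, dropped) tuples sorted by store and day.
    Days missing from `daily` are left out of the average, which mirrors a
    range-based (calendar) frame rather than a row-based one.
    """
    results = []
    for (store, day) in sorted(daily):
        # Range frame: the current day and the 6 calendar days before it.
        in_window = [
            daily[(store, day - timedelta(days=k))]
            for k in range(window_days)
            if (store, day - timedelta(days=k)) in daily
        ]
        rolling_avg = sum(in_window) / len(in_window)

        # Assumption: "prior week" means the same store exactly 7 days earlier.
        prior = daily.get((store, day - timedelta(days=7)))
        dropped = prior is not None and daily[(store, day)] < (1 - drop_threshold) * prior

        results.append((store, day, rolling_avg, dropped))
    return results

# Hypothetical data: eight days of flat revenue, then a 40% drop on day 8.
start = date(2024, 1, 1)
daily = {("s1", start + timedelta(days=i)): 100.0 for i in range(8)}
daily[("s1", start + timedelta(days=7))] = 60.0
rows = rolling_revenue(daily)
```

The point to defend in an interview is the choice of a range frame over a row frame: with rowsBetween, a store that is missing a day would silently average over more than seven calendar days.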
Window Functions
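15.3% of PySpark interview questions involve window functions. Interviewers test whether you understand partitionBy vs orderBy, frame boundaries (rowsBetween vs rangeBetween), and when choosing rank() over row_number() changes the result, which happens whenever the ordering column contains ties.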
For each customer, find the longest streak of consecutive days with at least one purchase. Return customer_id and streak_length.
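A common way to approach streak problems is the "gaps and islands" trick: after deduplicating to one row per customer per day, subtract each date's position in the sorted sequence from the date itself. Consecutive days share the same result, so that value works as a group key. The sketch below shows the idea in plain Python under those assumptions. The function name and the input shape are hypothetical. In PySpark, the position would typically come from row_number() over a window partitioned by customer_id and ordered by date.

```python
from datetime import date, timedelta
from itertools import groupby

def longest_streaks(purchases):
    """purchases: iterable of (customer_id, purchase_date); duplicates allowed.

    Returns {customer_id: streak_length}.
    """
    by_customer = {}
    for customer_id, day in purchases:
        # A set keeps one entry per day even with several purchases that day.
        by_customer.setdefault(customer_id, set()).add(day)

    result = {}
    for customer_id, days in by_customer.items():
        ordered = sorted(days)
        # Island key: date minus its 0-based position in the sorted list.
        # The key stays constant across a run of consecutive days.
        keyed = [(day - timedelta(days=i), day) for i, day in enumerate(ordered)]
        result[customer_id] = max(
            sum(1 for _ in group)
            for _, group in groupby(keyed, key=lambda pair: pair[0])
        )
    return result

# Hypothetical data: "a" buys on Jan 1 (twice), 2, 3 and 5; "b" on Jan 10 and 12.
purchases = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 1, 1)), ("a", date(2024, 1, 2)),
    ("a", date(2024, 1, 3)), ("a", date(2024, 1, 5)),
    ("b", date(2024, 1, 10)), ("b", date(2024, 1, 12)),
]
streaks = longest_streaks(purchases)
```

Deduplicating first matters: two purchases on the same day would otherwise shift every later position by one and split a valid streak.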
Join Optimization
Tests whether you know when Spark broadcasts a table (by default, when its estimated size is under the 10MB spark.sql.autoBroadcastJoinThreshold), when it shuffles, and what to do when one side has a hot key holding 15GB of data. The syntax is easy; the performance reasoning is what separates candidates.
An 800M-row events table joined with a 2M-row users table takes 2 hours instead of 20 minutes. The Spark UI shows one task stuck at 15.8GB shuffle read. Diagnose and fix.
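One task with a far larger shuffle read than its peers usually points to a hot join key. Salting is one possible fix, sketched below in plain Python with toy data: the large side gets a salt appended to its key, and the small side is replicated once per salt so every salted key still finds its match. The row counts, the salt count of 8, and all names are hypothetical. On Spark 3.x, adaptive query execution (spark.sql.adaptive.skewJoin.enabled) may handle the same case without manual salting.

```python
from collections import Counter

def salted_join(events, users, num_salts=8):
    """events: list of (user_id, payload); users: dict user_id -> name.

    Returns (joined_rows, rows_per_salted_key).
    """
    # Large side: append a salt to each key. Round-robin keeps this sketch
    # reproducible; a random salt per row is the usual choice in practice.
    salted_events = [
        ((uid, i % num_salts), payload) for i, (uid, payload) in enumerate(events)
    ]
    # Small side: replicate every row once per salt value.
    salted_users = {
        (uid, s): name for uid, name in users.items() for s in range(num_salts)
    }
    joined = [
        (key[0], payload, salted_users[key])
        for key, payload in salted_events
        if key in salted_users
    ]
    rows_per_key = Counter(key for key, _ in salted_events)
    return joined, rows_per_key

# Toy skew: one hot user owns 9,000 of 10,000 event rows.
events = [("hot", i) for i in range(9000)] + [(f"u{i}", i) for i in range(1000)]
users = {"hot": "Hot User", **{f"u{i}": f"User {i}" for i in range(1000)}}

plain = [(uid, payload, users[uid]) for uid, payload in events if uid in users]
joined, rows_per_key = salted_join(events, users)
```

In this toy run the largest group of rows sharing one join key falls from 9,000 to 1,125, while the joined output is unchanged. The cost is that the small side grows by a factor of num_salts, which is why salting suits a large skewed table joined to a much smaller one.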
Data Skew and Partitioning
Production Spark jobs fail because of skew more often than because of wrong logic. These problems expose whether you can read partition metrics, identify power-law distributions, and fix the data layout without breaking downstream consumers.
Your nightly job writes 50,000 files to S3. Downstream Presto queries take 12 minutes to list the directory. Reduce output files without creating skewed partitions.
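A reasonable first step is to work out how many files the output should have. The helper below is a minimal sketch: the 128 MiB target file size and the 200 GiB nightly volume are assumptions chosen for illustration, since neither appears in the problem statement.

```python
import math

def target_file_count(total_bytes, target_file_bytes=128 * 1024**2):
    """Number of output files needed to average roughly target_file_bytes each."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Hypothetical nightly volume of 200 GiB.
nightly_bytes = 200 * 1024**3
num_files = target_file_count(nightly_bytes)  # 1600, down from 50,000
```

How that count is applied decides whether the output is skewed. DataFrame.repartition(n) performs a full shuffle and spreads rows evenly, whereas DataFrame.coalesce(n) avoids the shuffle by merging existing partitions and can therefore produce uneven file sizes when the inputs were already unbalanced.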
Production Incident Debugging
You get paged at 2am. A Spark job is breaching SLA. You see task durations, shuffle sizes, GC overhead, and the physical plan. These problems test whether you can work backward from Spark UI evidence to a root cause and a fix.
199 tasks finish in 14-22 seconds. Task 200 runs for 7,140 seconds with 78% GC overhead. The job ran fine last week. What changed?
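The numbers in this problem can be turned into a quick skew check: compare each task's duration with the stage median and flag large multiples. The sketch below reconstructs durations that match the description (199 tasks between 14 and 22 seconds, one at 7,140 seconds). The 5x threshold and the function name are hypothetical choices, not Spark defaults.

```python
from statistics import median

def find_stragglers(durations_s, ratio=5.0):
    """Return (task_index, duration_s, multiple_of_median) for slow tasks."""
    med = median(durations_s)
    return [
        (i, d, d / med)
        for i, d in enumerate(durations_s)
        if d > ratio * med
    ]

# 199 tasks spread across 14-22 s, plus one 7,140 s task.
durations = [14 + (i % 9) for i in range(199)] + [7140]
stragglers = find_stragglers(durations)
```

A single task running hundreds of times longer than the median, with high GC overhead, is consistent with one partition receiving far more data than the others. One plausible explanation for "it ran fine last week" is a key whose row count grew sharply, which a per-key count on the join or group-by column would confirm or rule out.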
Mock Interviews (AI-Scored)
Full interview simulation: read the pager context, write the fix, then defend your approach to an AI interviewer that asks follow-ups. Scored across 5 dimensions calibrated by seniority level.
Your fix uses broadcast. The AI asks: 'What happens when that table grows past 10MB?' Then: 'Why not salt instead?' You defend your choice.
Frequently Asked Questions
What types of PySpark problems appear in interviews?
How many problems should I solve before an interview?
Are these problems calibrated to real interviews?
Continue your prep
Data Engineer Interview Prep: explore the full guide
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 921 companies, collected from real candidates.
Interview Rounds
By Company
- Stripe Data Engineer Interview
- Airbnb Data Engineer Interview
- Uber Data Engineer Interview
- Netflix Data Engineer Interview
- Databricks Data Engineer Interview
- Snowflake Data Engineer Interview
- Lyft Data Engineer Interview
- DoorDash Data Engineer Interview
- Instacart Data Engineer Interview
- Robinhood Data Engineer Interview
- Pinterest Data Engineer Interview
- Twitter/X Data Engineer Interview
By Role
- Senior Data Engineer Interview
- Staff Data Engineer Interview
- Principal Data Engineer Interview
- Junior Data Engineer Interview
- Entry-Level Data Engineer Interview
- Analytics Engineer Interview
- ML Data Engineer Interview
- Streaming Data Engineer Interview
- GCP Data Engineer Interview
- AWS Data Engineer Interview
- Azure Data Engineer Interview