Problem Index

PySpark Practice Problems

35+ problems across 6 categories, organized by what interviewers actually test. Each category lists the topic breakdown, difficulty, and a sample problem.


PySpark Problem Categories

DataFrame Transformations

Easy to Medium | 12 problems

These problems test whether you can manipulate DataFrames without falling back to pandas or collect(). Interviewers use them as warm-up rounds, but mistakes here end interviews early.

select, filter, withColumn (4 problems)
Multi-column joins (3 problems)
groupBy with multiple aggregations (3 problems)
Pivot and unpivot (2 problems)

Given a transactions DataFrame, calculate the 7-day rolling average revenue per store. Flag any day where revenue drops more than 30% from the prior week.

Window Functions

Medium | 8 problems

15.3% of PySpark interview questions involve window functions. Interviewers test whether you understand partitionBy vs orderBy, frame boundaries, and when rank() vs row_number() changes results.

row_number, rank, dense_rank (3 problems)
lag, lead, running totals (3 problems)
Session detection and gaps (2 problems)

For each customer, find the longest streak of consecutive days with at least one purchase. Return customer_id and streak_length.

Join Optimization

Medium to Hard | 6 problems

These problems test whether you know when Spark broadcasts (table under the 10MB default threshold), when it shuffles, and what to do when one side has a hot key holding 15GB of data. The syntax is easy; the performance reasoning is what separates candidates.

Broadcast vs sort-merge selection (2 problems)
Salting skewed keys (2 problems)
Bucketed table design (2 problems)

An 800M-row events table joined with a 2M-row users table takes 2 hours instead of 20 minutes. The Spark UI shows one task stuck at 15.8GB shuffle read. Diagnose and fix.

Data Skew and Partitioning

Hard | 4 problems

Production Spark jobs fail because of skew more often than because of wrong logic. These problems expose whether you can read partition metrics, identify power-law distributions, and fix the data layout without breaking downstream consumers.

Repartition vs coalesce tradeoffs (1 problem)
AQE configuration and limits (1 problem)
Small file compaction (1 problem)
Partition pruning design (1 problem)

Your nightly job writes 50,000 files to S3. Downstream Presto queries take 12 minutes to list the directory. Reduce output files without creating skewed partitions.

Production Incident Debugging

Hard | 5 problems

You get paged at 2am. A Spark job is breaching SLA. You see task durations, shuffle sizes, GC overhead, and the physical plan. These problems test whether you can work backward from Spark UI evidence to a root cause and a fix.

Executor OOM diagnosis (1 problem)
GC pressure from cached data (1 problem)
Shuffle explosion from repartition (1 problem)
Catalyst plan regression (1 problem)
Broadcast overflow failure (1 problem)

199 tasks finish in 14-22 seconds. Task 200 runs for 7,140 seconds with 78% GC overhead. The job ran fine last week. What changed?

Mock Interviews (AI-Scored)

L5 to L7 | Multi-phase

Full interview simulation: read the pager context, write the fix, then defend your approach to an AI interviewer that asks follow-ups. Scored across 5 dimensions calibrated by seniority level.

Think phase: read Spark UI evidence
Code phase: write and submit fix
Discuss phase: defend tradeoffs
Verdict: 5-dimension scoring

Your fix uses broadcast. The AI asks: "What happens when that table grows past 10MB?" Then: "Why not salt instead?" You defend your choice.

Frequently Asked Questions

What types of PySpark problems appear in interviews?
DataFrame transformations and window functions appear in nearly every PySpark interview. Senior roles (L5+) add join optimization with skew handling, production debugging from Spark UI evidence, and system design for pipelines processing 100+ PB. The ratio shifts toward debugging and design as seniority increases.
How many problems should I solve before an interview?
Solve at least 3-4 per category. The goal is pattern recognition, not volume. After 4 window function problems, you should recognize the partitionBy/orderBy frame pattern instantly. After 3 join optimization problems, you should know when to broadcast, when to salt, and when to bucket without thinking.
Are these problems calibrated to real interviews?
Yes. The difficulty and topic mix reflect actual interview loops at companies using Spark at scale. DataFrame and window problems dominate early rounds. Production debugging and system design dominate senior rounds.