PySpark Practice Problems — 35+ Problems by Category (2026)
35+ problems across 6 categories, organized by what interviewers test: DataFrame fluency for warm-ups, window functions for early-round filtering, and join optimization and production debugging for senior rounds. Each category lists the topic breakdown, difficulty, and a sample problem.
Foundation categories: DataFrame fluency
Every PySpark interview opens with these. Mistakes here end interviews before you reach the harder material.
DataFrame Transformations
These problems test whether you can manipulate DataFrames without falling back to pandas or collect(). Interviewers use them as warm-up rounds. Topics: select/filter/withColumn (4), multi-column joins (3), groupBy with multiple aggregations (3), pivot and unpivot (2). Sample problem: given a transactions DataFrame, calculate the 7-day rolling average revenue per store and flag any day where revenue drops more than 30% from the prior week.
Window Functions
15.3% of PySpark interview questions involve window functions. Interviewers test whether you understand partitionBy vs orderBy, frame boundaries, and when rank() vs row_number() changes results. Topics: row_number/rank/dense_rank (3), lag/lead/running totals (3), session detection and gaps-and-islands (2). Sample problem: for each customer, find the longest streak of consecutive days with at least one purchase. Return customer_id and streak_length.
PySpark API quick reference
Ten methods cover ~80% of practice problems. Memorize these idioms; the rest you can look up.
| Method | What it does | When you'd use it |
|---|---|---|
| F.col / F.lit | Column expressions. F.col('x') > F.lit(0) | Filtering, withColumn, agg |
| F.when().otherwise() | Conditional column. Spark equivalent of CASE WHEN. | Categorization, null handling |
| F.row_number().over(w) | Sequential row number within each partition (no ties, unlike rank). Use w = Window.partitionBy(...).orderBy(...) | Dedup, top-N per group |
| F.lag(col, 1).over(w) | Previous row's value within the window. Use for deltas and session detection. | Time series, sessionization |
| F.sum(col).over(w) | Running total when w is framed with rowsBetween(Window.unboundedPreceding, Window.currentRow). | Cumulative metrics |
| df.repartition(N, col) | Shuffle into N partitions hashed by col. Forces a full shuffle. | Pre-join layout, controlling parallelism |
| df.coalesce(N) | Reduce to N partitions WITHOUT a shuffle. Can skew if input partitions are uneven. | Reducing output files before write |
| F.broadcast(df) | Hint Spark to broadcast this DataFrame, as in big.join(F.broadcast(small), 'key'). Bypasses the 10MB auto-broadcast threshold. | Joining a small lookup table |
| df.cache() / .persist() | Memoize the DataFrame in memory (and optionally disk). Only useful if reused. | Iterative algorithms, dashboards |
| df.explain(True) | Print logical, optimized, and physical plans. Essential debugging tool. | Diagnosis, performance review |
Performance categories: where senior interviews are won
The syntax for joins, partitioning, and debugging is easy. The reasoning about physical plans, shuffle volume, and skew is what separates L4 from L5+.
Join Optimization
These problems test whether you know when Spark broadcasts (by default, when one side is estimated at under 10MB), when it shuffles, and what to do when one side has a hot key holding 15GB of data. The syntax is easy. The performance reasoning is what separates candidates. Topics: broadcast vs sort-merge selection (2), salting skewed keys (2), bucketed table design (2). Sample problem: an 800M-row events table joined with a 2M-row users table takes 2 hours instead of 20 minutes. The Spark UI shows one task stuck at 15.8GB shuffle read. Diagnose and fix.
Data Skew and Partitioning
Production Spark jobs fail because of skew more often than because of wrong logic. These problems expose whether you can read partition metrics, identify power-law distributions, and fix the data layout without breaking downstream consumers. Topics: repartition vs coalesce tradeoffs, AQE configuration and limits, small file compaction, partition pruning design. Sample problem: your nightly job writes 50,000 files to S3. Downstream Presto queries take 12 minutes to list the directory. Reduce output files without creating skewed partitions.
Production Incident Debugging
You get paged at 2am. A Spark job is breaching SLA. You see task durations, shuffle sizes, GC overhead, and the physical plan. These problems test whether you can work backward from Spark UI evidence to a root cause and a fix. Topics: executor OOM diagnosis, GC pressure from cached data, shuffle explosion from repartition, Catalyst plan regression, broadcast overflow failure. Sample problem: 199 tasks finish in 14–22 seconds. Task 200 runs for 7,140 seconds with 78% GC overhead. The job ran fine last week. What changed?
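The first step in the sample problem is quantifying the straggler. A small pure-Python sketch of the median-vs-max check, using the numbers from the sample problem, might look like this. The helper name `skew_report` and the threshold of 5x are assumptions for illustration, not Spark settings.

```python
from statistics import median


def skew_report(task_seconds, threshold=5.0):
    """Compare the slowest task to the median task duration.

    threshold is a hypothetical cut-off: a max/median ratio above it is
    treated as a sign of skew rather than ordinary variance.
    """
    med = median(task_seconds)
    worst = max(task_seconds)
    ratio = worst / med
    return {
        "median_s": med,
        "max_s": worst,
        "max_over_median": ratio,
        "skewed": ratio > threshold,
    }


# 199 tasks between 14 and 22 seconds (spread evenly here for illustration),
# plus the one task from the sample problem that ran for 7,140 seconds
durations = [14 + (i % 9) for i in range(199)] + [7140]
report = skew_report(durations)
```

A ratio in the hundreds points at one partition holding far more data than the rest, which narrows "what changed?" to the key distribution or the join plan rather than cluster health.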
Mock Interviews (AI-Scored)
Full interview simulation: read the pager context, write the fix, then defend your approach to an AI interviewer that asks follow-ups. Scored across 5 dimensions calibrated by seniority level. Phases: think (read Spark UI evidence), code (write and submit fix), discuss (defend tradeoffs), verdict (5-dimension scoring). Sample exchange: your fix uses broadcast. The AI asks 'what happens when that table grows past 10MB?' Then 'why not salt instead?' You defend your choice.
Sample solution: longest consecutive-day purchase streak
```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Problem: longest streak of consecutive days on which a customer made a purchase.
#
# Input:  purchases(customer_id, purchase_date), purchase_date assumed to be a DATE
# Output: customer_id, longest_streak

# Step 0: keep one row per customer per day. Several purchases on the same
# day would otherwise inflate row_number and split a streak.
daily = purchases.select("customer_id", "purchase_date").distinct()

# Step 1: number each purchase day per customer, ordered by date
w_rank = Window.partitionBy("customer_id").orderBy("purchase_date")
ranked = daily.withColumn("rn", F.row_number().over(w_rank))

# Step 2: subtract the row number from the date to get a constant "streak key"
# for every run of consecutive days. Gaps shift the key.
streak_key = F.expr("date_sub(purchase_date, rn)")
keyed = ranked.withColumn("streak_key", streak_key)

# Step 3: count streak length per (customer, streak_key)
streaks = (
    keyed
    .groupBy("customer_id", "streak_key")
    .agg(F.count("*").alias("streak_length"))
)

# Step 4: take the max streak per customer
result = (
    streaks
    .groupBy("customer_id")
    .agg(F.max("streak_length").alias("longest_streak"))
)

# Common interview follow-up: how would you change this if "consecutive"
# means within 7 days, not exactly 1 day? Use F.lag("purchase_date") to get
# the previous purchase date, flag rows where the datediff exceeds 7, and
# take a running F.sum of that flag over the same window as the streak id.
```

The gaps-and-islands window pattern. The row_number trick (subtract the row number from the date to create a per-streak key) appears in ~1 in 5 PySpark window problems.
What interviewers reward in PySpark rounds
Four patterns that distinguish a strong submission from a passing one. Each is independent — apply whichever fits the problem in front of you.
State the join strategy before writing code
Strong candidates say 'I'll broadcast users since it's 50MB' or 'I'll salt the events side because the top 1% of users hold most rows' BEFORE typing. Weak candidates write df.join(other, 'key') and hope. The senior interviewer is listening for the strategy declaration. Once you state it, write the code that matches.
Read the Spark UI before guessing the fix
In a debugging problem, the interviewer expects you to walk the UI: stages tab, slow stage's tasks, summary metrics (median vs max), shuffle read size, GC time. Only after you name what the evidence shows do you propose a fix. Candidates who jump to 'try cache()' without reading the UI usually fail the round.
Name the partition count explicitly
When you write repartition() or set spark.sql.shuffle.partitions, justify the number: 'I'll repartition to 200 because we have ~10GB of data and target 50MB per partition.' The default of 200 is wrong for most production data unless the arithmetic happens to land there. Arbitrary numbers like 100 or 1000 without justification signal you don't understand the underlying memory model.
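The arithmetic behind that justification is short enough to write down. This is a minimal sketch: the helper name `target_partitions` is hypothetical, sizes are in decimal megabytes to match the example above, and the 50MB target is the figure quoted in the text rather than a universal rule.

```python
import math


def target_partitions(data_mb, target_partition_mb=50):
    """Partition count so that each partition holds about target_partition_mb."""
    return max(1, math.ceil(data_mb / target_partition_mb))


# ~10GB at 50MB per partition, as in the example above
n_small = target_partitions(10_000)

# A hypothetical 500GB shuffle at the same target needs far more than the default 200
n_large = target_partitions(500_000)

# The result would then be applied with, for example:
#   spark.conf.set("spark.sql.shuffle.partitions", n_large)
#   df.repartition(n_large, "key")
```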
Explain why you didn't use the obvious option
If a candidate uses sort-merge join when broadcast would work, the interviewer asks why. Strong answer: 'broadcast would OOM the executors because the dimension table is 800MB.' Weak answer: 'I always use sort-merge.' Stating the rejected alternative shows architectural reasoning, which is what staff-track interviews score on.
Frequently asked questions
What types of PySpark problems appear in interviews?
How many problems should I solve before an interview?
Are these problems calibrated to real interviews?
Do I need a Spark cluster to practice?
How is PySpark different from Spark Scala for interviews?
The candidate who gets the offer
- 01 Active recall beats re-reading by 50%. Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02 76% of hiring managers reject on the coding task, not the resume. From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03 Five problem shapes cover 80% of data engineer loops. Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.