PySpark Practice Problems — 35+ Problems by Category (2026)

35+ problems across 6 categories, organized by what interviewers test: DataFrame fluency for warm-ups, window functions for early-round filtering, and join optimization and production debugging for senior rounds. Each category lists the topic breakdown, difficulty, and a sample problem.

35+ PySpark practice problems

6 topical categories

L5–L7 seniority calibration

4-phase AI-scored mock interview

Foundation categories: DataFrame fluency

Every PySpark interview opens with these. Mistakes here end interviews before you reach the harder material.

12 problems · Easy–Medium

DataFrame Transformations

Whether you can manipulate DataFrames without falling back to pandas or collect(). Interviewers use these as warm-up rounds, but mistakes here end interviews early. Topics: select/filter/withColumn (4), multi-column joins (3), groupBy with multiple aggregations (3), pivot and unpivot (2). Sample problem: given a transactions DataFrame, calculate the 7-day rolling average revenue per store and flag any day where revenue drops more than 30% from the prior week.

Most-tested: groupBy with multiple aggs

8 problems · Medium

Window Functions

15.3% of PySpark interview questions involve window functions. Interviewers test whether you understand partitionBy vs orderBy, frame boundaries, and when rank() vs row_number() changes results. Topics: row_number/rank/dense_rank (3), lag/lead/running totals (3), session detection and gaps-and-islands (2). Sample problem: for each customer, find the longest streak of consecutive days with at least one purchase. Return customer_id and streak_length.

Gaps-and-islands is the highest-leverage pattern

PySpark API quick reference

Ten methods cover ~80% of practice problems. Memorize these idioms; the rest you can look up.

Method	What it does	When you'd use it
F.col / F.lit	Column expressions. F.col('x') > F.lit(0)	Filtering, withColumn, agg
F.when().otherwise()	Conditional column. Spark equivalent of CASE WHEN.	Categorization, null handling
F.row_number().over(w)	Sequential row number within each partition, unique even on ties. Use w = Window.partitionBy(...).orderBy(...)	Dedup, top-N per group
F.lag(col, 1).over(w)	Previous row's value within the window. Use for deltas and session detection.	Time series, sessionization
F.sum(col).over(w)	Running total. Give w a frame of .rowsBetween(Window.unboundedPreceding, Window.currentRow).	Cumulative metrics
df.repartition(N, col)	Shuffle into N partitions hashed by col. Forces a full shuffle.	Pre-join layout, controlling parallelism
df.coalesce(N)	Reduce to N partitions WITHOUT a shuffle. Can skew if input partitions are uneven.	Reducing output files before write
F.broadcast(df)	Hint Spark to broadcast this DataFrame in a join. Bypasses the 10MB auto-broadcast threshold.	Joining a small lookup table
df.cache() / .persist()	Memoize the DataFrame in memory (and optionally disk). Only useful if reused.	Iterative algorithms, dashboards
df.explain(True)	Print logical, optimized, and physical plans. Essential debugging tool.	Diagnosis, performance review

Performance categories: where senior interviews are won

The syntax for joins, partitioning, and debugging is easy. The reasoning about physical plans, shuffle volume, and skew is what separates L4 from L5+.

6 problems · Medium–Hard

Join Optimization

Whether you know when Spark broadcasts (table under 10MB), when it shuffles, and what to do when one side has a hot key holding 15GB of data. The syntax is easy. The performance reasoning is what separates candidates. Topics: broadcast vs sort-merge selection (2), salting skewed keys (2), bucketed table design (2). Sample problem: an 800M-row events table joined with a 2M-row users table takes 2 hours instead of 20 minutes. The Spark UI shows one task stuck at 15.8GB shuffle read. Diagnose and fix.

Salting is the highest-yield optimization

4 problems · Hard

Data Skew and Partitioning

Production Spark jobs fail because of skew more often than because of wrong logic. These problems expose whether you can read partition metrics, identify power-law distributions, and fix the data layout without breaking downstream consumers. Topics: repartition vs coalesce tradeoffs, AQE configuration and limits, small file compaction, partition pruning design. Sample problem: your nightly job writes 50,000 files to S3. Downstream Presto queries take 12 minutes to list the directory. Reduce output files without creating skewed partitions.

AQE doesn't fix everything

5 problems · Hard

Production Incident Debugging

You get paged at 2am. A Spark job is breaching SLA. You see task durations, shuffle sizes, GC overhead, and the physical plan. These problems test whether you can work backward from Spark UI evidence to a root cause and a fix. Topics: executor OOM diagnosis, GC pressure from cached data, shuffle explosion from repartition, Catalyst plan regression, broadcast overflow failure. Sample problem: 199 tasks finish in 14–22 seconds. Task 200 runs for 7,140 seconds with 78% GC overhead. The job ran fine last week. What changed?

Diagnosis is the differentiator at L6+

L5–L7 · Multi-phase

Mock Interviews (AI-Scored)

Full interview simulation: read the pager context, write the fix, then defend your approach to an AI interviewer that asks follow-ups. Scored across 5 dimensions calibrated by seniority level. Phases: think (read Spark UI evidence), code (write and submit fix), discuss (defend tradeoffs), verdict (5-dimension scoring). Sample exchange: your fix uses broadcast. The AI asks 'what happens when that table grows past 10MB?' Then 'why not salt instead?' You defend your choice.

Closest format to real onsite loops

Sample solution: longest consecutive-day purchase streak

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Problem: longest streak of consecutive days a customer made a purchase.
#
# Input: purchases(customer_id, purchase_date)
# Output: customer_id, longest_streak

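# Step 1: keep one row per customer per day, then rank by date.
# Without the dedup, two purchases on the same day would split a streak.
w_rank = Window.partitionBy("customer_id").orderBy("purchase_date")
ranked = (
    purchases
    .select("customer_id", "purchase_date")
    .distinct()
    .withColumn("rn", F.row_number().over(w_rank))
)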

# Step 2: subtract row number from date to get a constant "streak key"
# for every consecutive day. Gaps shift the key.
streak_key = F.expr("date_sub(purchase_date, rn)")
keyed = ranked.withColumn("streak_key", streak_key)

# Step 3: count streak length per (customer, streak_key)
streaks = (
    keyed
    .groupBy("customer_id", "streak_key")
    .agg(F.count("*").alias("streak_length"))
)

# Step 4: take the max streak per customer
result = (
    streaks
    .groupBy("customer_id")
    .agg(F.max("streak_length").alias("longest_streak"))
)

# Common interview follow-up: how would you change this if "consecutive"
# means a gap of at most 7 days, not exactly 1 day? Use F.lag("purchase_date")
# to get the previous date, flag rows where F.datediff(...) > 7 with
# F.when(...).otherwise(0), and take a running F.sum(flag).over(w) as the streak id.

The gaps-and-islands window pattern. The row_number trick (subtract from date to create a per-streak key) appears in ~1 in 5 PySpark window problems.

What interviewers reward in PySpark rounds

Four patterns that distinguish a strong submission from a passing one. Each is independent — apply whichever fits the problem in front of you.

Joins

State the join strategy before writing code

Strong candidates say 'I'll broadcast users since it's 50MB' or 'I'll salt the events side because the top 1% of users hold most rows' BEFORE typing. Weak candidates write df.join(other, 'key') and hope. The senior interviewer is listening for the strategy declaration. Once you state it, write the code that matches.

Strategy first, syntax second

Debugging

Read the Spark UI before guessing the fix

In a debugging problem, the interviewer expects you to walk the UI: stages tab, slow stage's tasks, summary metrics (median vs max), shuffle read size, GC time. Only after you name what the evidence shows do you propose a fix. Candidates who jump to 'try cache()' without reading the UI usually fail the round.

Evidence then hypothesis then fix

Performance

Name the partition count explicitly

When you write repartition() or set spark.sql.shuffle.partitions, justify the number. 'I'll repartition to 200 because we have ~10GB of data and target 50MB per partition.' The default of 200 is wrong for most production data unless the arithmetic happens to land on it, as it does here. Arbitrary numbers like 100 or 1000 without justification signal you don't understand the underlying memory model.

Target 50–200MB per partition
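
The arithmetic behind that justification is simple enough to sketch. The helper name and its parameters are hypothetical, and sizes are in decimal megabytes to match the ~10GB example.

```python
import math

def shuffle_partitions(data_mb: float, target_mb_per_partition: float) -> int:
    """Partition count that puts roughly target_mb_per_partition in each partition."""
    return max(1, math.ceil(data_mb / target_mb_per_partition))

# The example from the text: ~10 GB of data at 50 MB per partition.
n = shuffle_partitions(10_000, 50)         # 200

# The same data at the upper end of the range, 200 MB per partition.
n_large = shuffle_partitions(10_000, 200)  # 50
```

The result is what you would pass to repartition() or set as spark.sql.shuffle.partitions, along with the one-sentence reasoning that produced it.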

Tradeoffs

Explain why you didn't use the obvious option

If a candidate uses sort-merge join when broadcast looks like the obvious choice, the interviewer asks why. Strong answer: 'broadcast would OOM the executors because the dimension table is 800MB.' Weak answer: 'I always use sort-merge.' Stating the rejected alternative shows architectural reasoning, which is what staff-track interviews score on.

Volunteer the tradeoff

Frequently asked questions

What types of PySpark problems appear in interviews?

DataFrame transformations and window functions appear in nearly every PySpark interview. Senior roles (L5+) add join optimization with skew handling, production debugging from Spark UI evidence, and system design for pipelines processing 100+ PB. The ratio shifts toward debugging and design as seniority increases.

How many problems should I solve before an interview?

Solve at least 3–4 per category. The goal is pattern recognition, not volume. After 4 window function problems, you should recognize the partitionBy/orderBy frame pattern instantly. After 3 join optimization problems, you should know when to broadcast, when to salt, and when to bucket without thinking.

Are these problems calibrated to real interviews?

Yes. The difficulty and topic mix reflect actual interview loops at companies using Spark at scale. DataFrame and window problems dominate early rounds. Production debugging and system design dominate senior rounds.

Do I need a Spark cluster to practice?

No. PySpark runs in local mode on a laptop: pip install pyspark, then SparkSession.builder.master('local[*]').getOrCreate(). Real-scale problems (skew, OOM, GC) need cluster context to reproduce, but the API mechanics and physical-plan reading work identically locally. Most of the practice value is in the DataFrame fluency, which a laptop fully supports.

How is PySpark different from Spark Scala for interviews?

The DataFrame API is essentially identical: same methods, same semantics, same physical plans. Two real differences for interviews: (1) Python UDFs serialize each row to a Python process, which can cost 10–100x more than a Scala UDF; (2) the typed Dataset API exists only in Scala and Java, and some libraries, such as Spark ML model serving, are richer in Scala. The Catalyst optimizer, Tungsten encoding, and execution engine are language-agnostic. Choose whichever your target company uses.

Why practice

The candidate who gets the offer

01 · Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

02 · 76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

03 · Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
