PySpark Interview Practice (Coding + Scenarios)
A PySpark interview is 2 different rounds wedged into 1. Round 1 is coding: write a DataFrame query that produces the right output. Round 2 is diagnosis: read a Spark UI screenshot, figure out why the job is slow, propose a fix. Almost every PySpark prep site addresses round 1. The scenarios below address round 2.
Know PySpark the way the interviewer who asks it knows it.
Coding vs scenario vs design, by seniority level
PySpark interviews shift toward debugging and design as seniority rises. Treat the mix as a rough guide to what to expect at each level.
Coding mode: 6 patterns interviewers reach for
Each problem has a multi-seed grader. Tiebreakers and frame clauses get checked explicitly.
Given a 50M-row transactions DataFrame, derive is_weekend, filter to weekend rows, sort by amount DESC, return the top 100.
select, filter, withColumn, orderBy, limit
Per category and month, compute total revenue, distinct customer count, and percentage of revenue from new customers in a single agg().
agg, F.when, F.countDistinct, named columns
For each customer, return the 3 highest-value orders. Tiebreak on ordered_at DESC, then order_id ASC. The grader checks ties explicitly.
Window.partitionBy, orderBy with tiebreaker, row_number
Group events into 30-minute timeout sessions per user. Return user_id, session_id, start, end, event_count.
F.lag, F.when, F.sum as accumulator, groupBy
Join 8M-row events with 2M-row users. Pick the right join strategy and defend the choice with the data sizes.
F.broadcast hint, autoBroadcastJoinThreshold, explain plan reading
Given evidence of a hot key (1 task at 15.8GB shuffle read while others are 100MB), implement salting and unsalt the result correctly.
F.rand, crossJoin, multi-key join, post-join aggregation
Scenario mode: 5 production incidents to walk through
Each scenario shows the pager-grade evidence. The AI interviewer asks what you'd check, what you'd try, and why. The verdict is scored against the rubric.
199 tasks finish in 14-22 seconds. Task 200 runs for 7,140 seconds with 78% GC overhead. The job ran fine last week.
Spark UI metrics for the slow task:
shuffle read: 12.4 GB (others: 80-120 MB)
records read: 412 M (others: 6-8 M)
task heap: 4 GB (executor max: 6 GB)
- ▸Identify skew on a specific key
- ▸Propose salting OR broadcast OR AQE skew join
- ▸Defend the choice against the data shape
- ▸Name 1 way to verify the fix worked
events table: 800 M rows, partitioned by event_date
users table: 2 M rows, single partition
Join: events.join(users, "user_id")
SLA: 20 min
Actual: 2 hours, 1 task stuck at 15.8 GB shuffle read
spark.sql.autoBroadcastJoinThreshold: 10 MB (default)
- ▸Recognize the small side is over threshold (users ~120 MB)
- ▸Choose between broadcast hint and threshold bump
- ▸Spot the hot key as separate issue (15.8 GB on 1 task)
- ▸Propose 2-stage fix: broadcast + salt
pipeline:
raw = spark.read.parquet(...)  # 50 GB
cleaned = raw.transform(...)
cleaned.persist(MEMORY_ONLY)  # ← OOM here
cleaned.count()  # trigger
# downstream uses cleaned 3 times
cluster: 4 executors, 16 GB heap each (total: 64 GB)
cached size at OOM: 78 GB
- ▸Realize cached size (78 GB) exceeds total heap (64 GB): the on-disk Parquet size understates the in-memory size
- ▸Switch to MEMORY_AND_DISK or DISK_ONLY
- ▸Consider checkpoint() instead of persist()
- ▸Ask whether the 3 reuses actually need the same DF
Yesterday: spark.read.parquet("/data/events/") worked.
Today: AnalysisException: cannot resolve 'new_field'
upstream added a new optional column at 4pm yesterday.
production cluster has files from both before and after.
- ▸Explain Parquet schema-on-read behavior
- ▸Choose mergeSchema (with cost note) or explicit schema
- ▸Suggest CI test that would have caught this
- ▸Note: don't bury the bug, surface it in logs
nightly write:
df.write.partitionBy("year","month","day","hour")
.parquet("s3://lake/events")
result: 50,000 small files (~2 KB each) per day
downstream Presto LIST takes 12 minutes
- ▸Identify too-fine partitioning
- ▸Propose coalesce/repartition before write
- ▸Discuss bucketBy/sortBy as alternative
- ▸Question whether hourly partitioning is needed at all
PySpark interview practice FAQ
What kind of PySpark questions do FAANG and Spark-shop companies actually ask?
Do I need a Databricks account for this?
How is this different from reading a 'top 40 PySpark interview questions' blog?
What's the most common PySpark interview mistake?
How long should I prep for a PySpark interview?
Will the scenarios feel like a real production incident?
- 01 Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02 76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03 Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.