PySpark Interview Practice (Coding + Scenarios)

A PySpark interview is 2 different rounds wedged into 1. Round 1 is coding: write a DataFrame query that produces the right output. Round 2 is diagnosis: read a Spark UI screenshot, figure out why the job is slow, propose a fix. Almost every PySpark prep site addresses round 1. The scenarios below address round 2.

Coding problems

Scenario walkthroughs

Pushes you on tradeoffs

PySpark 3.5

Real sandbox, no signup

Coding vs scenario vs design, by seniority level

PySpark interviews shift toward debugging and design as level rises. The mix below is roughly what to expect at each level.

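PySpark interview question types, by seniority level (share of coding, scenario/debug, and design)

Mid-level (L4 / SDE II): 70% coding, 20% scenario/debug, 10% design

Senior (L5 / SDE III): 45% coding, 35% scenario/debug, 20% design

Staff (L6 / Principal): 25% coding, 35% scenario/debug, 40% design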

Coding mode: 6 patterns interviewers reach for

Each problem has a multi-seed grader. Tiebreakers and frame clauses get checked explicitly.

01 DataFrame fluency

Easy · ~10 min

Given a 50M-row transactions DataFrame, derive is_weekend, filter to weekend rows, sort by amount DESC, return the top 100.

select, filter, withColumn, orderBy, limit

02 groupBy with conditional aggregation

Easy-Medium · ~12 min

Per category and month, compute total revenue, distinct customer count, and percentage of revenue from new customers in a single agg().

agg, F.when, F.countDistinct, named columns

03 Top-N per group via window

Medium · ~15 min

For each customer, return the 3 highest-value orders. Tiebreak on ordered_at DESC, then order_id ASC. The grader checks ties explicitly.

Window.partitionBy, orderBy with tiebreaker, row_number

04 Sessionization

Medium-Hard · ~20 min

Group events into 30-minute timeout sessions per user. Return user_id, session_id, start, end, event_count.

F.lag, F.when, F.sum as accumulator, groupBy

05 Broadcast vs sort-merge selection

Medium-Hard · ~15 min

Join 8M-row events with 2M-row users. Pick the right join strategy and defend the choice with the data sizes.

F.broadcast hint, autoBroadcastJoinThreshold, explain plan reading

06 Salting a hot key

Hard · ~25 min

Given evidence of a hot key (1 task at 15.8 GB shuffle read while others are at 100 MB), implement salting and unsalt the result correctly.

F.rand, crossJoin, multi-key join, post-join aggregation

Scenario mode: 5 production incidents to walk through

Each scenario shows the pager-grade evidence. The AI interviewer asks what you'd check, what you'd try, and why. The verdict is scored against the rubric.

S1: The 78% GC overhead mystery

Hard · 35 min

199 tasks finish in 14-22 seconds.
Task 200 runs for 7,140 seconds with 78% GC overhead.
The job ran fine last week.

Spark UI metrics for the slow task:
  shuffle read:     12.4 GB  (others: 80-120 MB)
  records read:     412 M    (others: 6-8 M)
  task heap:        4 GB     (executor max: 6 GB)

Rubric checks for

▸Identify skew on a specific key
▸Propose salting OR broadcast OR AQE skew join
▸Defend the choice against the data shape
▸Name 1 way to verify the fix worked
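
To make the skew argument concrete, here is a plain-Python sketch that applies the skew test documented for AQE's skew join to the numbers above: a partition counts as skewed when it exceeds both a multiple of the median partition size and an absolute byte threshold. The defaults used (factor 5, threshold 256 MB) are the documented defaults for `spark.sql.adaptive.skewJoin.skewedPartitionFactor` and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`. Treating the 199 normal tasks as exactly 100 MB each is an assumption based on the 80-120 MB range.

```python
from statistics import median

MB = 1024 ** 2
GB = 1024 ** 3


def is_skewed(size, sizes, factor=5.0, threshold=256 * MB):
    """Skewed if larger than factor x median AND larger than the byte threshold."""
    return size > factor * median(sizes) and size > threshold


# 199 tasks at about 100 MB (assumed midpoint) plus the one slow task at 12.4 GB.
shuffle_reads = [100 * MB] * 199 + [int(12.4 * GB)]

skewed = [s for s in shuffle_reads if is_skewed(s, shuffle_reads)]
ratio = max(shuffle_reads) / median(shuffle_reads)

print(f"{len(skewed)} skewed partition(s), largest is {ratio:.0f}x the median")
```

A ratio above 100x points at one key carrying most of the data. One way to verify a fix is to rerun the job and confirm that the maximum shuffle read per task falls back toward the median.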

S2: The 2-hour join that should be 20 minutes

Hard · 30 min

events  table: 800 M rows, partitioned by event_date
users   table: 2 M rows, single partition
Join:  events.join(users, "user_id")
SLA:   20 min
Actual: 2 hours, 1 task stuck at 15.8 GB shuffle read

spark.sql.autoBroadcastJoinThreshold: 10 MB (default)

Rubric checks for

▸Recognize the small side is over threshold (users ~120 MB)
▸Choose between broadcast hint and threshold bump
▸Spot the hot key as separate issue (15.8 GB on 1 task)
▸Propose 2-stage fix: broadcast + salt
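
The first rubric item is a back-of-envelope size check, sketched here in plain Python. The 60 bytes per row is an assumption chosen so that 2 M rows lands on the ~120 MB the rubric quotes. The 10 MB threshold is the default shown in the evidence.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # spark.sql.autoBroadcastJoinThreshold default


def estimated_size_bytes(rows, avg_row_bytes):
    """Rough in-memory size estimate: row count times average row width."""
    return rows * avg_row_bytes


# Assumption: about 60 bytes per user row.
users_bytes = estimated_size_bytes(2_000_000, 60)
auto_broadcast = users_bytes <= DEFAULT_THRESHOLD_BYTES

print(f"users ~ {users_bytes / 1e6:.0f} MB, auto-broadcast: {auto_broadcast}")
```

Because the estimate is over the threshold, Spark falls back to a shuffle join unless the candidate adds an explicit broadcast hint or raises the threshold. Neither change addresses the hot key, which is why the rubric treats it as a separate issue.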

S3: OOM on a cached DataFrame

Medium-Hard · 25 min

pipeline:
  raw = spark.read.parquet(...)         # 50 GB
  cleaned = raw.transform(...)
  cleaned.persist(StorageLevel.MEMORY_ONLY)  # ← OOM here
  cleaned.count()                       # trigger
  # downstream uses cleaned 3 times

cluster: 4 executors, 16 GB heap each (total: 64 GB)
cached size at OOM: 78 GB

Rubric checks for

▸Realize cached size (78 GB) > total heap (64 GB): compressed Parquet on disk expands when deserialized in memory
▸Switch to MEMORY_AND_DISK or DISK_ONLY
▸Consider checkpoint() instead of persist()
▸Ask whether the 3 reuses actually need the same DF
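
The first rubric item can be tightened with a little arithmetic, sketched here in plain Python: not all of the heap is available for caching. The sketch assumes the default unified memory settings, with 300 MB reserved per executor and `spark.memory.fraction` at 0.6. The scenario does not say whether these were changed.

```python
RESERVED_MB = 300       # fixed reserved memory per executor
MEMORY_FRACTION = 0.6   # spark.memory.fraction default (assumed unchanged)


def unified_memory_gb(heap_gb, executors):
    """Upper bound on memory shared by execution and storage across the cluster."""
    per_executor_gb = (heap_gb * 1024 - RESERVED_MB) * MEMORY_FRACTION / 1024
    return per_executor_gb * executors


total_heap_gb = 4 * 16
cache_ceiling_gb = unified_memory_gb(heap_gb=16, executors=4)
cached_gb = 78

print(f"heap {total_heap_gb} GB, cache ceiling ~{cache_ceiling_gb:.1f} GB, needed {cached_gb} GB")
```

Under those assumptions the 78 GB cache is roughly double what the cluster could hold even if nothing else needed memory. That supports moving to a storage level that can spill to disk.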

S4: Schema evolution surprise

Medium · 20 min

Yesterday: spark.read.parquet("/data/events/") worked.
Today: AnalysisException: cannot resolve 'new_field'

upstream added a new optional column at 4pm yesterday.
production cluster has files from both before and after.

Rubric checks for

▸Explain Parquet schema-on-read behavior
▸Choose mergeSchema (with cost note) or explicit schema
▸Suggest CI test that would have caught this
▸Note: don't bury the bug, surface it in logs

S5: 50,000 small files

Medium · 20 min

nightly write:
  (df.write.partitionBy("year", "month", "day", "hour")
     .parquet("s3://lake/events"))
result: 50,000 small files (~2 KB each) per day
downstream Presto LIST takes 12 minutes

Rubric checks for

▸Identify too-fine partitioning
▸Propose coalesce/repartition before write
▸Discuss bucketBy/sortBy as alternative
▸Question whether hourly partitioning is needed at all

PySpark interview practice FAQ

What kind of PySpark questions do FAANG and Spark-shop companies actually ask?

2 layers. Layer 1 is coding: write a window function, write a groupBy with conditional aggregation, defend a broadcast vs sort-merge choice. Layer 2 is scenario: read a Spark UI screenshot, diagnose the bottleneck, propose a fix. At L4 the split is roughly 70/30 coding to scenario-plus-design; at L5 coding drops to about 45%; at L6+ coding shrinks to roughly a quarter of the loop and design becomes the largest share.

Do I need a Databricks account for this?

No. The PySpark sandbox runs in the browser. No Databricks Community Edition signup, no cluster start. The scenarios show real Spark UI metrics and physical plans; the AI interviewer pushes on the same tradeoffs a human interviewer would.

How is this different from reading a 'top 40 PySpark interview questions' blog?

2 things. Reading teaches you what to say. The graded coding mode tests whether you can write it. Scenario mode tests whether you can defend it under follow-up questioning. Blog posts can't test correctness or interruption-handling; this can.

What's the most common PySpark interview mistake?

Reaching for collect() or toPandas() mid-pipeline. Works on the practice problem, crashes at scale. The senior signal is recognizing that PySpark code should stay in DataFrame world until final output; collect() belongs after aggregation has reduced the size.

How long should I prep for a PySpark interview?

Mid-level: 20-30 coding problems, 4-6 scenarios, 2 weeks. Senior: add 10-15 more coding problems and 4-6 more scenarios weighted on debugging, 3-4 weeks. PySpark has steeper marginal returns than SQL because the syntax is less intuitive and the scenario reasoning is more learnable.

Will the scenarios feel like a real production incident?

They're modeled on production incidents from interview write-ups. The Spark UI metrics, heap sizes, and partition counts are realistic for each scenario as posed. The diagnostic shape is real; you can use these to rehearse your on-call debugging vocabulary even outside interview prep.

02 / Why practice

01 Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

02 76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

03 Five problem shapes cover 80% of data engineer loops
Parsing and reshaping, sessionization, dedup with tie-breaks, streaming aggregation, top-N-per-group. Writing them by hand turns the unfamiliar into pattern recognition.

Related practice

PySpark Coding Practice

45 problems in coding mode against the same Spark sandbox.

PySpark Interview Questions

Q+A format with worked answers by seniority level.

Spark Mock Interview

Full multi-phase simulation: think, code, discuss, verdict.