PySpark Interview Practice (Coding + Scenarios)

A PySpark interview is 2 different rounds wedged into 1. Round 1 is coding: write a DataFrame query that produces the right output. Round 2 is diagnosis: read a Spark UI screenshot, figure out why the job is slow, propose a fix. Almost every PySpark prep site addresses round 1. The scenarios below address round 2.

Know PySpark the way the interviewer who asks it knows it.

a PySpark query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
30 coding problems · 12 scenario walkthroughs · AI interviewer that pushes you on tradeoffs · PySpark 3.5 · Real sandbox, no signup

Coding vs scenario vs design, by seniority level

PySpark interviews shift toward debugging and design as seniority rises. The mix below is a rough guide to what to expect.

PySpark interview question types, by seniority level (approximate share of questions, %)

Level                      Coding   Scenario / debug   Design
Mid-level (L4 / SDE II)    70       20                 10
Senior (L5 / SDE III)      45       35                 20
Staff (L6 / Principal)     25       35                 40

Coding mode: 6 patterns interviewers reach for

Each problem has a multi-seed grader. Tiebreakers and frame clauses get checked explicitly.

01 DataFrame fluency
Easy · ~10 min

Given a 50M-row transactions DataFrame, derive is_weekend, filter to weekend rows, sort by amount DESC, return the top 100.

select, filter, withColumn, orderBy, limit

02 groupBy with conditional aggregation
Easy-Medium · ~12 min

Per category and month, compute total revenue, distinct customer count, and percentage of revenue from new customers in a single agg().

agg, F.when, F.countDistinct, named columns

03 Top-N per group via window
Medium · ~15 min

For each customer, return the 3 highest-value orders. Tiebreak on ordered_at DESC, then order_id ASC. The grader checks ties explicitly.

Window.partitionBy, orderBy with tiebreaker, row_number
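
The tiebreak order is where most top-N solutions lose points, so it helps to see the sort key spelled out. What follows is a plain-Python model of the ranking logic, not the PySpark API: the `top_n_per_customer` name, the `amount` column, and the sample rows are made up for illustration. In PySpark the same three keys would go into the window's orderBy before row_number is applied.

```python
from collections import defaultdict

def top_n_per_customer(orders, n=3):
    """Return the order_ids of the n highest-amount orders per customer.

    Ties on amount break on ordered_at DESC, then order_id ASC.
    """
    by_customer = defaultdict(list)
    for order in orders:
        by_customer[order["customer_id"]].append(order)

    top = {}
    for customer_id, rows in by_customer.items():
        # Stable sorts, applied from the least to the most significant key.
        ranked = sorted(rows, key=lambda o: o["order_id"])                    # order_id ASC
        ranked = sorted(ranked, key=lambda o: o["ordered_at"], reverse=True)  # ordered_at DESC
        ranked = sorted(ranked, key=lambda o: o["amount"], reverse=True)      # amount DESC
        top[customer_id] = [o["order_id"] for o in ranked[:n]]
    return top

# Hypothetical sample: four orders tie on amount for customer c1.
orders = [
    {"customer_id": "c1", "order_id": 1, "amount": 100, "ordered_at": "2024-01-01"},
    {"customer_id": "c1", "order_id": 2, "amount": 100, "ordered_at": "2024-01-02"},
    {"customer_id": "c1", "order_id": 3, "amount": 100, "ordered_at": "2024-01-02"},
    {"customer_id": "c1", "order_id": 4, "amount": 100, "ordered_at": "2024-01-01"},
    {"customer_id": "c1", "order_id": 5, "amount": 50, "ordered_at": "2024-01-03"},
    {"customer_id": "c2", "order_id": 6, "amount": 10, "ordered_at": "2024-01-01"},
]
result = top_n_per_customer(orders)
```

Without the last key (order_id), the two rows dated 2024-01-01 could come back in either order, which is the kind of non-determinism a multi-seed grader is built to expose.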

04 Sessionization
Medium-Hard · ~20 min

Group events into 30-minute timeout sessions per user. Return user_id, session_id, start, end, event_count.

F.lag, F.when, F.sum as accumulator, groupBy
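
The lag, flag, running-sum, groupBy recipe is easier to reason about once it is written out in plain Python first. The sketch below models that logic without Spark. It assumes that a gap strictly greater than 30 minutes starts a new session and that session_id is a per-user counter starting at 1; neither detail is stated in the problem, so confirm both with the interviewer.

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessionize(events, timeout=timedelta(minutes=30)):
    """events: iterable of (user_id, timestamp) tuples.

    Returns one dict per session with user_id, session_id, start, end, event_count.
    """
    sessions = []
    ordered = sorted(events)  # by user_id, then timestamp
    for user_id, rows in groupby(ordered, key=lambda e: e[0]):
        prev_ts = None    # plays the role of F.lag over the user's window
        session_id = 0    # running sum of the new-session flags
        current = None
        for _, ts in rows:
            # The F.when flag: first event, or the gap exceeded the timeout.
            is_new = prev_ts is None or ts - prev_ts > timeout
            if is_new:
                session_id += 1
                current = {"user_id": user_id, "session_id": session_id,
                           "start": ts, "end": ts, "event_count": 0}
                sessions.append(current)
            current["end"] = ts
            current["event_count"] += 1
            prev_ts = ts
    return sessions

# Hypothetical sample: gaps of 10, 30, and 31 minutes for user u1.
t0 = datetime(2024, 1, 1, 9, 0)
events = [
    ("u1", t0),
    ("u1", t0 + timedelta(minutes=10)),
    ("u1", t0 + timedelta(minutes=40)),
    ("u1", t0 + timedelta(minutes=71)),
    ("u2", t0),
]
sessions = sessionize(events)
```

The 30-minute gap stays in the first session and the 31-minute gap opens a second one, so the boundary condition is worth stating out loud before writing the window expression.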

05 Broadcast vs sort-merge selection
Medium-Hard · ~15 min

Join 8M-row events with 2M-row users. Pick the right join strategy and defend the choice with the data sizes.

F.broadcast hint, autoBroadcastJoinThreshold, explain plan reading
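
Defending the choice "with the data sizes" comes down to a back-of-envelope estimate of the small side against the broadcast threshold. The sketch below is only that estimate, under loud assumptions: the 60 bytes per row and the 512 MB hint budget are hypothetical numbers, and Spark's planner uses its own size statistics rather than a row count. The 10 MB constant is the documented default of spark.sql.autoBroadcastJoinThreshold.

```python
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold

def pick_join_strategy(small_rows, bytes_per_row, hint_budget_bytes,
                       threshold_bytes=AUTO_BROADCAST_THRESHOLD):
    """Return (strategy, estimated_bytes) for the smaller join side."""
    estimated = small_rows * bytes_per_row
    if estimated <= threshold_bytes:
        return "broadcast (automatic)", estimated
    if estimated <= hint_budget_bytes:
        # Too big for the default threshold, but small enough to ship to every executor.
        return "broadcast (explicit hint or raised threshold)", estimated
    return "sort-merge", estimated

# Hypothetical inputs: 2M users at an assumed 60 bytes per row, 512 MB budget.
strategy, estimated = pick_join_strategy(2_000_000, 60, 512 * 1024 * 1024)
```

Under these assumptions the users side lands at about 120 MB, over the default threshold but well within a budget an executor can hold, which is the argument for an explicit F.broadcast hint rather than waiting for the planner to choose it.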

06 Salting a hot key
Hard · ~25 min

Given evidence of a hot key (1 task at 15.8 GB shuffle read while others are 100 MB), implement salting and unsalt the result correctly.

F.rand, crossJoin, multi-key join, post-join aggregation
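
Salting has two halves that must agree: a random salt on the large side and a full replication of the small side across every salt value. The simulation below is a plain-Python model of that mechanism, not PySpark code; the `salted_join_sum` name, the multiplier column, and the sample data are invented for illustration. It checks the two properties an interviewer tends to probe: the hot key is spread across buckets, and the unsalted result is unchanged.

```python
import random
from collections import Counter, defaultdict

def salted_join_sum(facts, dims, num_salts, seed=0):
    """facts: list of (key, value); dims: dict mapping key -> multiplier.

    Returns ({key: sum(value * multiplier)}, Counter of rows per (key, salt)).
    """
    rng = random.Random(seed)
    # Large side: attach a random salt to every row (F.rand in a PySpark version).
    salted_facts = [((k, rng.randrange(num_salts)), v) for k, v in facts]
    # Small side: replicate each row once per salt value (the crossJoin step).
    salted_dims = {(k, s): m for k, m in dims.items() for s in range(num_salts)}
    bucket_sizes = Counter(salted_key for salted_key, _ in salted_facts)
    # Join on (key, salt), then unsalt by aggregating on the original key alone.
    totals = defaultdict(int)
    for (k, s), v in salted_facts:
        totals[k] += v * salted_dims[(k, s)]
    return dict(totals), bucket_sizes

# Hypothetical data: one key holds 99% of the rows.
facts = [("hot", 1)] * 10_000 + [("cold", 1)] * 100
dims = {"hot": 2, "cold": 3}
totals, bucket_sizes = salted_join_sum(facts, dims, num_salts=8)
```

The cost is on the small side, which grows by a factor of num_salts, so the salt count is a tradeoff between spreading the hot key and inflating the replicated table.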

Scenario mode: 5 production incidents to walk through

Each scenario shows the pager-grade evidence. The AI interviewer asks what you'd check, what you'd try, and why. The verdict is scored against the rubric.

S1: The 78% GC overhead mystery
Hard · 35 min
199 tasks finish in 14-22 seconds.
Task 200 runs for 7,140 seconds with 78% GC overhead.
The job ran fine last week.

Spark UI metrics for the slow task:
  shuffle read:     12.4 GB  (others: 80-120 MB)
  records read:     412 M    (others: 6-8 M)
  task heap:        4 GB     (executor max: 6 GB)
Rubric checks for
  • Identify skew on a specific key
  • Propose salting OR broadcast OR AQE skew join
  • Defend the choice against the data shape
  • Name 1 way to verify the fix worked
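
One way to make "identify skew" and "verify the fix" concrete is to apply the rule AQE's skew-join optimization documents: a partition counts as skewed when it is larger than a factor times the median partition size and also larger than an absolute threshold. The sketch below applies that rule to per-task shuffle-read sizes as a proxy. The defaults mirror spark.sql.adaptive.skewJoin.skewedPartitionFactor (5) and skewedPartitionThresholdInBytes (256 MB); the task sizes are rounded from the scenario, and using shuffle read as the partition size is an approximation.

```python
from statistics import median

MB = 1024 ** 2
GB = 1024 ** 3

def skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Return indexes of partitions that exceed both skew conditions."""
    med = median(sizes_bytes)
    return [i for i, size in enumerate(sizes_bytes)
            if size > factor * med and size > threshold_bytes]

# 199 ordinary tasks at roughly 100 MB and one task at 12.4 GB, as in S1.
sizes = [100 * MB] * 199 + [int(12.4 * GB)]
flagged = skewed_partitions(sizes)
skew_ratio = max(sizes) / median(sizes)
```

Re-running the same check on the task metrics after the fix, and seeing an empty list with a max-to-median ratio close to 1, is one defensible answer to the "how would you verify" question.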
S2: The 2-hour join that should be 20 minutes
Hard · 30 min
events  table: 800 M rows, partitioned by event_date
users   table: 2 M rows, single partition
Join:  events.join(users, "user_id")
SLA:   20 min
Actual: 2 hours, 1 task stuck at 15.8 GB shuffle read

spark.sql.autoBroadcastJoinThreshold: 10 MB (default)
Rubric checks for
  • Recognize the small side is over threshold (users ~120 MB)
  • Choose between broadcast hint and threshold bump
  • Spot the hot key as separate issue (15.8 GB on 1 task)
  • Propose 2-stage fix: broadcast + salt
S3: OOM on a cached DataFrame
Medium-Hard · 25 min
pipeline:
  raw = spark.read.parquet(...)         # 50 GB
  cleaned = raw.transform(...)
  cleaned.persist(StorageLevel.MEMORY_ONLY)  # lazy: only marks cleaned for caching
  cleaned.count()                            # trigger; the OOM surfaces here
  # downstream uses cleaned 3 times

cluster: 4 executors, 16 GB heap each (total: 64 GB)
cached size at OOM: 78 GB
Rubric checks for
  • Realize the cached size (78 GB) exceeds total heap (64 GB): compressed Parquet expands once deserialized in memory
  • Switch to MEMORY_AND_DISK or DISK_ONLY
  • Consider checkpoint() instead of persist()
  • Ask whether the 3 reuses actually need the same DF
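
The first rubric point gets sharper with a quick capacity estimate, because not all of the heap is available for caching. The arithmetic below uses Spark's unified memory model with its documented defaults (300 MB reserved, spark.memory.fraction of 0.6) and the cluster numbers from the scenario. It is an upper bound that assumes execution is using none of the shared region, which is optimistic for a real job.

```python
RESERVED_MB = 300  # fixed reservation in Spark's unified memory model

def unified_memory_mb(heap_mb, memory_fraction=0.6):
    """Memory shared by execution and storage on one executor, in MB."""
    return (heap_mb - RESERVED_MB) * memory_fraction

executors = 4
heap_mb = 16 * 1024           # 16 GB heap per executor
cache_needed_mb = 78 * 1024   # cached size reported at the OOM

capacity_mb = executors * unified_memory_mb(heap_mb)
fits_in_memory = cache_needed_mb <= capacity_mb
shortfall_gb = (cache_needed_mb - capacity_mb) / 1024
```

Even in the best case the cluster can hold roughly 38 GB of cached data against 78 GB needed, so the conversation moves quickly to spilling to disk, checkpointing, or caching a narrower DataFrame.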
S4: Schema evolution surprise
Medium · 20 min
Yesterday: spark.read.parquet("/data/events/") worked.
Today: AnalysisException: cannot resolve 'new_field'

upstream added a new optional column at 4pm yesterday.
production cluster has files from both before and after.
Rubric checks for
  • Explain Parquet schema-on-read behavior
  • Choose mergeSchema (with cost note) or explicit schema
  • Suggest CI test that would have caught this
  • Note: don't bury the bug, surface it in logs
S5: 50,000 small files
Medium · 20 min
nightly write:
  (df.write.partitionBy("year", "month", "day", "hour")
     .parquet("s3://lake/events"))
result: 50,000 small files (~2 KB each) per day
downstream Presto LIST takes 12 minutes
Rubric checks for
  • Identify too-fine partitioning
  • Propose coalesce/repartition before write
  • Discuss bucketBy/sortBy as alternative
  • Question whether hourly partitioning is needed at all
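
The numbers in the evidence already make the case for fewer files, and working them through is a good opening move. The arithmetic below uses only the figures from the scenario plus one assumption, a 128 MB target file size, which is a common rule of thumb rather than anything stated here.

```python
import math

files_per_day = 50_000
avg_file_kb = 2
hour_dirs_per_day = 24
target_file_mb = 128  # assumption: a typical target size for Parquet files

daily_mb = files_per_day * avg_file_kb / 1024            # total data written per day
files_per_hour_dir = files_per_day / hour_dirs_per_day   # files landing in each hour directory
mb_per_hour_dir = daily_mb / hour_dirs_per_day           # data volume per hour directory
files_needed_per_day = max(1, math.ceil(daily_mb / target_file_mb))
```

A full day is under 100 MB, so a single file per day would already be below the target size, while each hour directory currently receives about 2,000 files holding roughly 4 MB in total. That supports both rubric directions: repartitioning on the partition columns before the write so each directory gets about one file, and questioning whether the hour level is needed at all.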

PySpark interview practice FAQ

What kind of PySpark questions do FAANG and Spark-shop companies actually ask?
2 layers. Layer 1 is coding: write a window function, write a groupBy with conditional aggregation, defend a broadcast vs sort-merge choice. Layer 2 is scenario: read a Spark UI screenshot, diagnose the bottleneck, propose a fix. The seniority breakdown above shows how the mix shifts.
Do I need a Databricks account for this?
No. The PySpark sandbox runs in the browser. No Databricks Community Edition signup, no cluster start. The scenarios show real Spark UI metrics and physical plans; the AI interviewer pushes on the same tradeoffs a human interviewer would.
How is this different from reading a 'top 40 PySpark interview questions' blog?
2 things. Reading teaches you what to say. The graded coding mode tests whether you can write it. Scenario mode tests whether you can defend it under follow-up questioning. Blog posts can't test correctness or interruption-handling; this can.
What's the most common PySpark interview mistake?
Reaching for collect() or toPandas() mid-pipeline. It works on the practice problem and crashes at scale. The senior signal is recognizing that PySpark code should stay in DataFrame world until final output; collect() belongs after aggregation has reduced the size.
How long should I prep for a PySpark interview?
Mid-level: 20-30 coding problems, 4-6 scenarios, 2 weeks. Senior: add 10-15 more coding problems and 4-6 more scenarios weighted on debugging, 3-4 weeks. PySpark has steeper marginal returns than SQL because the syntax is less intuitive and the scenario reasoning is more learnable.
Will the scenarios feel like a real production incident?
They're modeled on production incidents from interview write-ups. The Spark UI metrics, heap sizes, and partition counts are realistic for the scenarios as posed. The diagnostic shape is real; you can use these to rehearse your oncall debugging vocabulary even outside interview prep.
Why practice

  1. Active recall beats re-reading by 50%
     Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume
     From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops
     Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.