PySpark Practice in Your Browser

PySpark practice usually means installing Spark locally, signing in to Databricks Community Edition and waiting for a cluster to start, or downloading a GitHub repo of Jupyter notebooks. The Spark sandbox here runs in the browser without any of that. Write DataFrame code and submit it; the evaluator executes it against a seeded Spark session and checks row counts, schema, and output.


45 PySpark problems

PySpark 3.5 sandbox version

No setup steps or cluster waits

AQE on by default

How the sandbox actually works

Where your code runs, what's pre-loaded, what gets returned.

Path from your tab to evaluator output

  your browser tab
  │
  │ POST /pyspark/evaluate
  │   body: { problem_id, code, seed }
  ▼
  ┌──────────────────────────────────────────┐
  │  sandbox dispatcher                      │
  │  spawns a Spark driver in a container    │
  │  preloads seed Parquet files into RAM    │
  └──────────────┬───────────────────────────┘
                 │
                 ▼
  ┌──────────────────────────────────────────┐
  │  PySpark 3.5 + Python 3.11               │
  │    SparkSession.master = local[4]        │
  │    spark.sql.adaptive.enabled = true     │
  │    inputs:  /seed/{problem}/seed_N/*.pq  │
  │    output:  toJSON().collect()           │
  └──────────────┬───────────────────────────┘
                 │
                 ▼
  evaluator compares output rows and schema
  to expected, returns:
    pass/fail per seed, schema diff, row diff,
    physical plan link (EXPLAIN cost view)
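
The comparison step at the bottom of the diagram can be sketched in plain Python. This is a minimal illustration, not the site's evaluator: the function name `diff_rows` is hypothetical, and it assumes rows arrive as JSON strings (as `toJSON().collect()` would produce) and that row order does not matter, which the page does not state.

```python
import json
from collections import Counter


def diff_rows(actual_json, expected_json):
    """Order-insensitive comparison of two lists of JSON row strings."""

    def normalise(rows):
        # Re-serialise with sorted keys so field order inside a row is ignored.
        return Counter(json.dumps(json.loads(r), sort_keys=True) for r in rows)

    actual, expected = normalise(actual_json), normalise(expected_json)
    missing = sorted((expected - actual).elements())  # expected but not produced
    extra = sorted((actual - expected).elements())    # produced but not expected
    return {"passed": not missing and not extra, "missing": missing, "extra": extra}


expected = ['{"user_id": 1, "amount": 250}', '{"user_id": 2, "amount": 120}']
actual = ['{"amount": 120, "user_id": 2}', '{"user_id": 1, "amount": 999}']
result = diff_rows(actual, expected)
print(result["passed"])   # False: one row differs
print(result["missing"])
print(result["extra"])
```

Using a `Counter` rather than a `set` keeps duplicate rows significant, so an answer that drops or doubles a row still fails.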

7 PySpark categories with their canonical idiom

Each category has a short example of what the answer looks like. The evaluator runs against 10 seeded Spark sessions.

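DataFrame transformations · 12 problems · Easy-Medium

from pyspark.sql import functions as F

# parentheses let the method chain span several lines
(df.filter(F.col("amount") > 100)
   .withColumn("is_weekend",
               F.dayofweek("date").isin(1, 7))  # 1 = Sunday, 7 = Saturday
   .select("user_id", "is_weekend", "amount")
   .orderBy(F.desc("amount")))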

select, filter, withColumn, drop, alias, when/otherwise. Warm-up tier; fluency here keeps the rest of the interview moving.

Grouping and aggregation · 8 problems · Medium

(df.groupBy("category", "month")
   .agg(
       F.sum("revenue").alias("rev"),
       F.countDistinct("customer_id").alias("uniq"),
       # revenue from new customers only; other rows contribute 0
       F.sum(F.when(F.col("is_new"), F.col("revenue"))
              .otherwise(0)).alias("new_cust_rev"),
   ))

agg with named columns, conditional aggregation, distinct counts. The DataFrame equivalent of GROUP BY + HAVING + CASE WHEN.
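
To make the semantics of the conditional aggregation concrete, here is a plain-Python sketch of what the three output columns contain. The sample rows and the helper name `aggregate` are made up for illustration; only the column names come from the snippet above.

```python
from collections import defaultdict

rows = [
    {"category": "books", "month": "2024-01", "customer_id": 1, "revenue": 10.0, "is_new": True},
    {"category": "books", "month": "2024-01", "customer_id": 1, "revenue": 5.0, "is_new": True},
    {"category": "books", "month": "2024-01", "customer_id": 2, "revenue": 20.0, "is_new": False},
    {"category": "toys", "month": "2024-01", "customer_id": 3, "revenue": 7.0, "is_new": False},
]


def aggregate(rows):
    groups = defaultdict(lambda: {"rev": 0.0, "customers": set(), "new_cust_rev": 0.0})
    for r in rows:
        g = groups[(r["category"], r["month"])]
        g["rev"] += r["revenue"]                  # sum("revenue")
        g["customers"].add(r["customer_id"])      # countDistinct("customer_id")
        # sum(when(is_new, revenue).otherwise(0)): non-matching rows add 0
        g["new_cust_rev"] += r["revenue"] if r["is_new"] else 0.0
    return {
        key: {"rev": g["rev"], "uniq": len(g["customers"]), "new_cust_rev": g["new_cust_rev"]}
        for key, g in groups.items()
    }


summary = aggregate(rows)
print(summary[("books", "2024-01")])
```

The point to notice is that the conditional sum does not filter rows out of the group: `rev` and `uniq` still see every row, while `new_cust_rev` sees only the new-customer ones.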

Window functions · 8 problems · Medium

from pyspark.sql import Window

# latest event per user; event_id breaks ties on event_at
w = (Window
      .partitionBy("user_id")
      .orderBy(F.desc("event_at"), F.desc("event_id")))
(df.withColumn("rn", F.row_number().over(w))
   .filter("rn = 1")
   .drop("rn"))

Window.partitionBy, orderBy, row_number, lag, lead, running totals. PySpark window syntax differs from SQL in subtle ways the evaluator checks.
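
The dedup pattern above keeps one row per user, and the second sort key is what makes the result deterministic when timestamps tie. A plain-Python sketch of the same logic, with a hypothetical helper `latest_per_user` and invented sample events:

```python
def latest_per_user(events):
    # Sort descending by (event_at, event_id), mirroring the window's orderBy.
    ordered = sorted(events, key=lambda e: (e["event_at"], e["event_id"]), reverse=True)
    latest = {}
    for e in ordered:
        # The first row seen for a user is the one row_number() would label 1.
        latest.setdefault(e["user_id"], e)
    return latest


events = [
    {"user_id": 1, "event_id": 7, "event_at": "2024-05-01T10:00:00"},
    {"user_id": 1, "event_id": 9, "event_at": "2024-05-01T10:00:00"},  # same timestamp
    {"user_id": 1, "event_id": 3, "event_at": "2024-04-30T08:00:00"},
    {"user_id": 2, "event_id": 4, "event_at": "2024-05-02T12:00:00"},
]

latest = latest_per_user(events)
print(latest[1]["event_id"])  # 9: the tie on event_at is broken by the higher event_id
```

Without `event_id` in the ordering, either of the two tied rows could come back, which is the kind of nondeterminism a multi-seed evaluator tends to expose.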

Joins and broadcast hints · 6 problems · Medium-Hard

events.join(
    F.broadcast(users), "user_id", "left"
).join(
    products, ["product_id"], "inner"
)

Inner, left, full, semi, anti joins. broadcast() hint for small dimensions. Default threshold is 10MB; the evaluator expects you to know that or look it up before guessing.

Skew handling · 4 problems · Hard

# salt every event row with a random bucket 0-19
events_salted = events.withColumn(
    "salt", (F.rand() * 20).cast("int"))
# replicate each user row once per salt value
users_salted = users.crossJoin(
    F.broadcast(spark.range(20).select(
        F.col("id").cast("int").alias("salt"))))
events_salted.join(users_salted,
    ["user_id", "salt"], "left")

Salting hot keys, repartition vs coalesce, AQE settings, bucketed tables. The senior-level differentiator.
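
Why salting helps can be shown with a toy simulation. The sketch below is not Spark: `partition_for` is a stand-in partitioner (Spark hashes join keys with Murmur3), and the salt cycles deterministically instead of using `F.rand()` so the output is reproducible.

```python
from collections import Counter

NUM_PARTITIONS = 20
SALT_BUCKETS = 20


def partition_for(key_id, salt=0):
    # Toy partitioner, assumed for illustration only.
    return (key_id * 31 + salt) % NUM_PARTITIONS


hot_rows = [42] * 1000  # 1,000 events that all share user_id 42

# Without a salt, every hot row hashes to the same partition.
unsalted = Counter(partition_for(k) for k in hot_rows)

# With a salt, the composite key (user_id, salt) spreads the rows out.
salted = Counter(partition_for(k, i % SALT_BUCKETS) for i, k in enumerate(hot_rows))

print(max(unsalted.values()))  # 1000
print(max(salted.values()))    # 50
```

The trade-off is on the other side of the join: each user row has to be replicated once per salt value, so the dimension grows by a factor of `SALT_BUCKETS`.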

Performance debugging · 5 problems · Hard

# read physical plan
df.explain(mode="formatted")
# inspect via Spark UI metrics:
#   task durations, shuffle read/write,
#   GC time, executor heap

Read EXPLAIN output, identify shuffles, diagnose OOM. Problems provide Spark UI screenshots; you write the fix.
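
Identifying shuffles in a plan mostly means looking for `Exchange` operators. A rough sketch, assuming the plan has been captured as text; `count_shuffles` is a hypothetical helper, the two plan strings are simplified illustrations rather than real Spark output, and real plans with AQE contain extra wrapper nodes this does not handle.

```python
def count_shuffles(plan_text):
    """Count Exchange (shuffle) operators in a textual physical plan."""
    count = 0
    for line in plan_text.splitlines():
        # Strip tree-drawing characters so only the operator name remains.
        op = line.strip().lstrip("+-:* ")
        # BroadcastExchange does not start with "Exchange", so it is not counted.
        if op.startswith("Exchange"):
            count += 1
    return count


smj_plan = """SortMergeJoin [user_id], [user_id], LeftOuter
:- Sort [user_id ASC]
:  +- Exchange hashpartitioning(user_id, 200)
+- Sort [user_id ASC]
   +- Exchange hashpartitioning(user_id, 200)"""

bhj_plan = """BroadcastHashJoin [user_id], [user_id], LeftOuter, BuildRight
:- Scan parquet events
+- BroadcastExchange HashedRelationBroadcastMode"""

print(count_shuffles(smj_plan))  # 2
print(count_shuffles(bhj_plan))  # 0
```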

Caching and materialization · 2 problems · Medium

from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # trigger materialization
# downstream uses df 3 times
df.unpersist()

When to cache, what storage level, when to checkpoint instead, when to refactor and avoid the reuse.

Broadcast vs sort-merge, with physical plans

# Broadcast threshold by default is 10MB. The decision changes the plan.

events = spark.read.parquet("/seed/0/events.parquet")   # 800M rows
users  = spark.read.parquet("/seed/0/users.parquet")    # 2M rows, ~120MB

# AUTOMATIC: Spark sort-merges. users is over the 10MB threshold.
events.join(users, "user_id", "left")
#   == Physical Plan ==
#   SortMergeJoin [user_id], [user_id], LeftOuter
#     Sort [user_id ASC]                  cost: shuffle 18GB
#       Exchange hashpartitioning(user_id, 200)
#     Sort [user_id ASC]                  cost: shuffle 120MB
#       Exchange hashpartitioning(user_id, 200)

# HINT: force broadcast (users must fit in driver memory).
events.join(F.broadcast(users), "user_id", "left")
#   == Physical Plan ==
#   BroadcastHashJoin [user_id], [user_id], LeftOuter
#     BroadcastExchange HashedRelationBroadcastMode
#       (users replicated to every executor)
#     events scanned in place                cost: 0 shuffle

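# CONFIG: raise the threshold so Spark broadcasts users without a hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "200MB")
events.join(users, "user_id", "left")
#   users is estimated under 200MB, so the planner picks BroadcastHashJoin;
#   with AQE on, the choice can also be revised at runtime from shuffle stats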

# When to pick which:
#   broadcast() hint  -> you know users is small and fits driver memory
#   raise threshold   -> repeated joins with similarly sized dims
#   sort-merge        -> users is genuinely large (> a few hundred MB)

The decision between broadcast and sort-merge is the most-tested PySpark optimization. Reading the EXPLAIN output is the senior signal.
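
The three cases above reduce to a small decision rule. The sketch below is a deliberate simplification, not Spark's planner: `choose_join` is a hypothetical function that ignores join-type restrictions, statistics estimation, and AQE, and only models the threshold and the hint.

```python
MB = 1024 * 1024
DEFAULT_THRESHOLD = 10 * MB  # default for spark.sql.autoBroadcastJoinThreshold


def choose_join(small_side_bytes, threshold=DEFAULT_THRESHOLD, hinted=False):
    if hinted:
        # A broadcast() hint wins regardless of size.
        return "BroadcastHashJoin"
    # A threshold of -1 disables automatic broadcasting.
    if threshold >= 0 and small_side_bytes <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"


users_bytes = 120 * MB  # the ~120MB users table from the example

print(choose_join(users_bytes))                      # SortMergeJoin
print(choose_join(users_bytes, hinted=True))         # BroadcastHashJoin
print(choose_join(users_bytes, threshold=200 * MB))  # BroadcastHashJoin
print(choose_join(users_bytes, threshold=-1))        # SortMergeJoin
```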

PySpark practice options in May 2026

What each option offers, what it costs in setup time and money.

Resource	Execution	Problem count	Auto-scored	Performance tests	Cost
DataDriven (this site)	Browser, real Spark 3.5	45	Yes (10 seeds)	Yes	Free, no signup
areibman/pyspark_exercises (GitHub)	Local Jupyter	~70	No (self-check)	No	Free, requires install
Databricks Community Edition	Hosted Databricks	Self-built notebooks	No	Limited	Free signup, ~5 min cluster start
DataCamp DataLab	Hosted sandbox	Tutorials, not problems	Per lesson	Some	Subscription
DE Academy 30 Exercises	Code samples (no exec)	30	No	No	Course (paid)
StrataScratch	Some PySpark	Limited	Yes for SQL/Pandas	No	Premium for most

PySpark practice FAQ

Can I practice PySpark without installing Spark?

Yes. The browser sandbox runs PySpark 3.5 against seeded Parquet files. No local install, no Docker, no Databricks signup, no cluster start wait. The sandbox diagram on this page shows the path from your tab to evaluator output.

What sandbox limits should I expect?

One Spark driver per session running local[4] (four worker threads in a single JVM, not separate executors), with ~6GB heap. Datasets are capped at ~50M rows per problem. That is enough for correctness practice but not for the 'process 1B rows' question. Senior interviews care about reasoning at scale, which the in-browser environment can test with smaller datasets that share the same skew patterns.

What kinds of PySpark questions appear in interviews?

Mid-level: DataFrame transformations and groupBy aggregations. Senior: window functions, broadcast vs sort-merge selection, skew handling. Staff: design questions about partitioning, bucketing, and AQE configuration. The mix shifts from coding to debugging as seniority increases.

What's the broadcast threshold default?

10MB (spark.sql.autoBroadcastJoinThreshold). Spark auto-broadcasts the smaller side if its estimated size is at or below the threshold. Above it, Spark falls back to sort-merge join, which shuffles both sides. The broadcast hint forces a broadcast regardless of size, at the cost that the broadcast table must fit in driver memory and on every executor.

Pandas, Polars, or PySpark?

Depends on the target company. Pandas shows up across data engineering interviews generally. PySpark dominates at Spark shops (Databricks, Netflix, Uber, Airbnb). Polars is rare. The bank covers Pandas and PySpark; check /companies to see which one your target uses.

How many PySpark problems should I solve?

20-30 for a mid-level PySpark round: 8 DataFrame, 5 groupBy, 5 window, 3 join, 3-5 debugging. Senior adds 10-15 more weighted on performance and design. PySpark has a steeper learning curve than SQL; the API is less intuitive.

Will my code see real Spark behavior, including shuffles?

Yes. The sandbox is real Spark with AQE enabled by default. Shuffles happen, broadcast hints take effect, partition pruning works. The EXPLAIN output is real Spark physical plans. The only thing missing vs. a production cluster is the scale.

02 / Why practice

Open a DataFrame transformation problem

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition


Related practice

Browse PySpark Problems→

All 45 problems, filterable by category and difficulty.

PySpark Interview Mode→

Mock interview with AI follow-ups for PySpark rounds.

PySpark Cheat Sheet→

DataFrame API quick reference for interview lookup.