PySpark Practice in Your Browser
PySpark practice usually means setting up Spark locally, signing into Databricks Community Edition and waiting for a cluster to start, or downloading a GitHub repo of Jupyter notebooks. The Spark sandbox here runs in the browser without any of that. Write DataFrame code and submit it; the evaluator executes it against a seeded Spark session and checks row counts, schema, and output.
Know PySpark the way the interviewer who asks it knows it.
How the sandbox actually works
Where your code runs, what's pre-loaded, what gets returned.
your browser tab
               │
               │  POST /pyspark/evaluate
               │  body: { problem_id, code, seed }
               ▼
┌──────────────────────────────────────────┐
│ sandbox dispatcher                       │
│ spawns a Spark driver in a container     │
│ preloads seed Parquet files into RAM     │
└──────────────┬───────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────┐
│ PySpark 3.5 + Python 3.11                │
│ SparkSession.master = local[4]           │
│ spark.sql.adaptive.enabled = true        │
│ inputs: /seed/{problem}/seed_N/*.pq      │
│ output: toJSON().collect()               │
└──────────────┬───────────────────────────┘
               │
               ▼
evaluator compares output rows and schema
to expected, returns:
pass/fail per seed, schema diff, row diff,
physical plan link (EXPLAIN cost view)

7 PySpark categories with their canonical idiom
Each category has a short example of what the answer looks like. The evaluator runs against 10 seeded Spark sessions.
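To make the per-seed scoring concrete, here is a minimal plain-Python sketch of how a submission's rows might be compared with the expected rows for each seed. It assumes the output arrives as one JSON string per row, which is what `toJSON().collect()` produces. The function names, the order-insensitive comparison, and the report fields are illustrative assumptions, not the site's actual evaluator, and the schema comparison is left out.

```python
import json
from collections import Counter


def canonical(row_json):
    """Parse one JSON row and re-serialize it with sorted keys."""
    return json.dumps(json.loads(row_json), sort_keys=True)


def compare_seed(actual_rows, expected_rows):
    """Compare two lists of JSON row strings as unordered multisets."""
    actual = Counter(canonical(r) for r in actual_rows)
    expected = Counter(canonical(r) for r in expected_rows)
    missing = sorted((expected - actual).elements())      # expected but not produced
    unexpected = sorted((actual - expected).elements())   # produced but not expected
    return {
        "passed": not missing and not unexpected,
        "row_count": (len(actual_rows), len(expected_rows)),
        "missing": missing,
        "unexpected": unexpected,
    }


def evaluate(outputs_by_seed, expected_by_seed):
    """Return one report per seed (hypothetical report shape)."""
    return {
        seed: compare_seed(outputs_by_seed.get(seed, []), expected)
        for seed, expected in expected_by_seed.items()
    }
```

Comparing rows as multisets means duplicate rows still count, while column order inside a row and row order in the output do not. A problem that requires a specific `orderBy` would need an ordered comparison instead.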
from pyspark.sql import functions as F  # F is used by every snippet below

(
    df.filter(F.col("amount") > 100)
    # dayofweek: 1 = Sunday, 7 = Saturday
    .withColumn("is_weekend", F.dayofweek("date").isin(1, 7))
    .select("user_id", "is_weekend", "amount")
    .orderBy(F.desc("amount"))
)
select, filter, withColumn, drop, alias, when/otherwise. Warm-up tier; fluency here keeps the rest of the interview moving.
(
    df.groupBy("category", "month")
    .agg(
        F.sum("revenue").alias("rev"),
        F.countDistinct("customer_id").alias("uniq"),
        # conditional aggregation: revenue from new customers only
        F.sum(F.when(F.col("is_new"), F.col("revenue"))
              .otherwise(0)).alias("new_cust_rev"),
    )
)
agg with named columns, conditional aggregation, distinct counts. The DataFrame equivalent of GROUP BY + HAVING + CASE WHEN.
from pyspark.sql import Window

w = (Window
     .partitionBy("user_id")
     .orderBy(F.desc("event_at"), F.desc("event_id")))
(
    df.withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")  # keep the newest row per user
    .drop("rn")
)
Window.partitionBy, orderBy, row_number, lag, lead, running totals. PySpark window syntax differs from SQL in subtle ways the evaluator checks.
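When checking a window-based dedup by hand, it can help to have a plain-Python reference for what "latest row per user" should return. The sketch below mirrors the ordering in the snippet above (newest `event_at` first, ties broken by the highest `event_id`). The sample rows and the function name `latest_per_user` are made up for illustration.

```python
def latest_per_user(rows):
    """Keep, for each user_id, the row with the greatest (event_at, event_id)."""
    best = {}
    for row in rows:
        key = (row["event_at"], row["event_id"])
        current = best.get(row["user_id"])
        if current is None or key > (current["event_at"], current["event_id"]):
            best[row["user_id"]] = row
    return list(best.values())


# Hypothetical sample data; ISO-8601 strings compare correctly as text.
rows = [
    {"user_id": "a", "event_at": "2024-01-01T10:00:00", "event_id": 1},
    {"user_id": "a", "event_at": "2024-01-02T09:00:00", "event_id": 2},
    {"user_id": "a", "event_at": "2024-01-02T09:00:00", "event_id": 3},  # tie on event_at
    {"user_id": "b", "event_at": "2024-01-01T08:00:00", "event_id": 4},
]
latest = latest_per_user(rows)
```

The second sort key matters: when two rows tie on `event_at` and nothing else breaks the tie, `row_number()` may assign rank 1 to either of them, so the result can differ between runs.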
events.join(
F.broadcast(users), "user_id", "left"
).join(
products, ["product_id"], "inner"
)
Inner, left, full, semi, anti joins. broadcast() hint for small dimensions. Default threshold is 10MB; the evaluator expects you to know that or look it up before guessing.
# salt the hot key: add a random integer in [0, 20) to every event
events_salted = events.withColumn(
    "salt", (F.rand() * 20).cast("int"))
# replicate each user row once per salt value
users_salted = users.crossJoin(
    F.broadcast(spark.range(20).select(
        F.col("id").alias("salt"))))
events_salted.join(users_salted,
    ["user_id", "salt"], "left")
Salting hot keys, repartition vs coalesce, AQE settings, bucketed tables. The senior-level differentiator.
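To see why salting helps, the following plain-Python simulation models the same idea without Spark: each event gets a random salt, each user row is copied once per salt value, and the join key becomes `(user_id, salt)`. The data, the fixed random seed, and the helper names are assumptions made for this sketch. It reuses the fan-out of 20 from the snippet above.

```python
import random
from collections import Counter

NUM_SALTS = 20  # same fan-out as the PySpark snippet


def salt_events(events, rng):
    """Attach a random salt in [0, NUM_SALTS) to each (user_id, payload) event."""
    return [(user_id, rng.randrange(NUM_SALTS), payload) for user_id, payload in events]


def replicate_users(users):
    """Copy every user once per salt value so each salted key finds a match."""
    return {
        (user_id, salt): name
        for user_id, name in users.items()
        for salt in range(NUM_SALTS)
    }


def salted_left_join(events_salted, users_salted):
    """Left join on (user_id, salt); unmatched events get None."""
    return [
        (user_id, payload, users_salted.get((user_id, salt)))
        for user_id, salt, payload in events_salted
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
events = [("hot", i) for i in range(10_000)] + [("cold", i) for i in range(10)]
users = {"hot": "Ada", "cold": "Lin"}

events_salted = salt_events(events, rng)
joined = salted_left_join(events_salted, replicate_users(users))

# Rows per join key: the hot user's 10,000 events are now spread over many keys.
rows_per_key = Counter((user_id, salt) for user_id, salt, _ in events_salted)
largest_group = max(rows_per_key.values())
```

The trade-off shows up in `replicate_users`: the dimension side grows by a factor of `NUM_SALTS`, so salting pays off only when the skew on the fact side costs more than the extra copies.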
# read physical plan
df.explain(mode="formatted")
# inspect via Spark UI metrics:
# task durations, shuffle read/write,
# GC time, executor heap
Read EXPLAIN output, identify shuffles, diagnose OOM. Problems provide Spark UI screenshots; you write the fix.
from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # trigger materialization
# downstream uses df 3 times
df.unpersist()
When to cache, what storage level, when to checkpoint instead, when to refactor and avoid the reuse.
Broadcast vs sort-merge, with physical plans
# Broadcast threshold by default is 10MB. The decision changes the plan.
events = spark.read.parquet("/seed/0/events.parquet") # 800M rows
users = spark.read.parquet("/seed/0/users.parquet") # 2M rows, ~120MB
# AUTOMATIC: Spark sort-merges. users is over the 10MB threshold.
events.join(users, "user_id", "left")
# == Physical Plan ==
# SortMergeJoin [user_id], [user_id], LeftOuter
# Sort [user_id ASC] cost: shuffle 18GB
# Exchange hashpartitioning(user_id, 200)
# Sort [user_id ASC] cost: shuffle 120MB
# Exchange hashpartitioning(user_id, 200)
# HINT: force broadcast (users must fit in memory on the driver and on every executor).
events.join(F.broadcast(users), "user_id", "left")
# == Physical Plan ==
# BroadcastHashJoin [user_id], [user_id], LeftOuter
# BroadcastExchange HashedRelationBroadcastMode
# (users replicated to every executor)
# events scanned in place cost: 0 shuffle
# CONFIG: raise the threshold and rely on AQE auto-broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "200MB")
events.join(users, "user_id", "left")
# AQE detects users < 200MB, broadcasts at runtime
# When to pick which:
# broadcast() hint -> you know users is small and fits in driver and executor memory
# raise threshold -> repeated joins with similarly sized dims
# sort-merge -> users is genuinely large (> a few hundred MB)
The decision between broadcast and sort-merge is the most-tested PySpark optimization. Reading the EXPLAIN output is the senior signal.
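The size rule behind the three cases above can be written down as a small plain-Python function. This is a deliberately simplified model, not Spark's planner: the real decision also depends on join type, the quality of table statistics, and other join strategies such as shuffled hash join. The function name and its parameters are hypothetical; the 10MB default and the convention that a negative threshold disables automatic broadcasting follow the documented behavior of `spark.sql.autoBroadcastJoinThreshold`.

```python
MB = 1024 * 1024
DEFAULT_THRESHOLD = 10 * MB  # default of spark.sql.autoBroadcastJoinThreshold


def choose_join_strategy(dim_bytes, threshold_bytes=DEFAULT_THRESHOLD, hinted=False):
    """Return "broadcast" or "sort_merge" for an equi-join against one dimension.

    Simplified model: an explicit broadcast() hint wins; otherwise the
    dimension is broadcast only if its estimated size is within the threshold.
    A negative threshold disables automatic broadcasting.
    """
    if hinted:
        return "broadcast"
    if threshold_bytes >= 0 and dim_bytes <= threshold_bytes:
        return "broadcast"
    return "sort_merge"


users_bytes = 120 * MB  # the ~120MB users table from the example above

automatic = choose_join_strategy(users_bytes)                          # default threshold
with_hint = choose_join_strategy(users_bytes, hinted=True)             # broadcast() hint
raised = choose_join_strategy(users_bytes, threshold_bytes=200 * MB)   # raised threshold
```

With the default threshold the 120MB table falls back to sort-merge; the hint or the raised threshold both flip it to broadcast, which matches the three plans shown above.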
PySpark practice options in May 2026
What each option offers, what it costs in setup time and money.
| Resource | Execution | Problem count | Auto-scored | Performance tests | Cost |
|---|---|---|---|---|---|
| DataDriven (this site) | Browser, real Spark 3.5 | 45 | Yes (10 seeds) | Yes | Free, no signup |
| areibman/pyspark_exercises (GitHub) | Local Jupyter | ~70 | No (self-check) | No | Free, requires install |
| Databricks Community Edition | Hosted Databricks | Self-built notebooks | No | Limited | Free signup, ~5 min cluster start |
| DataCamp DataLab | Hosted sandbox | Tutorials, not problems | Per lesson | Some | Subscription |
| DE Academy 30 Exercises | Code samples (no exec) | 30 | No | No | Course (paid) |
| StrataScratch | Some PySpark | Limited | Yes for SQL/Pandas | No | Premium for most |
PySpark practice FAQ
Can I practice PySpark without installing Spark?
What sandbox limits should I expect?
What kinds of PySpark questions appear in interviews?
What's the broadcast threshold default?
Pandas, Polars, or PySpark?
How many PySpark problems should I solve?
Will my code see real Spark behavior, including shuffles?
Open a DataFrame transformation problem
- 01 Active recall beats re-reading by 50%
  Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02 76% of hiring managers reject on the coding task, not the resume
  From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03 Five problem shapes cover 80% of data engineer loops
  Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.