PySpark practice problems for data engineer interview prep. Live Spark sandbox with skew-engineered test data. DataFrame transformations and actions. Broadcast versus sort-merge join decisions. Salt-and-rebalance for hot keys. Spark UI reading. Built for the 30 percent of data engineer loops at Spark-first companies (Databricks, Netflix, Uber, Airbnb).

PySpark practice problems for data engineer roles run in a live Spark sandbox with test data engineered to surface common Spark bugs. Each problem has public test cases visible in the prompt and hidden test cases that include skew-engineered partitions (one partition with 10x the median data volume), null-heavy columns that break naive aggregations, and a performance budget that fails any solution running more than 3x slower than the canonical baseline.

The catalog covers six question shapes. Join an 800M-row events table with a 2M-row users table: broadcast users, defend the threshold choice, and state the spark.sql.autoBroadcastJoinThreshold setting (10 MB by default). The same problem at 800M-by-800M: sort-merge join, partition strategy, repartition before the join. Aggregate by user_id where 5 percent of users have 95 percent of events: identify the skew with df.groupBy("user_id").count().orderBy("count", ascending=False).limit(10), salt the key with a mod-N suffix, aggregate by the salted key, then strip the salt and re-aggregate. Implement an SCD Type 2 merge in the DataFrame API: window functions in PySpark, MERGE INTO in Delta, or the manual append-and-expire pattern. Read a Spark UI screenshot showing one task at 8x the median time: identify the skewed partition and explain what to change. The explain-plan question: df.explain() shows a SortMergeJoin where a BroadcastHashJoin would be faster; say which setting to flip and why.
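The salt-and-re-aggregate steps from the skewed user_id problem can be modeled in a few lines. This is a plain-Python sketch, not Spark code, so it runs without a cluster; in PySpark the salt would typically come from an expression such as F.floor(F.rand() * N) and each stage would be a groupBy(...).agg(...). NUM_SALTS, salted_counts, and the toy events list are illustrative assumptions, not part of the catalog.

```python
import random
from collections import Counter

NUM_SALTS = 8  # assumed salt count (the "N" in mod-N); tune it to the skew


def salted_counts(user_ids, num_salts=NUM_SALTS, seed=0):
    """Count events per user_id in two stages: by salted key, then by original key."""
    rng = random.Random(seed)

    # Stage 1: attach a salt in [0, N) so one hot user_id spreads over N salted keys.
    partial = Counter()
    for user_id in user_ids:
        salt = rng.randrange(num_salts)
        partial[(user_id, salt)] += 1

    # Stage 2: drop the salt and re-aggregate the partial counts.
    final = Counter()
    for (user_id, _salt), count in partial.items():
        final[user_id] += count
    return partial, final


# Skewed toy input: one hot user holds 95 of 100 events.
events = ["hot"] * 95 + ["u1", "u2", "u3", "u4", "u5"]

# Identify step: the plain-Python analogue of groupBy/count/orderBy/limit.
hot_keys = Counter(events).most_common(1)

partial, final = salted_counts(events)
assert final == Counter(events)        # salting must not change the answer
largest_group = max(partial.values())  # biggest stage-1 group after salting
```

The point of the first stage is that no single group holds all 95 hot rows any more; the second stage is cheap because it only combines at most N partial counts per key. This works for sums and counts; aggregates that do not combine this way (an exact distinct count, for instance) need a different second stage.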

Spark UI reading is the senior-versus-mid signal. The practice catalog includes 8 screenshots, each with a specific anomaly for the data engineer to diagnose. Five examples follow. Screenshot 1: maximum task duration is 10x the median; identify the hot key. Screenshot 2: maximum shuffle read is 10x the median; identify data skew. Screenshot 3: spill greater than 0; identify memory pressure. Screenshot 4: number of tasks much lower than the number of available executor cores; identify under-parallelism. Screenshot 5: GC time greater than 10 percent of task time; identify garbage collection pressure. Each comes with a rubric verdict naming the cause and the fix.
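The five checks above can be written down as a small triage function. This is a hedged sketch in plain Python over numbers transcribed by hand from a stage page; StageMetrics, diagnose, and the field names are hypothetical, and treating under-parallelism as "fewer tasks than available cores" is an assumption of this sketch rather than the catalog's rubric.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class StageMetrics:
    """Hypothetical per-stage numbers copied from the Spark UI."""
    task_durations_s: List[float]  # one entry per task, in seconds
    shuffle_read_mb: List[float]   # one entry per task, in MB
    spill_mb: float                # total spill for the stage
    gc_time_s: float               # total GC time across all tasks
    total_cores: int               # executor cores available to the stage


def _max_over_median(values: List[float], ratio: float) -> bool:
    """True when the largest value is at least `ratio` times the median."""
    med = median(values)
    return med > 0 and max(values) >= ratio * med


def diagnose(m: StageMetrics, skew_ratio: float = 10.0, gc_fraction: float = 0.10) -> List[str]:
    findings = []
    if _max_over_median(m.task_durations_s, skew_ratio):
        findings.append("task-duration skew: look for a hot key")
    if _max_over_median(m.shuffle_read_mb, skew_ratio):
        findings.append("shuffle-read skew: data skew")
    if m.spill_mb > 0:
        findings.append("spill: memory pressure")
    if len(m.task_durations_s) < m.total_cores:
        findings.append("fewer tasks than cores: under-parallelism")
    if m.gc_time_s > gc_fraction * sum(m.task_durations_s):
        findings.append("high GC share: garbage collection pressure")
    return findings


# Ten tasks, nine at 10 s and one at 100 s, with matching shuffle-read skew.
example = StageMetrics(
    task_durations_s=[10.0] * 9 + [100.0],
    shuffle_read_mb=[50.0] * 9 + [500.0],
    spill_mb=0.0,
    gc_time_s=5.0,
    total_cores=4,
)
findings = diagnose(example)
```

For the example stage the function reports the two skew findings only: there is no spill, ten tasks exceed four cores, and 5 s of GC is well under 10 percent of the 190 s of total task time.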

AQE (Adaptive Query Execution) handling. AQE was introduced in Spark 3.0 and has been on by default since Spark 3.2; the data engineer should know what AQE does (skew-join detection, runtime conversion of sort-merge joins to broadcast joins, partition coalescing) and when to override it. Practice problems include scenarios where AQE catches the skew automatically and scenarios where manual intervention is needed because AQE cannot identify the skew at the right stage boundary, for example when the skew sits in a groupBy aggregation rather than a join.
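One way to reason about when AQE will or will not step in is to apply its documented skew-join rule by hand: a partition counts as skewed only if it is larger than a factor times the median partition size and also larger than an absolute byte threshold. The sketch below models that rule in plain Python. The defaults shown (factor 5, threshold 256 MB) are the ones documented for Spark 3.x as far as I know, so check them against your version; aqe_flags_skew and the sample partition sizes are hypothetical.

```python
from statistics import median

MB = 1024 * 1024

# Real Spark setting names; on a live session each would be applied with
# spark.conf.set(key, value). Values here are the usual "on" settings.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def aqe_flags_skew(partition_bytes, factor=5, threshold_bytes=256 * MB):
    """Model of the skew-join rule: both conditions must hold for a partition.

    factor          ~ spark.sql.adaptive.skewJoin.skewedPartitionFactor
    threshold_bytes ~ spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
    """
    med = median(partition_bytes)
    return [size > factor * med and size > threshold_bytes for size in partition_bytes]


# 10x the median, but only 200 MB: under the byte threshold, so not flagged.
small_but_skewed = [20 * MB] * 9 + [200 * MB]
# 10x the median and 1000 MB: both conditions hold, so the last partition is flagged.
large_and_skewed = [100 * MB] * 9 + [1000 * MB]

small_flags = aqe_flags_skew(small_but_skewed)
large_flags = aqe_flags_skew(large_and_skewed)
```

The first case is the kind of scenario where a stage is visibly lopsided yet the rule leaves it alone, so the fix is either lowering the threshold setting or salting manually. The rule also only applies to joins, which is why skewed aggregations still need manual handling.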

Delta and Iceberg MERGE INTO patterns. The practice catalog includes problems where the destination is a Delta table or an Iceberg table and the operation is an upsert with MERGE INTO. The PySpark code uses DeltaTable.forPath(spark, path).merge(...) for Delta or spark.sql("MERGE INTO ...") for Iceberg. Common bugs: forgetting to handle the DELETE case (events with op_type = 'DELETE' that should remove rows), and using REPLACE semantics (overwriting the stored value) when ADD semantics (adding the incoming delta to it) are needed for late-arriving aggregates.
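Both bugs named above come down to how the MERGE clauses are written, so a sketch of the statement itself may help. The helper below only builds the SQL text that would be passed to spark.sql(...); it does not touch Spark. The function build_merge_sql, the table names user_daily_agg and updates, and the columns last_seen and event_count are hypothetical; op_type comes from the description above.

```python
from typing import List


def build_merge_sql(
    target: str,
    source: str,
    key: str,
    replace_cols: List[str],
    additive_cols: List[str],
) -> str:
    """Build a MERGE INTO statement with DELETE handling and additive updates."""
    # REPLACE semantics: overwrite the stored value with the incoming one.
    set_clauses = [f"t.{c} = s.{c}" for c in replace_cols]
    # ADD semantics: add the late-arriving delta to the stored aggregate.
    set_clauses += [f"t.{c} = t.{c} + s.{c}" for c in additive_cols]

    cols = [key, *replace_cols, *additive_cols]
    insert_cols = ", ".join(cols)
    insert_vals = ", ".join(f"s.{c}" for c in cols)

    return "\n".join([
        f"MERGE INTO {target} t",
        f"USING {source} s",
        f"ON t.{key} = s.{key}",
        # The conditional DELETE must come before the unconditional UPDATE.
        "WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE",
        f"WHEN MATCHED THEN UPDATE SET {', '.join(set_clauses)}",
        # Do not insert rows whose only event is a DELETE.
        f"WHEN NOT MATCHED AND s.op_type <> 'DELETE' THEN "
        f"INSERT ({insert_cols}) VALUES ({insert_vals})",
    ])


sql = build_merge_sql(
    target="user_daily_agg",
    source="updates",
    key="user_id",
    replace_cols=["last_seen"],
    additive_cols=["event_count"],
)
```

Clause order matters because matched clauses are evaluated top to bottom: if the plain UPDATE came first, a DELETE event would update the row instead of removing it. The sketch also assumes the source has at most one row per key; deduplicate the batch first if that is not guaranteed.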

PySpark Practice Problems

Common questions

Does the PySpark sandbox run real Spark?
Yes. Every submission runs in a live Spark sandbox with public test cases (visible in the prompt) and hidden test cases (skew-engineered partitions, null-heavy columns, performance budgets). Submissions return per-test results with the specific failure for failing cases.
What Spark version does the sandbox use?
Spark 3.4+ with AQE (Adaptive Query Execution) enabled by default. PySpark 3.4+ supports the pandas-on-Spark API as well as the traditional DataFrame API. Iceberg and Delta connectors are available for MERGE INTO patterns.
Are these PySpark practice problems calibrated to specific companies?
Yes. The catalog includes tagged problems for Databricks (Delta MERGE patterns, Photon, Unity Catalog), Netflix (Iceberg, structured streaming, Spark UI deep dives), Uber (large-scale batch and Spark Streaming), Airbnb (Spark plus Druid, Airflow orchestration), DoorDash and Spotify (similar stacks). Filter by company tag to focus on the bar at your target.
How is skew engineered in the practice catalog?
Test data for relevant problems is generated with intentional skew: 5 percent of join keys have 95 percent of the rows, or one partition has 10x the median data volume. Naive solutions pass public tests on uniform data but fail hidden tests on skewed data. The fix is to identify the hot key and apply salt-and-rebalance.
What is the Spark UI reading question format?
Each Spark UI question presents a screenshot (Summary Metrics row, Tasks table, or Stage detail) with a specific anomaly. The data engineer identifies the cause (skew, under-parallelism, memory pressure, GC) and proposes the fix. The rubric scores cause identification and fix correctness. The practice catalog has 8 screenshots covering the main anomaly types.
How does AQE handling appear in practice problems?
Some problems are scenarios where AQE catches the issue automatically (skew at the join, runtime broadcast decision). Others are scenarios where AQE cannot help because the skew is at a stage boundary AQE does not optimize, and manual intervention is needed. The data engineer who knows when AQE works and when to override scores above the L4 bar.
Does the practice catalog include Delta and Iceberg patterns?
Yes. Problems include MERGE INTO on Delta and Iceberg tables for upsert, time travel for backfill, schema evolution, and partition pruning. PySpark code uses DeltaTable.forPath().merge or Iceberg's spark.sql MERGE INTO. Common bugs (forgetting DELETE handling, REPLACE-when-ADD-needed) are engineered into the hidden test cases.
How many PySpark practice problems should a data engineer solve before a Spark-first onsite?
Solving 40-60 problems with skew-engineered and Spark UI tests beats solving 100 textbook PySpark problems. Aim for fluency on the four core question shapes (broadcast join, sort-merge join, skew handling, SCD Type 2 merge) plus the 8 Spark UI screenshots, and do two timed mock PySpark coding rounds in the final week.