Question 1

Does the PySpark sandbox run real Spark?

Accepted Answer

Yes. Every submission runs in a live Spark sandbox against public test cases (visible in the prompt) and hidden test cases (skew-engineered partitions, null-heavy columns, performance budgets). Each submission returns per-test results, including the specific failure for every failing case.

Question 2

What Spark version does the sandbox use?

Accepted Answer

Spark 3.4+ with AQE (Adaptive Query Execution) enabled by default. PySpark 3.4+ supports the pandas-on-Spark API as well as the traditional DataFrame API. Iceberg and Delta connectors are available for MERGE INTO patterns.
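For reference, the AQE behavior mentioned above is controlled by a handful of Spark SQL settings. The sketch below lists three standard configuration keys and a small helper for applying them. The helper name apply_conf is hypothetical, and since the answer only states that AQE is enabled, the sub-feature values shown are assumptions rather than the sandbox's documented configuration.

```python
# Standard Spark SQL configuration keys for Adaptive Query Execution.
# The values are assumptions for illustration, not the sandbox's documented settings.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                     # master switch for AQE
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions at runtime
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
}


def apply_conf(builder, conf):
    """Apply each key/value to any builder exposing .config(key, value),
    such as SparkSession.builder (hypothetical helper)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, apply_conf(SparkSession.builder, AQE_CONF).getOrCreate() would start a session with these settings, and spark.conf.get("spark.sql.adaptive.enabled") reads a value back.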

Question 3

Are these PySpark practice problems calibrated to specific companies?

Accepted Answer

Yes. The catalog includes tagged problems for Databricks (Delta MERGE patterns, Photon, Unity Catalog), Netflix (Iceberg, Structured Streaming, Spark UI deep dives), Uber (large-scale batch and Spark Streaming), Airbnb (Spark plus Druid, Airflow orchestration), and DoorDash and Spotify (similar stacks). Filter by company tag to focus on the bar at your target company.

Question 4

How is skew engineered in the practice catalog?

Accepted Answer

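Test data for relevant problems is generated with intentional skew: 5 percent of join keys account for 95 percent of the rows, or one partition holds 10x the median data volume. Naive solutions pass the public tests on uniform data but fail the hidden tests on skewed data. The fix is to identify the hot key and apply salt-and-rebalance: add a salt to the hot key so its rows spread across several partitions, and replicate the matching rows on the other side of the join once per salt value.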
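The effect of salting can be seen without a cluster. The following is a minimal pure-Python sketch, not Spark code: it builds a key distribution matching the 5 percent / 95 percent description above, routes rows with a toy partitioner, and checks that the salted join returns the same rows as the plain one. The partitioner, the partition and salt counts, and every name in the block are assumptions chosen for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 16
HOT_KEYS = set(range(5))  # 5 of 100 keys (5 percent)

# Fact side: the 5 hot keys carry 9,500 of 10,000 rows (95 percent).
fact_keys = [k for k in range(5) for _ in range(1900)]
fact_keys += [5 + i % 95 for i in range(500)]


def salt_for(row_index, key):
    """Salt only the hot keys. Round-robin keeps the sketch deterministic;
    Spark code would more commonly draw a random salt per row."""
    return row_index % NUM_SALTS if key in HOT_KEYS else 0


def partition_sizes(routing_keys):
    """Rows per partition under a toy partitioner: (key + salt) % NUM_PARTITIONS."""
    counts = Counter((key + salt) % NUM_PARTITIONS for key, salt in routing_keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


plain_routing = [(k, 0) for k in fact_keys]
salted_routing = [(k, salt_for(i, k)) for i, k in enumerate(fact_keys)]

plain_sizes = partition_sizes(plain_routing)
salted_sizes = partition_sizes(salted_routing)

# Dimension side: each hot key is replicated once per salt value so that
# every salted fact row still finds its match.
dim = {k: f"name_{k}" for k in range(100)}
salted_dim = {
    (k, s): name
    for k, name in dim.items()
    for s in (range(NUM_SALTS) if k in HOT_KEYS else [0])
}

plain_join = [(k, dim[k]) for k in fact_keys]
salted_join = [(k, salted_dim[(k, s)]) for k, s in salted_routing]

print("largest partition, plain: ", max(plain_sizes))
print("largest partition, salted:", max(salted_sizes))
print("same join output:", plain_join == salted_join)
```

In PySpark the same idea is usually expressed by adding a salt column to the large side, exploding an array of salt values on the small side, and joining on the key plus the salt. Salting only the hot keys, as here, keeps the replication cost on the small side proportional to the number of hot keys rather than the whole table.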

Question 5

What is the Spark UI reading question format?

Accepted Answer

Each Spark UI question presents a screenshot (Summary Metrics row, Tasks table, or Stage detail) containing a specific anomaly. The data engineer identifies the cause (skew, under-parallelism, memory pressure, GC) and proposes the fix. The rubric scores cause identification and fix correctness. The practice catalog has 8 screenshots covering the main anomaly types.
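The reasoning behind reading such a screenshot can be written down as a few rules of thumb. The sketch below maps numbers one might read off a stage's summary to the four causes listed above. The StageSummary fields, the diagnose function, and both thresholds are assumptions for illustration; they are not the catalog's rubric and not values computed by Spark itself.

```python
from dataclasses import dataclass


@dataclass
class StageSummary:
    """Numbers read off a stage page (hypothetical container for this sketch)."""
    num_tasks: int
    total_cores: int
    median_duration_s: float
    max_duration_s: float
    median_gc_s: float
    spill_bytes: int


def diagnose(s, skew_ratio=5.0, gc_fraction=0.1):
    """Return likely causes in a fixed order. Thresholds are illustrative assumptions."""
    findings = []
    # A straggler far above the median task duration points at skew.
    if s.median_duration_s > 0 and s.max_duration_s / s.median_duration_s >= skew_ratio:
        findings.append("skew")
    # Fewer tasks than cores leaves part of the cluster idle.
    if s.num_tasks < s.total_cores:
        findings.append("under-parallelism")
    # Any spill means tasks ran out of execution memory.
    if s.spill_bytes > 0:
        findings.append("memory pressure")
    # A large share of task time spent in garbage collection.
    if s.median_duration_s > 0 and s.median_gc_s / s.median_duration_s >= gc_fraction:
        findings.append("GC")
    return findings
```

For example, a stage with 200 tasks on 64 cores, a 4-second median, and a 48-second maximum would be flagged for skew only, which suggests a fix such as salting or a broadcast join rather than adding executors.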

Question 6

How does AQE handling appear in practice problems?

Accepted Answer

In some problems, AQE catches the issue automatically (skew at the join, runtime broadcast decision). In others, AQE cannot help because the skew sits at a stage boundary that AQE does not optimize, and manual intervention is needed. A data engineer who knows when AQE works and when to override it scores above the L4 bar.
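One way to build intuition for when AQE steps in is to restate its skew-join detection rule in plain Python. As described in the Spark SQL tuning documentation, a shuffle partition counts as skewed when it is larger than a factor times the median partition size and also larger than an absolute byte threshold. The function name below is hypothetical, and the defaults (factor 5, threshold 256 MB) are the documented defaults as best I recall them, so check them against the Spark version in use. This sketch illustrates only the detection rule, not the stage-boundary scenario mentioned above.

```python
import statistics

MB = 1024 * 1024


def aqe_skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Indices of partitions that satisfy both skew conditions (hypothetical helper).

    factor mirrors spark.sql.adaptive.skewJoin.skewedPartitionFactor and
    threshold_bytes mirrors spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.
    """
    median = statistics.median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * median and size > threshold_bytes
    ]


# One partition at 10x the median and above the byte threshold: flagged.
large = aqe_skewed_partitions([64 * MB] * 7 + [640 * MB])

# Same 10x ratio but only 100 MB: below the byte threshold, so not flagged.
small = aqe_skewed_partitions([10 * MB] * 7 + [100 * MB])
```

The second case shows why a relative ratio alone does not guarantee that AQE will act: a partition can be 10x the median and still fall under the absolute threshold, leaving the fix to the engineer.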

Question 7

Does the practice catalog include Delta and Iceberg patterns?

Accepted Answer

Yes. Problems include MERGE INTO on Delta and Iceberg tables for upserts, time travel for backfills, schema evolution, and partition pruning. PySpark code uses DeltaTable.forPath(...).merge(...) for Delta or MERGE INTO through spark.sql for Iceberg. Common bugs (forgetting DELETE handling, REPLACE-when-ADD-needed) are engineered into the hidden test cases.
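To make the DELETE-handling bug concrete, here is a minimal sketch of an upsert with deletes. The SQL string shows the general shape of a MERGE INTO statement as it could be passed to spark.sql, and the function below it is a pure-Python model of the same semantics so the behavior can be checked without a cluster. The table names (target, updates), the columns (id, op, value), and the op codes are all hypothetical.

```python
# Shape of the statement; table and column names are hypothetical.
MERGE_SQL = """
MERGE INTO target AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT (id, value) VALUES (s.id, s.value)
"""


def merge(target, updates):
    """Pure-Python model of the statement above.

    target: dict mapping id -> value.
    updates: list of (id, op, value) with at most one row per id.
    Returns a new dict; the input is left unchanged.
    """
    result = dict(target)
    for row_id, op, value in updates:
        if row_id in result:
            if op == "D":
                del result[row_id]      # WHEN MATCHED AND op = 'D' THEN DELETE
            else:
                result[row_id] = value  # WHEN MATCHED THEN UPDATE
        elif op != "D":
            result[row_id] = value      # WHEN NOT MATCHED ... THEN INSERT
    return result


before = {1: "a", 2: "b"}
changes = [(1, "U", "a2"), (2, "D", None), (3, "I", "c"), (4, "D", None)]
after = merge(before, changes)
```

Dropping the first WHEN MATCHED clause reproduces the forgotten-DELETE bug: the row for id 2 would be overwritten by the delete marker's payload instead of being removed. The clause order matters because the first matching WHEN MATCHED branch wins.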

Question 8

How many PySpark practice problems should a data engineer solve before a Spark-first onsite?

Accepted Answer

Solving 40-60 problems with skew-engineered and Spark UI tests beats solving 100 textbook PySpark problems. Aim for fluency in the four core question shapes (broadcast join, sort-merge join, skew handling, SCD Type 2 merge) plus the 8 Spark UI screenshots. Finish with two timed mock PySpark coding rounds in the final week.

PySpark Practice Problems
