PySpark coding questions for data engineer interview prep. Open-ended write-the-PySpark problems shaped like the Spark-first data engineer coding round at Databricks, Netflix, Uber, Airbnb. DataFrame transformations and actions. Broadcast versus sort-merge join. Window functions in Spark. SCD Type 2 merge with Delta or Iceberg. Spark UI reading.

PySpark coding questions in a Spark-first data engineer interview are open-ended: the candidate gets a real data scenario, a couple of DataFrames with schemas, a business question, and 30 to 60 minutes to write working PySpark code. For example, the interviewer hands over two DataFrames (events with 800M rows, users with 2M rows) and asks "join these and compute the per-user engagement metric". The first scored moment is the join-strategy clarifying question (can we broadcast users? what is the cluster's executor memory?). The second is the actual code. The third is reading the resulting Spark UI to identify any anomalies.

Four PySpark coding question shapes make up about 80 percent of Spark-first data engineer interview rounds. Join shape: a large fact table with a smaller dimension table; decide broadcast versus sort-merge and defend the threshold. Aggregation shape: a GROUP BY with potential skew; identify the hot key, then salt and rebalance. Window function shape: ROW_NUMBER for dedup or LAG for sessionization, expressed with the PySpark Window class. Merge shape: an SCD Type 2 upsert on a Delta or Iceberg destination table.

The PySpark Window API differs from SQL window functions in syntax, not semantics. Import the class with from pyspark.sql.window import Window (col and row_number come from pyspark.sql.functions). Build the spec with windowSpec = Window.partitionBy("user_id").orderBy(col("updated_at").desc()), then apply df.withColumn("rn", row_number().over(windowSpec)).filter(col("rn") == 1) for dedup-latest. The frame clause is set with rowsBetween or rangeBetween. A data engineer fluent in SQL window functions can translate to PySpark Window directly; the syntactic differences are small.

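The SCD Type 2 merge in PySpark uses MERGE INTO on Delta or Iceberg. With Delta, import the class with from delta.tables import DeltaTable, load the target with deltaTable = DeltaTable.forPath(spark, "/path/to/table"), then run deltaTable.alias("d").merge(stagingDf.alias("s"), "d.customer_id = s.customer_id AND d.is_current = true").whenMatchedUpdate(condition = "d.address != s.address", set = {"is_current": "false", "effective_to": "current_timestamp()"}).whenNotMatchedInsert(...).execute(). This statement expires changed rows and inserts rows for new keys; the new current version of a changed key still has to be inserted, typically by staging each changed row a second time with a merge key that cannot match. Common bug: omitting the is_current = true predicate from the merge condition lets staging rows match historical versions as well, so already-expired rows are re-expired and their effective_to is overwritten on every run.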
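The merge chain above can be wrapped in a function, shown here as a sketch under assumptions. The target column names (customer_id, address, is_current, effective_from, effective_to) are hypothetical, and the insert values are one plausible way to fill the clause the text leaves elided. The function takes an existing DeltaTable, so it needs the delta-spark package only at the call site. It expires changed rows and inserts brand-new keys; it does not insert the new version of a changed key.

```python
def expire_and_insert(delta_table, staging_df):
    """Run a single-pass SCD Type 2 MERGE against a Delta table.

    delta_table: a delta.tables.DeltaTable, e.g. DeltaTable.forPath(spark, path)
    staging_df:  a DataFrame with customer_id and address columns (assumed schema)
    """
    (
        delta_table.alias("d")
        .merge(
            staging_df.alias("s"),
            # is_current = true restricts the match to the live version of each key.
            "d.customer_id = s.customer_id AND d.is_current = true",
        )
        .whenMatchedUpdate(
            # Only expire the live row when the tracked attribute actually changed.
            condition="d.address != s.address",
            set={"is_current": "false", "effective_to": "current_timestamp()"},
        )
        .whenNotMatchedInsert(
            # Hypothetical column mapping for keys not yet present in the target.
            values={
                "customer_id": "s.customer_id",
                "address": "s.address",
                "is_current": "true",
                "effective_from": "current_timestamp()",
                "effective_to": "null",
            }
        )
        .execute()
    )
```

A typical call would be expire_and_insert(DeltaTable.forPath(spark, "/path/to/table"), stagingDf), followed by a second step that inserts the new current rows for changed keys.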

Spark UI reading questions present a screenshot and ask the data engineer to identify the cause and propose the fix. Skew shows as one task whose duration and shuffle read are roughly 10x the median; the fix is salt-and-rebalance. Under-parallelism shows as a task count lower than the number of available executor cores, leaving cores idle; the fix is repartition or spark.sql.shuffle.partitions tuning. Memory pressure shows as spill greater than 0; the fix is more executor memory or smaller partitions. GC pressure shows as GC time greater than 10 percent of task time; the fix is a larger heap with G1GC.
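These rules of thumb can be written down as a small checklist function. The sketch below is plain Python and does not read the Spark UI; the StageSummary container and its field names are hypothetical, and under-parallelism is checked here as task count below total executor cores. The thresholds (10x median, spill above zero, GC above 10 percent) are the ones stated in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageSummary:
    """Hypothetical container for numbers read off a stage page by hand."""
    median_duration_s: float
    max_duration_s: float
    median_shuffle_read_mb: float
    max_shuffle_read_mb: float
    max_spill_mb: float
    gc_time_s: float
    total_task_time_s: float
    task_count: int
    total_cores: int


def diagnose(s: StageSummary) -> List[str]:
    """Return the symptoms that the summary numbers point to."""
    findings = []
    slow_task = s.median_duration_s > 0 and s.max_duration_s >= 10 * s.median_duration_s
    fat_task = (
        s.median_shuffle_read_mb > 0
        and s.max_shuffle_read_mb >= 10 * s.median_shuffle_read_mb
    )
    if slow_task or fat_task:
        findings.append("skew")  # candidate fix: salt and rebalance
    if s.task_count < s.total_cores:
        findings.append("under-parallelism")  # candidate fix: repartition
    if s.max_spill_mb > 0:
        findings.append("memory pressure")  # candidate fix: more memory, smaller partitions
    if s.total_task_time_s > 0 and s.gc_time_s / s.total_task_time_s > 0.10:
        findings.append("gc pressure")  # candidate fix: larger heap, G1GC
    return findings


skewed = StageSummary(2, 40, 100, 1500, 0, 1, 100, 200, 64)
starved = StageSummary(2, 3, 100, 110, 512, 15, 100, 8, 64)
print(diagnose(skewed))
print(diagnose(starved))
```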

Companies whose data engineer interviews emphasize PySpark coding: Databricks (Spark itself, Photon, Delta), Netflix (Iceberg and Mantis), Uber (large-scale batch and Spark Streaming), Airbnb (Spark plus Druid), DoorDash, Spotify, Capital One.

PySpark Coding Questions

Open-ended PySpark coding questions for data engineer interview prep.

Common questions

What is the format of a PySpark coding question in a Spark-first interview?
Open-ended: real data scenario, two DataFrames with schema, business question, 30-60 minutes. The data engineer asks clarifying questions (can we broadcast? cluster memory?), proposes an approach, writes the PySpark code, and discusses the resulting Spark UI for anomalies.
How does a PySpark coding question differ from a SQL coding question?
PySpark questions emphasize the join strategy decision (broadcast vs sort-merge), the partition strategy, and Spark UI reading. SQL questions emphasize query correctness and edge case handling. Both share the underlying patterns (join, aggregation, window functions); the PySpark version also tests cluster awareness and optimization.
What is the PySpark Window API?
A class for building window specifications, imported with from pyspark.sql.window import Window. Build the spec with windowSpec = Window.partitionBy('user_id').orderBy(col('updated_at').desc()), then apply df.withColumn('rn', row_number().over(windowSpec)).filter(col('rn') == 1) for dedup-latest. The frame clause is set with rowsBetween or rangeBetween. A SQL window function translates directly to PySpark Window with small syntactic differences.
How does a data engineer implement SCD Type 2 merge in PySpark?
With Delta: deltaTable.alias('d').merge(stagingDf.alias('s'), 'd.pk = s.pk AND d.is_current = true').whenMatchedUpdate(set = {'is_current': 'false', 'effective_to': 'current_timestamp()'}).whenNotMatchedInsert(...).execute(). With Iceberg: spark.sql('MERGE INTO target t USING staging s ON ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...'). The common bug is forgetting is_current = true in the merge (ON) condition, which lets staging rows match already-expired historical rows.
What is the most common bug in PySpark interview code?
Leaving ties unresolved in the dedup logic. A naive dedup with ROW_NUMBER ORDER BY updated_at DESC works on clean data but becomes non-deterministic when two events for the same key share the same updated_at (for example, events written in the same millisecond of a batch). Add a composite tiebreaker (event_id DESC or source ASC). Hidden test cases often catch this.
How does a data engineer read a Spark UI screenshot in an interview?
Look at the Summary Metrics table on the stage detail page. Max duration 10x the median indicates task skew. Max shuffle read 10x the median indicates data skew. Max spill greater than 0 indicates memory pressure. GC time greater than 10 percent of task time indicates garbage collection pressure. The Tasks table sorted by duration descending shows the culprit partition.
What is the difference between PySpark DataFrame API and SparkSQL?
The PySpark DataFrame API uses Python methods on DataFrame objects (df.filter, df.groupBy, df.join, df.withColumn). Spark SQL uses string SQL queries (spark.sql('SELECT ... FROM ... WHERE ...')). Both go through the Catalyst optimizer, so equivalent queries generally compile to the same physical plan. Interview rounds usually allow either; pick the one you are faster in. Mention the equivalence to show fluency.
When does a data engineer use RDDs versus DataFrames in PySpark?
DataFrames for almost everything in 2026. RDDs only when you need fine-grained control over partitioning, when working with non-tabular data that does not fit a schema, or when interfacing with legacy code. Most modern PySpark interview rounds expect DataFrame API; mentioning RDDs as the underlying abstraction shows depth but using them as your primary tool signals out-of-date practice.