Question 1

What is the format of a PySpark coding question in a Spark-first interview?

Accepted Answer

The format is open-ended: a realistic data scenario, two DataFrames with their schemas, and a business question, with 30 to 60 minutes to work through it. The data engineer asks clarifying questions (Can the smaller table be broadcast? How much memory does the cluster have?), proposes an approach, writes the PySpark code, and then reviews the resulting Spark UI for anomalies.

Question 2

How does a PySpark coding question differ from a SQL coding question?

Accepted Answer

PySpark questions emphasize the join strategy decision (broadcast vs sort-merge), the partition strategy, and reading the Spark UI. SQL questions emphasize query correctness and edge-case handling. Both share the underlying patterns (joins, aggregations, window functions); the PySpark version additionally tests cluster awareness and optimization.
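As a rough illustration of the join strategy decision, the sketch below hints Spark to broadcast the smaller side of a join. The function name and the DataFrames passed to it are hypothetical; only the broadcast hint from pyspark.sql.functions is standard API.

```python
def join_with_broadcast(large_df, small_df, key):
    """Inner-join large_df to small_df, hinting Spark to broadcast small_df.

    Assumes small_df comfortably fits in executor memory; without the hint,
    Spark decides on its own, partly based on the
    spark.sql.autoBroadcastJoinThreshold setting.
    """
    # Imported here so the snippet loads even where PySpark is not installed.
    from pyspark.sql import functions as F

    return large_df.join(F.broadcast(small_df), on=key, how="inner")
```

Calling explain() on the result and looking for BroadcastHashJoin rather than SortMergeJoin in the physical plan is a quick way to confirm which strategy Spark picked.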

Question 3

What is the PySpark Window API?

Accepted Answer

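The Window API is imported with from pyspark.sql.window import Window; col and row_number come from pyspark.sql.functions. Define a spec, for example windowSpec = Window.partitionBy('user_id').orderBy(col('updated_at').desc()), then apply a function over it: df.withColumn('rn', row_number().over(windowSpec)).filter(col('rn') == 1) keeps the latest row per user (dedup-latest). The frame clause is set with rowsBetween or rangeBetween. A SQL window function translates directly to a PySpark Window with small syntactic differences.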
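A minimal sketch of both uses follows: dedup-latest with row_number, and a running total with an explicit rowsBetween frame. The user_id and updated_at columns come from the answer above; the amount column and both function names are assumptions made for illustration.

```python
def latest_per_user(df):
    """Keep only the most recent row per user_id (dedup-latest)."""
    # Imported here so the snippet loads even where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("user_id").orderBy(F.col("updated_at").desc())
    return (
        df.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )


def with_running_total(df):
    """Add a per-user running sum of the (assumed) amount column."""
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Frame clause: every row from the start of the partition up to this row.
    w = (
        Window.partitionBy("user_id")
        .orderBy("updated_at")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    return df.withColumn("running_amount", F.sum("amount").over(w))
```

The SQL equivalents would be ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) and SUM(amount) OVER (PARTITION BY user_id ORDER BY updated_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).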

Question 4

How does a data engineer implement SCD Type 2 merge in PySpark?

Accepted Answer

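With Delta Lake: deltaTable.alias('d').merge(stagingDf.alias('s'), 'd.pk = s.pk AND d.is_current = true').whenMatchedUpdate(set = {'is_current': 'false', 'effective_to': 'current_timestamp()'}).whenNotMatchedInsert(...).execute(). With Iceberg: spark.sql('MERGE INTO target t USING staging s ON ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...'). Two caveats apply. First, a single MERGE pass matches each changed key once: it closes the old row, but the not-matched branch only inserts keys that have no current row, so the new version of a changed key still has to be inserted, either by staging each changed row twice (the second copy with a NULL merge key so it falls through to the insert branch) or by appending the new versions in a second step. Second, the matched update should be conditioned on a tracked attribute actually changing; otherwise unchanged rows are closed too. Forgetting is_current = true in the match condition is a common bug: the update then also hits already-closed historical rows.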
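One way to get a complete Type 2 update is a two-step variant, sketched below for Delta Lake. It is an illustration under stated assumptions rather than a canonical recipe: the function name, the single tracked_col parameter, and the effective_from column are hypothetical, the staging DataFrame is assumed to hold one row per pk with the target's business columns, and the two steps are not atomic.

```python
def scd2_upsert(spark, target_name, staging_df, tracked_col):
    """Two-step SCD Type 2 sketch against a Delta table.

    Assumed target columns: pk, <tracked_col>, is_current,
    effective_from, effective_to.
    """
    # Imported here so the snippet loads without Spark or delta-spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    target = DeltaTable.forName(spark, target_name)

    # Step 1: close the current row of every key whose tracked value changed.
    # <=> is null-safe equality, so NULL-to-value changes are detected too.
    (
        target.alias("d")
        .merge(staging_df.alias("s"), "d.pk = s.pk AND d.is_current = true")
        .whenMatchedUpdate(
            condition=f"NOT (d.{tracked_col} <=> s.{tracked_col})",
            set={"is_current": "false", "effective_to": "current_timestamp()"},
        )
        .execute()
    )

    # Step 2: after step 1 only unchanged keys still have a current row, so an
    # anti-join leaves brand-new keys plus the keys that were just closed.
    current_keys = spark.table(target_name).filter("is_current = true").select("pk")
    new_versions = (
        staging_df.join(current_keys, on="pk", how="left_anti")
        .withColumn("is_current", F.lit(True))
        .withColumn("effective_from", F.current_timestamp())
        .withColumn("effective_to", F.lit(None).cast("timestamp"))
    )
    new_versions.write.format("delta").mode("append").saveAsTable(target_name)
```

The single-statement alternative, staging each changed row twice with a NULL merge key, keeps the whole operation in one atomic MERGE at the cost of a more involved source query.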

Question 5

What is the most common bug in PySpark interview code?

Accepted Answer

Forgetting to handle timestamp ties in the dedup logic. A naive dedup with ROW_NUMBER ORDER BY updated_at DESC works when timestamps are unique but is non-deterministic when two events for the same key share the same updated_at (for example, events that arrive in the same batch within the same millisecond). Add a composite tiebreaker (event_id DESC or source ASC) so the ordering is total. Hidden test cases often catch this.
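A sketch of the tiebreaker fix, assuming a user_id key and a unique event_id column (both names are illustrative; any column that makes the ordering total would do):

```python
def dedup_latest_deterministic(df):
    """Keep one row per user_id; break updated_at ties on event_id."""
    # Imported here so the snippet loads even where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("user_id").orderBy(
        F.col("updated_at").desc(),
        F.col("event_id").desc(),  # tiebreaker: makes the ordering total
    )
    return (
        df.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )
```

Without the second sort key, which of the tied rows receives rn = 1 can change from run to run, depending on how the data happens to be partitioned and shuffled.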

Question 6

How does a data engineer read a Spark UI screenshot in an interview?

Accepted Answer

Look at the Summary Metrics row for the stage. As rules of thumb: max task duration around 10x the median = task skew. Max shuffle read around 10x the median = data skew. Max spill greater than 0 = memory pressure. GC time greater than 10 percent of task time = garbage collection pressure. The Tasks table sorted by duration descending shows the culprit partition.
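These rules of thumb can be written down as a small checklist in plain Python. The StageSummary fields are hypothetical stand-ins for numbers read off the Summary Metrics row, not a Spark API, and the thresholds are simply the ones quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageSummary:
    """Numbers copied by hand from a stage's Summary Metrics (hypothetical)."""
    median_duration_s: float
    max_duration_s: float
    median_shuffle_read_mb: float
    max_shuffle_read_mb: float
    max_spill_mb: float
    gc_time_s: float
    total_task_time_s: float


def diagnose(stage: StageSummary, skew_ratio: float = 10.0,
             gc_fraction: float = 0.10) -> List[str]:
    """Return the rule-of-thumb findings that apply to this stage."""
    findings = []
    if (stage.median_duration_s > 0
            and stage.max_duration_s >= skew_ratio * stage.median_duration_s):
        findings.append("task skew")
    if (stage.median_shuffle_read_mb > 0
            and stage.max_shuffle_read_mb >= skew_ratio * stage.median_shuffle_read_mb):
        findings.append("data skew")
    if stage.max_spill_mb > 0:
        findings.append("memory pressure")
    if (stage.total_task_time_s > 0
            and stage.gc_time_s > gc_fraction * stage.total_task_time_s):
        findings.append("GC pressure")
    return findings
```

In an interview the same checks are done by eye; the value of spelling them out is having a fixed order to walk through under time pressure.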

Question 7

What is the difference between the PySpark DataFrame API and Spark SQL?

Accepted Answer

The PySpark DataFrame API uses Python methods on DataFrame objects (df.filter, df.groupBy, df.join, df.withColumn). Spark SQL uses SQL query strings (spark.sql('SELECT ... FROM ... WHERE ...')). Both go through the Catalyst optimizer, so equivalent queries compile to the same physical plan. Interview rounds usually allow either; pick the one you are faster in. Mention the equivalence to show fluency.
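To make the equivalence concrete, here is the same aggregation written both ways. The events view name, the user_id and amount columns, and both function names are assumptions made for this sketch.

```python
def totals_dataframe(df, min_amount):
    """DataFrame API: per-user total of amounts at or above min_amount."""
    # Imported here so the snippet loads even where PySpark is not installed.
    from pyspark.sql import functions as F

    return (
        df.filter(F.col("amount") >= min_amount)
        .groupBy("user_id")
        .agg(F.sum("amount").alias("total"))
        .orderBy("user_id")
    )


def totals_sql(spark, df, min_amount):
    """Spark SQL: the same query against a temporary view."""
    df.createOrReplaceTempView("events")  # hypothetical view name
    # float() keeps the interpolated value numeric rather than arbitrary text.
    return spark.sql(
        "SELECT user_id, SUM(amount) AS total FROM events "
        f"WHERE amount >= {float(min_amount)} "
        "GROUP BY user_id ORDER BY user_id"
    )
```

Calling explain() on both results is a simple way to compare the physical plans side by side.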

Question 8

When does a data engineer use RDDs versus DataFrames in PySpark?

Accepted Answer

DataFrames for almost everything in 2026. RDDs only when you need fine-grained control over partitioning, when working with non-tabular data that does not fit a schema, or when interfacing with legacy code. Most modern PySpark interview rounds expect the DataFrame API; mentioning RDDs as the underlying abstraction shows depth, but using them as your primary tool signals out-of-date practice.

PySpark Coding Questions
