Data Engineering Interview Prep

PySpark Interview Questions for Data Engineers (2026)

PySpark is the most common way data engineers interact with Spark. Interviewers test DataFrame fluency, join strategies, shuffle awareness, and the ability to diagnose slow jobs from the Spark UI.

Written by engineers who have conducted hundreds of data engineering interviews at companies running PySpark in production.

What Interviewers Expect

PySpark questions separate candidates who have written real pipelines from those who only completed a tutorial. Interviewers do not care whether you memorize every function signature. They care whether you understand how data moves through a distributed system and whether you can make it move efficiently.

Junior candidates should be able to write basic DataFrame transformations: filters, joins, group-by aggregations, and window functions. You should explain the difference between transformations and actions.

Mid-level candidates need to discuss partitioning strategies, broadcast joins, and why Python UDFs are slow. You should be able to read a Spark plan and identify shuffle boundaries.

Senior candidates must diagnose performance problems: data skew, small file problems, memory pressure, and speculative execution. You should articulate tradeoffs between different join strategies and explain when AQE helps and when it does not.

Core Concepts Interviewers Test

DataFrame API vs RDD

The DataFrame API is the primary interface for PySpark in production. RDDs still matter for understanding the execution model, but interviewers want to see you default to DataFrames. Know when RDDs are appropriate: custom partitioners, unstructured data, or low-level control.

Transformations vs Actions

Transformations are lazy. Actions trigger execution. Interviewers will ask you to trace a chain of operations and identify where computation actually happens. Common trap: calling .count() inside a loop forces a full DAG evaluation each time.

Partitioning Strategy

repartition() triggers a full shuffle. coalesce() reduces partitions without a shuffle. Hash partitioning distributes data by key. Range partitioning is useful for sorted output. Interviewers test whether you understand the cost of each approach.

Broadcast Joins

When one side of a join fits in memory, broadcasting it to every executor avoids a shuffle. The threshold is controlled by spark.sql.autoBroadcastJoinThreshold (10MB by default). Interviewers ask what happens when you broadcast a table that is too large: the table is collected to the driver before being shipped out, so an oversized broadcast can OOM the driver as well as the executors.

UDFs and Performance

Python UDFs serialize data between the JVM and Python, creating massive overhead. Pandas UDFs (vectorized) are often 10x to 100x faster because they operate on Arrow batches. Interviewers want to hear that you avoid UDFs entirely when a built-in function exists.

Data Skew Handling

Skew causes one partition to take 100x longer than the others. Standard fixes: salting the hot keys, broadcasting the smaller table so the skewed side never shuffles, or repartitioning before the join. Interviewers test whether you can diagnose skew from Spark UI symptoms: one straggler task while the rest of the stage finishes quickly.

PySpark Interview Questions with Guidance

Q1

Explain the difference between DataFrame.select() and DataFrame.withColumn(). When would you use each?

A strong answer includes:

select() projects specific columns and can rename or transform them. withColumn() adds or replaces a single column while keeping all others. Use select() when you want to control exactly which columns appear in the output. Use withColumn() for adding derived columns. A strong answer notes that chaining many withColumn() calls is inefficient because each creates a new projection; select() with multiple expressions is better.

Q2

A PySpark job takes 4 hours. The Spark UI shows one task in the final stage taking 3.5 hours while all others finish in 2 minutes. What is happening and how do you fix it?

A strong answer includes:

This is data skew. One partition holds disproportionately more data than others. Fixes include salting the join key (appending a random suffix, joining on the salted key, then aggregating), broadcasting the smaller table, or using AQE skew join optimization (spark.sql.adaptive.skewJoin.enabled). A strong answer mentions checking the partition sizes in the Spark UI to confirm the diagnosis before applying a fix.

Q3

What is the difference between repartition(n) and coalesce(n)? When would coalesce cause problems?

A strong answer includes:

repartition(n) performs a full shuffle to create exactly n partitions with roughly equal data. coalesce(n) merges existing partitions without a shuffle, so it can only reduce partition count. coalesce causes problems when reducing from many partitions to very few: some executors end up with much more data than others, creating skew. It also cannot increase partition count.

Q4

You have a Python UDF that runs a regex on every row. The job is slow. Walk through your optimization approach.

A strong answer includes:

First, check if a built-in function can replace the UDF (regexp_extract, regexp_replace). If not, convert to a Pandas UDF (vectorized) to process Arrow batches instead of row-by-row serialization. If the regex is complex, consider precompiling the pattern outside the UDF. A strong answer quantifies the cost: Python UDFs force serialization between JVM and Python for every row, which can be 10x to 100x slower than native Spark functions.

Q5

Explain how PySpark window functions work. Write a query that ranks employees by salary within each department.

A strong answer includes:

Window functions operate on a partition of rows defined by a WindowSpec. You define partitionBy (grouping), orderBy (sorting), and optionally rowsBetween or rangeBetween. The computation runs without collapsing rows. A strong answer includes: from pyspark.sql.window import Window; w = Window.partitionBy('dept').orderBy(F.desc('salary')); df.withColumn('rank', F.rank().over(w)).

Q6

What happens when you cache a DataFrame in PySpark? Where is it stored and when should you avoid caching?

A strong answer includes:

cache() marks the DataFrame for storage in executor memory (MEMORY_AND_DISK by default). The data is materialized on the first action and reused by subsequent actions. Avoid caching when the DataFrame is used only once, when it is very large relative to available memory, or when the upstream computation is cheap. A strong answer mentions unpersist() to free memory explicitly and notes that persist() accepts other storage levels (such as DISK_ONLY) when memory is tight.

Q7

How do you handle late-arriving data in a PySpark structured streaming job?

A strong answer includes:

Use watermarks. withWatermark('event_time', '10 minutes') tells Spark to drop data older than 10 minutes past the latest event time seen. This allows stateful aggregations to discard stale state. A strong answer discusses the tradeoff: a shorter watermark frees memory faster but drops more late data; a longer watermark retains more state but uses more memory. Mention that output mode (append vs update vs complete) affects when results are emitted.

Q8

You need to join a 500GB fact table with a 50MB dimension table in PySpark. Describe your approach.

A strong answer includes:

Broadcast the 50MB dimension table. Use F.broadcast(dim_df) or rely on autoBroadcastJoinThreshold. This sends the small table to every executor and avoids shuffling the 500GB fact table entirely. A strong answer mentions verifying the dimension table's actual size (the in-memory size after deserialization can be much larger than the on-disk size), and notes the failure modes as it grows: relying on the auto threshold means Spark silently falls back to a sort-merge join with a full shuffle, while an explicit broadcast hint keeps forcing the broadcast and risks OOM instead.

Q9

What is the Catalyst optimizer and how does it affect PySpark code?

A strong answer includes:

Catalyst is Spark SQL's query optimizer. It converts logical plans to optimized physical plans through rule-based and cost-based optimization. It handles predicate pushdown, column pruning, join reordering, and constant folding. PySpark DataFrame operations go through Catalyst; RDD operations do not. This is a key reason DataFrames outperform RDDs for structured data. A strong answer notes that Catalyst cannot optimize Python UDFs because it treats them as black boxes.

Q10

Explain the difference between map(), mapPartitions(), and foreach() in PySpark. When would you choose each?

A strong answer includes:

map() applies a function to each row individually. mapPartitions() applies a function to an entire partition iterator, which is more efficient for operations that have setup cost (database connections, model loading). foreach() is an action that applies a function for side effects without returning data. Choose mapPartitions() when you need to amortize expensive initialization across rows. Choose foreach() or foreachPartition() for writing to external systems.

Worked Example: Salting a Skewed Join

When one join key dominates the data (e.g., 80% of events belong to one customer), the partition handling that key becomes a bottleneck. Salting distributes the hot key across multiple partitions.

from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Add salt column to the large (fact) table
fact_salted = fact_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Explode the small (dimension) table across salt values
dim_exploded = dim_df.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

# Join on original key + salt
result = fact_salted.join(
    dim_exploded,
    on=["customer_id", "salt"],
    how="inner"
).drop("salt")

The fact table gets a random salt from 0 to 9. The dimension table is duplicated 10 times, once per salt value. The join now distributes the hot key across 10 partitions instead of one. The cost: 10x duplication of the small table. The benefit: near linear speedup on skewed keys.

Common Mistakes in PySpark Interviews

Using collect() on large DataFrames, pulling millions of rows to the driver and causing OOM errors

Writing Python UDFs for operations that have built-in Spark equivalents like regexp_extract or when/otherwise

Ignoring partition count after filtering, leaving thousands of near-empty partitions that create task scheduling overhead

Caching DataFrames that are only used once, wasting executor memory

Not understanding that .show() and .count() are actions that trigger full DAG evaluation

Confusing DataFrame.groupBy().agg() with RDD.groupByKey(), which loads all values for a key into memory

PySpark Interview Questions FAQ

Is PySpark tested differently than Scala Spark in interviews?

Yes. PySpark interviews focus on the DataFrame API, Python UDF performance tradeoffs, and integration with Python libraries. Scala Spark interviews go deeper into type safety, RDD internals, and JVM tuning. Most data engineering roles use PySpark, so prepare for Python-first questions unless the job description specifies Scala.

Do I need to memorize PySpark function signatures for interviews?

Know the common ones by heart: F.col(), F.lit(), F.when(), F.coalesce(), F.regexp_extract(), Window.partitionBy().orderBy(). You do not need to memorize every function, but fumbling through basic column operations signals inexperience. Practice writing PySpark without autocomplete.

How deep should I go on Spark internals for a PySpark interview?

Understand the driver/executor model, how shuffles work, what a stage boundary is, and how to read the Spark UI. You do not need to know Netty internals or scheduler implementation details. Interviewers care that you can diagnose performance problems, not that you can recite source code.

Should I practice PySpark on my laptop or on a cluster?

Start on your laptop with local mode (spark.master = local[*]). This lets you iterate quickly. Once comfortable, practice on a cluster (Databricks Community Edition is free) to understand executor behavior, memory limits, and real shuffle costs. Interview questions are testable locally, but cluster experience helps you answer architecture questions confidently.

What PySpark version should I study for interviews in 2026?

Focus on Spark 3.4+ features: Adaptive Query Execution (AQE), dynamic partition pruning, and Pandas UDFs with Arrow. Most companies run Spark 3.x in production. If a company uses Databricks, also study Delta Lake integration and Unity Catalog.

Practice PySpark Interview Questions

Write real PySpark code. Understand the execution plan. Walk into your interview knowing exactly how data moves through the cluster.