PySpark interview questions pulled from data engineer interview reports at Spark-first companies: Databricks, Netflix, Uber, Airbnb, DoorDash, Spotify. Topics covered: DataFrame fluency, join strategy selection (broadcast versus sort-merge), skew diagnosis with salt-and-rebalance, and Spark UI screenshot reading at L5+.

PySpark gets a dedicated Python-on-Spark coding round in roughly 30 percent of data engineer interview loops where Spark is in production. Spark-first companies all run a 45-to-60-minute PySpark coding round in addition to the SQL round: Databricks (where Spark was created), Netflix (hundreds of thousands of Spark jobs per day, Iceberg, Mantis), Uber (Spark plus Hive plus Presto stack, Marmaray for ingest), Airbnb (Spark plus Airflow plus Druid), DoorDash, Spotify, Capital One, Comcast. The bar tests four skills.

DataFrame fluency: the SQL-equivalent operations expressed as DataFrame transformations. df.filter(...), df.groupBy(...).agg(...), df.join(...), and window functions applied with Window.partitionBy(...).orderBy(...) and Column.over(...) (there is no df.window method). The data engineer should be fluent enough to translate SQL to DataFrame and back without thinking. Common stumbling block: the difference between transformations (lazy, return a new DataFrame) and actions (eager, return a value or write data). collect() pulls all rows to the driver and OOMs on large data; show() prints the first 20 rows by default; count() forces computation but returns just a number. Senior interviewers fish for this.

Join strategy selection. Broadcast join for small-and-large: Spark collects the small side on the driver and replicates it to every executor, so the large side is not shuffled. It is fast when the small side fits comfortably in driver and executor memory; the automatic threshold defaults to 10MB and is configurable to 100MB+. Sort-merge join for large-and-large (shuffle both sides by the join key, sort within partitions, merge). The Spark optimizer picks broadcast when a side's estimated size is below spark.sql.autoBroadcastJoinThreshold; you can force it with the broadcast() hint or disable automatic broadcast by setting the threshold to -1. AQE (Adaptive Query Execution) in Spark 3.0+ adjusts at runtime based on actual statistics from completed stages.

Skew diagnosis. Identify hot keys with df.groupBy(key).count().orderBy(desc("count")).limit(20). If the top key has 10x+ the median count, salt. On the large side, append a suffix in 0..N-1 that varies across rows of the same key, for example CONCAT(key, '_', FLOOR(RAND() * 8)); a suffix derived only from the key itself would leave every hot row on the same partition. On the small side, replicate each row N times, once per suffix value. Then join on the salted key, aggregate by salted key, unsalt, and re-aggregate. Trade-off is N-fold replication of the small side plus an extra aggregation pass versus the original one-task-doing-all-work bottleneck. In Spark 3.0+ AQE skew-join handling is automatic for sort-merge joins where the skew is detected at runtime; the data engineer should know what AQE does and when to override.

Spark UI screenshot reading is the senior-versus-mid signal. Mid-level data engineer candidates can write the DataFrame code but cannot read the UI when something is slow. Senior candidates look at the stage detail first: number of tasks (should match partition count), task duration distribution (median vs p99), shuffle read and write per task (skew shows as one task with 10x the shuffle of others), and spill (memory pressure indicator). A typical interview question: screenshot shows 1 task at 8x median duration with 10x shuffle read. Identify the cause (skew on the join key) and propose the fix (salt-and-rebalance).

Companies whose data engineer interviews emphasize Spark heavily: Databricks (Spark itself, Photon engine, Delta MERGE INTO, Unity Catalog), Netflix (Iceberg table format, structured streaming, Spark UI deep dives), Uber (large-scale batch Spark, Spark Streaming), Airbnb (Spark plus Druid, Airflow orchestration), DoorDash and Spotify (similar Spark+Kafka+Snowflake/BigQuery stacks).

PySpark Interview Questions

Live PySpark interview questions for data engineer roles, including skew diagnosis and Spark UI reading.

Common questions

What is the difference between broadcast join and sort-merge join in Spark?
Broadcast join replicates the smaller table to every executor and joins locally; the large side is not shuffled. Fast when the small side fits comfortably in driver and executor memory (the automatic threshold defaults to 10MB in Spark; raise to 100MB+ for typical interview scenarios). Sort-merge join shuffles both sides by the join key, sorts within partitions, then merges; required when both sides are large. The Spark optimizer picks broadcast when a side's estimated size is below spark.sql.autoBroadcastJoinThreshold; you can force it with the broadcast() hint or disable automatic broadcast by setting the threshold to -1.
How does a data engineer identify skew in a Spark job?
Open the Spark UI, go to the Stages tab, click the slow stage, look at the Tasks table sorted by duration. Skew shows as one or two tasks taking 5x to 10x the median time with corresponding 5x to 10x shuffle read or write. The Summary Metrics at the top of the stage page show min, 25th, 50th, 75th, max; a max that is 10x the median is the smoking gun.
What is the salt-and-rebalance technique for skewed joins?
Add a salt in the range 0 to N-1 to the join key on both sides (typically N=8 to 32). The large side gets one salt value per row, chosen randomly or from a column that varies within the hot key; the small side has its rows replicated N times (one per salt value). Joining on the salted key spreads the formerly-hot key across N partitions instead of one. Aggregate by salted key, then unsalt by stripping the suffix and re-aggregating. Trade-off: N-fold replication of the small side and an extra aggregation step versus the original one-task-doing-all-work bottleneck.
What is Adaptive Query Execution (AQE) in Spark?
AQE in Spark 3.0+ adjusts the query plan at runtime based on actual statistics from completed stages. Three main optimizations: skew-join handling (splits a skewed partition into multiple smaller ones automatically), join-strategy switching (uses runtime data sizes to convert a planned sort-merge join to a broadcast join when one side turns out to be below the broadcast threshold), and partition coalescing (combines small post-shuffle partitions). Enable with spark.sql.adaptive.enabled=true (default in 3.2+). Override when manually-optimized plans are needed.
When should a data engineer cache versus persist versus checkpoint a DataFrame?
cache() is persist() with the default storage level, which for DataFrames is MEMORY_AND_DISK: partitions are kept in memory and spill to disk when they do not fit. persist() lets you specify the storage level explicitly. checkpoint() writes to durable storage (a checkpoint directory must be set first) and truncates the lineage; required when the lineage is so long that recovery on executor failure would be more expensive than the checkpoint. Cache when reusing a DataFrame multiple times in the same job. Checkpoint when you have gone through 10+ wide transformations and want to break the lineage chain.
What is the difference between repartition and coalesce in Spark?
repartition(N) shuffles to N partitions; expensive but lets you increase or decrease and gives even distribution. coalesce(N) merges existing partitions without shuffle; cheap, but only decreases and can produce uneven distribution. Use repartition when going from few-and-skewed to many-and-balanced (typically before a write or a join). Use coalesce when combining small partitions before writing fewer output files.
How does a data engineer read a Spark UI screenshot showing stage skew?
Look at the Summary Metrics row. Max duration 10x the median equals task skew. Max shuffle read 10x the median equals data skew. Max spill memory greater than 0 equals memory pressure. The Tasks table sorted by duration descending shows which partition is the culprit. If one task is at 8x median time and 10x shuffle read, the join key has a hot value; salt and rebalance.
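The reading procedure above can be written down as a small checklist function. This is plain Python with no Spark dependency; diagnose_stage, its parameter names, and the default ratio of 5.0 (the low end of the 5x to 10x range mentioned earlier) are hypothetical choices for illustration, and the per-task numbers are made up to mirror the 8x duration, 10x shuffle read screenshot.

```python
from statistics import median


def diagnose_stage(durations_s, shuffle_read_mb, spill_mb, ratio=5.0):
    """Return findings from per-task metrics of one stage.

    ratio is the max-to-median multiple treated as skew (an assumed
    default; tune it to the job).
    """
    findings = []

    def is_outlier(values):
        mid = median(values)
        # Skip the ratio test when the median is zero.
        return mid > 0 and max(values) >= ratio * mid

    if is_outlier(durations_s):
        findings.append("task skew")
    if is_outlier(shuffle_read_mb):
        findings.append("data skew")
    if max(spill_mb) > 0:
        findings.append("memory pressure")
    return findings


# One task at 8x the median duration with 10x the shuffle read, no spill.
skewed = diagnose_stage(
    durations_s=[10, 11, 9, 10, 80],
    shuffle_read_mb=[100, 100, 100, 100, 1000],
    spill_mb=[0, 0, 0, 0, 0],
)

# Evenly balanced stage for comparison.
balanced = diagnose_stage(
    durations_s=[10, 11, 9, 10, 12],
    shuffle_read_mb=[100, 100, 100, 100, 110],
    spill_mb=[0, 0, 0, 0, 0],
)
```

For the first stage the function reports task skew and data skew together, which is the pattern that points at a hot join key.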
What is the difference between DataFrame and Dataset in PySpark?
PySpark has only DataFrames; Datasets are Scala/Java-only because the strongly-typed Dataset API depends on compile-time types, which a dynamically typed language like Python does not provide. In PySpark, the DataFrame API plus Python UDFs and pandas UDFs cover the workload. For type safety in interviews, mention Scala Datasets as the equivalent if the role is Scala-Spark.