Data Engineering Interview Prep

PySpark Interview Questions for Data Engineers (2026)

PySpark is the most common way data engineers interact with Spark. Interviewers test DataFrame fluency, join strategies, shuffle awareness, and the ability to diagnose slow jobs from the Spark UI.

Written by engineers who have conducted hundreds of data engineering interviews at companies running PySpark in production.

What Interviewers Expect

PySpark questions separate candidates who have written real pipelines from those who only completed a tutorial. Interviewers do not care whether you memorize every function signature. They care whether you understand how data moves through a distributed system and whether you can make it move efficiently.

Junior candidates should be able to write basic DataFrame transformations: filters, joins, group-by aggregations, and window functions. You should explain the difference between transformations and actions.

Mid-level candidates need to discuss partitioning strategies, broadcast joins, and why Python UDFs are slow. You should be able to read a Spark plan and identify shuffle boundaries.

Senior candidates must diagnose performance problems: data skew, small file problems, memory pressure, and speculative execution. You should articulate tradeoffs between different join strategies and explain when AQE helps and when it does not.

Core Concepts Interviewers Test

DataFrame API vs RDD

The DataFrame API is the primary interface for PySpark in production. RDDs still matter for understanding the execution model, but interviewers want to see you default to DataFrames. Know when RDDs are appropriate: custom partitioners, unstructured data, or low-level control.

Transformations vs Actions

Transformations are lazy. Actions trigger execution. Interviewers will ask you to trace a chain of operations and identify where computation actually happens. Common trap: calling .count() inside a loop forces a full DAG evaluation each time.

Partitioning Strategy

repartition() triggers a full shuffle. coalesce() reduces partitions without a shuffle. Hash partitioning distributes data by key. Range partitioning is useful for sorted output. Interviewers test whether you understand the cost of each approach.

Broadcast Joins

When one side of a join fits in memory, broadcasting it to every executor avoids a shuffle. The threshold is controlled by spark.sql.autoBroadcastJoinThreshold (10MB by default). Interviewers ask what happens when you broadcast a table that is too large: the table is collected to the driver before being shipped out, so an oversized broadcast can OOM the driver as well as the executors.

UDFs and Performance

Python UDFs serialize data between the JVM and Python, creating massive overhead. Pandas UDFs (vectorized) are often 10x to 100x faster because they operate on Arrow batches. Interviewers want to hear that you avoid UDFs entirely when a built-in function exists.

Data Skew Handling

Skew causes one partition to take 100x longer than the others. Standard fixes: salting the hot keys, broadcasting the smaller table so the skewed side never shuffles, or repartitioning before the join. Interviewers test whether you can diagnose skew from Spark UI symptoms: one straggler task while the rest of the stage finishes quickly.

PySpark Interview Questions with Guidance

Q1

Explain the difference between DataFrame.select() and DataFrame.withColumn(). When would you use each?

A strong answer includes:

select() projects specific columns and can rename or transform them. withColumn() adds or replaces a single column while keeping all others. Use select() when you want to control exactly which columns appear in the output. Use withColumn() for adding derived columns. A strong answer notes that chaining many withColumn() calls is inefficient because each creates a new projection; select() with multiple expressions is better.

Q2

A PySpark job takes 4 hours. The Spark UI shows one task in the final stage taking 3.5 hours while all others finish in 2 minutes. What is happening and how do you fix it?

A strong answer includes:

This is data skew. One partition holds disproportionately more data than others. Fixes include salting the join key (appending a random suffix, joining on the salted key, then aggregating), broadcasting the smaller table, or using AQE skew join optimization (spark.sql.adaptive.skewJoin.enabled). A strong answer mentions checking the partition sizes in the Spark UI to confirm the diagnosis before applying a fix.

Q3

What is the difference between repartition(n) and coalesce(n)? When would coalesce cause problems?

A strong answer includes:

repartition(n) performs a full shuffle to create exactly n partitions with roughly equal data. coalesce(n) merges existing partitions without a shuffle, so it can only reduce partition count. coalesce causes problems when reducing from many partitions to very few: some executors end up with much more data than others, creating skew. It also cannot increase partition count.

Q4

You have a Python UDF that runs a regex on every row. The job is slow. Walk through your optimization approach.

A strong answer includes:

First, check if a built-in function can replace the UDF (regexp_extract, regexp_replace). If not, convert to a Pandas UDF (vectorized) to process Arrow batches instead of row-by-row serialization. If the regex is complex, consider precompiling the pattern outside the UDF. A strong answer quantifies the cost: Python UDFs force serialization between JVM and Python for every row, which can be 10x to 100x slower than native Spark functions.

Q5

Explain how PySpark window functions work. Write a query that ranks employees by salary within each department.

A strong answer includes:

Window functions operate on a partition of rows defined by a WindowSpec. You define partitionBy (grouping), orderBy (sorting), and optionally rowsBetween or rangeBetween. The computation runs without collapsing rows. A strong answer includes: from pyspark.sql.window import Window; w = Window.partitionBy('dept').orderBy(F.desc('salary')); df.withColumn('rank', F.rank().over(w)).

Q6

What happens when you cache a DataFrame in PySpark? Where is it stored and when should you avoid caching?

A strong answer includes:

cache() marks the DataFrame for storage in executor memory (MEMORY_AND_DISK by default). The data is materialized on the first action and reused by subsequent actions. Avoid caching when the DataFrame is used only once, when it is very large relative to available memory, or when the upstream computation is cheap. A strong answer mentions unpersist() to free memory explicitly and notes that persist() accepts other storage levels (such as DISK_ONLY) when memory is tight.

Q7

How do you handle late-arriving data in a PySpark structured streaming job?

A strong answer includes:

Use watermarks. withWatermark('event_time', '10 minutes') tells Spark to drop data older than 10 minutes past the latest event time seen. This allows stateful aggregations to discard stale state. A strong answer discusses the tradeoff: a shorter watermark frees memory faster but drops more late data; a longer watermark retains more state but uses more memory. Mention that output mode (append vs update vs complete) affects when results are emitted.

Q8

You need to join a 500GB fact table with a 50MB dimension table in PySpark. Describe your approach.

A strong answer includes:

Broadcast the 50MB dimension table. Use F.broadcast(dim_df) or rely on autoBroadcastJoinThreshold. This sends the small table to every executor and avoids shuffling the 500GB fact table entirely. A strong answer mentions verifying the dimension table's actual size (the in-memory size after deserialization can be much larger than the on-disk size), and notes the failure modes as it grows: relying on the auto threshold means Spark silently falls back to a sort-merge join with a full shuffle, while an explicit broadcast hint keeps forcing the broadcast and risks OOM instead.

Q9

What is the Catalyst optimizer and how does it affect PySpark code?

A strong answer includes:

Catalyst is Spark SQL's query optimizer. It converts logical plans to optimized physical plans through rule-based and cost-based optimization. It handles predicate pushdown, column pruning, join reordering, and constant folding. PySpark DataFrame operations go through Catalyst; RDD operations do not. This is a key reason DataFrames outperform RDDs for structured data. A strong answer notes that Catalyst cannot optimize Python UDFs because it treats them as black boxes.

Q10

Explain the difference between map(), mapPartitions(), and foreach() in PySpark. When would you choose each?

A strong answer includes:

map() applies a function to each row individually. mapPartitions() applies a function to an entire partition iterator, which is more efficient for operations that have setup cost (database connections, model loading). foreach() is an action that applies a function for side effects without returning data. Choose mapPartitions() when you need to amortize expensive initialization across rows. Choose foreach() or foreachPartition() for writing to external systems.

Worked Example: Salting a Skewed Join

When one join key dominates the data (e.g., 80% of events belong to one customer), the partition handling that key becomes a bottleneck. Salting distributes the hot key across multiple partitions.

from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Add salt column to the large (fact) table
fact_salted = fact_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Explode the small (dimension) table across salt values
dim_exploded = dim_df.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

# Join on original key + salt
result = fact_salted.join(
    dim_exploded,
    on=["customer_id", "salt"],
    how="inner"
).drop("salt")

The fact table gets a random salt from 0 to 9. The dimension table is duplicated 10 times, once per salt value. The join now distributes the hot key across 10 partitions instead of one. The cost: 10x duplication of the small table. The benefit: near linear speedup on skewed keys.

Common Mistakes in PySpark Interviews

Using collect() on large DataFrames, pulling millions of rows to the driver and causing OOM errors

Writing Python UDFs for operations that have built-in Spark equivalents like regexp_extract or when/otherwise

Ignoring partition count after filtering, leaving thousands of near-empty partitions that create task scheduling overhead

Caching DataFrames that are only used once, wasting executor memory

Not understanding that .show() and .count() are actions that trigger full DAG evaluation

Confusing DataFrame.groupBy().agg() with RDD.groupByKey(), which loads all values for a key into memory

PySpark Interview Questions FAQ

Is PySpark tested differently than Scala Spark in interviews?

Yes. PySpark interviews focus on the DataFrame API, Python UDF performance tradeoffs, and integration with Python libraries. Scala Spark interviews go deeper into type safety, RDD internals, and JVM tuning. Most data engineering roles use PySpark, so prepare for Python-first questions unless the job description specifies Scala.

Do I need to memorize PySpark function signatures for interviews?

Know the common ones by heart: F.col(), F.lit(), F.when(), F.coalesce(), F.regexp_extract(), Window.partitionBy().orderBy(). You do not need to memorize every function, but fumbling through basic column operations signals inexperience. Practice writing PySpark without autocomplete.

How deep should I go on Spark internals for a PySpark interview?

Understand the driver/executor model, how shuffles work, what a stage boundary is, and how to read the Spark UI. You do not need to know Netty internals or scheduler implementation details. Interviewers care that you can diagnose performance problems, not that you can recite source code.

Should I practice PySpark on my laptop or on a cluster?

Start on your laptop with local mode (spark.master = local[*]). This lets you iterate quickly. Once comfortable, practice on a cluster (Databricks Community Edition is free) to understand executor behavior, memory limits, and real shuffle costs. Interview questions are testable locally, but cluster experience helps you answer architecture questions confidently.

What PySpark version should I study for interviews in 2026?

Focus on Spark 3.4+ features: Adaptive Query Execution (AQE), dynamic partition pruning, and Pandas UDFs with Arrow. Most companies run Spark 3.x in production. If a company uses Databricks, also study Delta Lake integration and Unity Catalog.

Practice PySpark Interview Questions

Write real PySpark code. Understand the execution plan. Walk into your interview knowing exactly how data moves through the cluster.