PySpark Interview Questions for Data Engineers (2026)
PySpark is the most common way data engineers interact with Spark. Interviewers test DataFrame fluency, join strategies, shuffle awareness, and the ability to diagnose slow jobs from the Spark UI.
Written by engineers who have conducted hundreds of data engineering interviews at companies running PySpark in production.
TL;DR: PySpark Interview Questions
PySpark interview questions for data engineers in 2026 concentrate on DataFrame API operations, joins (broadcast vs sort-merge), window functions, UDF performance, partitioning and shuffles, and data skew handling.
For most coding interviews you write a window-function query or aggregation, then a debugging scenario where you diagnose a slow stage from the Spark UI. Senior roles add system design (a streaming pipeline, a Delta Lake table layout, a Pandas UDF vs Arrow tradeoff). Avoid Python UDFs when built-in functions exist; broadcast small dimensions; cache only when reused.
PySpark Best Practices: Decision Matrix
Interviewers love asking which approach to take. These are the answers that signal production experience.
| Question | Right answer | Why |
|---|---|---|
| UDF or built-in? | Built-in (regexp_extract, when, coalesce) | Catalyst can optimize built-ins; UDFs are black boxes with serialization overhead |
| Python UDF or Pandas UDF? | Pandas UDF | Vectorized over Arrow batches: 10-100x faster |
| repartition or coalesce? | repartition for new partition count, coalesce only when reducing | coalesce avoids shuffle but causes skew if reducing too aggressively |
| Cache or not? | Cache only if read 2+ times | Caching once-used DataFrames wastes executor memory |
| select() or withColumn()? | select() for many columns; withColumn() for one or two | Chained withColumn() creates a new projection each time |
| .show() or .write()? | .write() for production; .show() only for debugging | .show() forces a tiny action that may run a partial DAG inefficiently |
What PySpark Interviewers Expect
PySpark questions separate candidates who have written real pipelines from those who only completed a tutorial. Interviewers do not care whether you memorize every function signature. They care whether you understand how data moves through a distributed system and whether you can make it move efficiently.
Junior candidates should be able to write basic DataFrame transformations: filters, joins, group-by aggregations, and window functions. You should explain the difference between transformations and actions.
Mid-level candidates need to discuss partitioning strategies, broadcast joins, and why Python UDFs are slow. You should be able to read a Spark plan and identify shuffle boundaries.
Senior candidates must diagnose performance problems: data skew, small file problems, memory pressure, and speculative execution. You should articulate tradeoffs between different join strategies and explain when AQE helps and when it does not.
PySpark Core Concepts Interviewers Test
DataFrame API vs RDD
The DataFrame API is the primary interface for PySpark in production. RDDs still matter for understanding the execution model, but interviewers want to see you default to DataFrames. Know when RDDs are appropriate: custom partitioners, unstructured data, or low-level control.
Transformations vs Actions
Transformations are lazy. Actions trigger execution. Interviewers will ask you to trace a chain of operations and identify where computation actually happens. Common trap: calling .count() inside a loop forces a full DAG evaluation each time.
Partitioning Strategy
repartition() triggers a full shuffle. coalesce() reduces partitions without a shuffle. Hash partitioning distributes data by key. Range partitioning is useful for sorted output. Interviewers test whether you understand the cost of each approach.
Broadcast Joins
When one side of a join fits in memory, broadcasting it to every executor avoids a shuffle. The threshold is controlled by spark.sql.autoBroadcastJoinThreshold (default 10MB). Interviewers ask what happens when you broadcast a table that is too large: OOM errors on executors.
UDFs and Performance
Python UDFs serialize data between JVM and Python, creating massive overhead. Pandas UDFs (vectorized) are 10x to 100x faster because they operate on Arrow batches. Interviewers want to hear you avoid UDFs entirely when a built-in function exists.
Data Skew Handling
Skew causes one partition to take 100x longer than others. Salting keys, using broadcast joins on the skewed side, or repartitioning before the join are standard fixes. Interviewers test whether you can diagnose skew from Spark UI symptoms.
PySpark Interview Questions with Guidance
Explain the difference between DataFrame.select() and DataFrame.withColumn(). When would you use each?
select() projects specific columns and can rename or transform them. withColumn() adds or replaces a single column while keeping all others. Use select() when you want to control exactly which columns appear in the output. Use withColumn() for adding derived columns. A strong answer notes that chaining many withColumn() calls is inefficient because each creates a new projection; select() with multiple expressions is better.
A PySpark job takes 4 hours. The Spark UI shows one task in the final stage taking 3.5 hours while all others finish in 2 minutes. What is happening and how do you fix it?
This is data skew. One partition holds disproportionately more data than others. Fixes include salting the join key (appending a random suffix, joining on the salted key, then aggregating), broadcasting the smaller table, or using AQE skew join optimization (spark.sql.adaptive.skewJoin.enabled). A strong answer mentions checking the partition sizes in the Spark UI to confirm the diagnosis before applying a fix.
What is the difference between repartition(n) and coalesce(n)? When would coalesce cause problems?
repartition(n) performs a full shuffle to create exactly n partitions with roughly equal data. coalesce(n) merges existing partitions without a shuffle, so it can only reduce partition count. coalesce causes problems when reducing from many partitions to very few: some executors end up with much more data than others, creating skew. It also cannot increase partition count.
You have a Python UDF that runs a regex on every row. The job is slow. Walk through your optimization approach.
First, check if a built-in function can replace the UDF (regexp_extract, regexp_replace). If not, convert to a Pandas UDF (vectorized) to process Arrow batches instead of row-by-row serialization. If the regex is complex, consider precompiling the pattern outside the UDF. A strong answer quantifies the cost: Python UDFs force serialization between JVM and Python for every row, which can be 10x to 100x slower than native Spark functions.
Explain how PySpark window functions work. Write a query that ranks employees by salary within each department.
Window functions operate on a partition of rows defined by a WindowSpec. You define partitionBy (grouping), orderBy (sorting), and optionally rowsBetween or rangeBetween. The computation runs without collapsing rows. A strong answer includes: from pyspark.sql.window import Window; w = Window.partitionBy('dept').orderBy(F.desc('salary')); df.withColumn('rank', F.rank().over(w)).
What happens when you cache a DataFrame in PySpark? Where is it stored and when should you avoid caching?
cache() marks the DataFrame for storage in executor memory (MEMORY_AND_DISK by default). The data is materialized on the first action and reused for subsequent actions. Avoid caching when the DataFrame is used only once, when it is very large relative to available memory, or when the upstream computation is cheap. A strong answer mentions unpersist() to free memory explicitly and notes that caching serialized data (MEMORY_ONLY_SER) reduces memory footprint at the cost of CPU.
How do you handle late-arriving data in a PySpark structured streaming job?
Use watermarks. withWatermark('event_time', '10 minutes') tells Spark to drop data older than 10 minutes past the latest event time seen. This allows stateful aggregations to discard stale state. A strong answer discusses the tradeoff: a shorter watermark frees memory faster but drops more late data; a longer watermark retains more state but uses more memory. Mention that output mode (append vs update vs complete) affects when results are emitted.
You need to join a 500GB fact table with a 50MB dimension table in PySpark. Describe your approach.
Broadcast the 50MB dimension table. Use F.broadcast(dim_df) or rely on autoBroadcastJoinThreshold. This sends the small table to every executor and avoids shuffling the 500GB fact table entirely. A strong answer mentions verifying the dimension table size is accurate (not just the on-disk size, which may differ from in-memory size after deserialization), and notes that if the dimension table grows past the threshold, the job will silently switch to a sort-merge join with a full shuffle.
What is the Catalyst optimizer and how does it affect PySpark code?
Catalyst is Spark SQL's query optimizer. It converts logical plans to optimized physical plans through rule-based and cost-based optimization. It handles predicate pushdown, column pruning, join reordering, and constant folding. PySpark DataFrame operations go through Catalyst; RDD operations do not. This is a key reason DataFrames outperform RDDs for structured data. A strong answer notes that Catalyst cannot optimize Python UDFs because it treats them as black boxes.
Explain the difference between map(), mapPartitions(), and foreach() in PySpark. When would you choose each?
map() applies a function to each row individually. mapPartitions() applies a function to an entire partition iterator, which is more efficient for operations that have setup cost (database connections, model loading). foreach() is an action that applies a function for side effects without returning data. Choose mapPartitions() when you need to amortize expensive initialization across rows. Choose foreach() or foreachPartition() for writing to external systems.
Practice PySpark in Your Browser
Reading answers is not the same as writing code under pressure. DataDriven runs a Spark-compatible execution engine in your browser that transpiles PySpark DataFrame operations, executes them against real datasets, and returns results in seconds. Write joins, window functions, and aggregations the same way you would on a production cluster. Then run mock interviews where an AI interviewer asks follow-ups about your approach.
Start Practicing PySparkWorked Example: Salting a Skewed Join
When one join key dominates the data (e.g., 80% of events belong to one customer), the partition handling that key becomes a bottleneck. Salting distributes the hot key across multiple partitions.
from pyspark.sql import functions as F
SALT_BUCKETS = 10
# Add salt column to the large (fact) table
fact_salted = fact_df.withColumn(
"salt", (F.rand() * SALT_BUCKETS).cast("int")
)
# Explode the small (dimension) table across salt values
dim_exploded = dim_df.crossJoin(
spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)
# Join on original key + salt
result = fact_salted.join(
dim_exploded,
on=["customer_id", "salt"],
how="inner"
).drop("salt")The fact table gets a random salt from 0 to 9. The dimension table is duplicated 10 times, once per salt value. The join now distributes the hot key across 10 partitions instead of one. The cost: 10x duplication of the small table. The benefit: near linear speedup on skewed keys.
Common PySpark Interview Mistakes
Using collect() on large DataFrames, pulling millions of rows to the driver and causing OOM errors
Writing Python UDFs for operations that have built-in Spark equivalents like regexp_extract or when/otherwise
Ignoring partition count after filtering, leaving thousands of near-empty partitions that create task scheduling overhead
Caching DataFrames that are only used once, wasting executor memory
Not understanding that .show() and .count() are actions that trigger full DAG evaluation
Confusing DataFrame.groupBy().agg() with RDD.groupByKey(), which loads all values for a key into memory
PySpark Interview Questions FAQ
Is PySpark tested differently than Scala Spark in interviews?+
Do I need to memorize PySpark function signatures for interviews?+
How deep should I go on Spark internals for a PySpark interview?+
Should I practice PySpark on my laptop or on a cluster?+
What PySpark version should I study for interviews in 2026?+
What are the most common PySpark coding interview questions?+
What are PySpark best practices interviewers look for?+
What core concepts should I master before a PySpark interview?+
How are PySpark interview questions and answers structured at FAANG companies?+
Practice PySpark Interview Questions
Write real PySpark code. Understand the execution plan. Walk into your interview knowing exactly how data moves through the cluster.
Continue your prep
Data Engineer Interview Prep, explore the full guide
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 921 companies, collected from real candidates.
Interview Rounds
By Company
- Stripe Data Engineer Interview
- Airbnb Data Engineer Interview
- Uber Data Engineer Interview
- Netflix Data Engineer Interview
- Databricks Data Engineer Interview
- Snowflake Data Engineer Interview
- Lyft Data Engineer Interview
- DoorDash Data Engineer Interview
- Instacart Data Engineer Interview
- Robinhood Data Engineer Interview
- Pinterest Data Engineer Interview
- Twitter/X Data Engineer Interview
By Role
- Senior Data Engineer Interview
- Staff Data Engineer Interview
- Principal Data Engineer Interview
- Junior Data Engineer Interview
- Entry-Level Data Engineer Interview
- Analytics Engineer Interview
- ML Data Engineer Interview
- Streaming Data Engineer Interview
- GCP Data Engineer Interview
- AWS Data Engineer Interview
- Azure Data Engineer Interview