PySpark Coding Practice: What Gets Tested by Level
Joins make up ~22% of PySpark interview questions. Window functions are another 18%. The rest splits across groupBy, optimization, dedup, and null handling. Here is what interviewers test at each seniority level from L3 to L7.
PySpark Interview Topic Distribution
Joins (all types): 22%
Window functions: 18%
GroupBy and aggregations: 15%
Data skew and optimization: 14%
Deduplication: 10%
Execution plans and Spark UI: 8%
Null handling: 5%
UDFs and complex types: 4%
File formats and partitioning: 4%
Junior (L3/L4)
Can you write correct transformations?
DataFrame filters, selects, and column expressions (~35% of junior PySpark screens)
Basic joins (inner, left) with correct key handling (~25%)
groupBy with sum, count, avg aggregations (~20%)
Null handling: fillna, coalesce, isNull vs isNotNull (~10%)
Type casting and column renaming (~10%)
Example Problem
Given an orders DataFrame, calculate the total revenue per customer for the last 90 days. Exclude cancelled orders.
Common Mistake
Forgetting that a left join introduces NULLs in right-side columns for unmatched keys. Filtering after the join instead of before it, which inflates shuffle volume.
Mid-Level (L4/L5)
Can you handle real data problems?
Window functions: row_number, rank, lag, running totals (~30% of mid-level screens)
Deduplication: dropDuplicates vs window dedup (~20%)
Multi-table joins with 3+ DataFrames (~15%)
Pivot tables and complex aggregations (~15%)
UDFs: when to use them, when to avoid them (~10%)
Date/time manipulation and time zone handling (~10%)
Example Problem
For each product category, find the top 3 customers by spending in the last quarter. Include their rank and percentage of category total.
Common Mistake
Using rank() instead of row_number() and getting duplicate ranks on ties. Reaching for a UDF when a built-in function exists (Spark ships with hundreds of built-in functions, and a Python UDF forfeits Catalyst optimization).
Senior (L5/L6)
Can you diagnose and fix performance problems?
Join optimization: broadcast vs sort-merge, skew handling (~25% of senior screens)
Execution plan reading: interpreting explain() output (~15%)
Caching strategy: when to persist, when to checkpoint (~10%)
Dynamic partition pruning and predicate pushdown (~10%)
Example Problem
A nightly job joining 800M rows with a 2M-row lookup is stuck. One task reads 15.8GB while 199 others finished in 22 seconds. Diagnose the root cause and write the fix.
Common Mistake
Reaching for broadcast when the table is 50GB (spark.sql.autoBroadcastJoinThreshold defaults to 10MB). Not recognizing that shuffle write/read is the most common bottleneck in slow Spark jobs.
Staff (L7+)
Can you design the system, not just write the query?
Pipeline architecture: partition strategy, file sizing (tested in system design rounds)
Incremental processing: watermarks, merge patterns (tested in system design rounds)
Cost modeling: executor sizing, dynamic allocation tradeoffs (tested in system design rounds)
Spark internals: Catalyst optimizer, Tungsten memory model (tested in deep-dive rounds)
Example Problem
Design a pipeline that processes 2TB of clickstream data daily. The downstream team needs sub-minute freshness for dashboards but also runs weekly ML training jobs on the same data.
Common Mistake
Optimizing the query without questioning the partition layout. Proposing streaming without addressing exactly-once semantics or late data handling.
PySpark Coding Practice FAQ
What PySpark topics are most tested in data engineering interviews?
Joins account for roughly 22% of PySpark interview questions, followed by window functions at 18% and groupBy/aggregations at 15%. Senior roles shift heavily toward optimization: data skew, shuffle analysis, and execution plan interpretation make up another 22% combined.
How many PySpark problems should I practice before an interview?
For junior roles, 15 to 20 problems covering joins, groupBy, and filters gives reasonable coverage. For senior roles, add 10 to 15 optimization and debugging problems. The goal is not volume. It is recognizing patterns: when you see a skewed join, you should reach for salting without thinking.
Is PySpark or Scala Spark more common in interviews?
PySpark accounts for roughly 70% of Spark API usage in production. Most interviews default to PySpark unless the role is on an infrastructure team that maintains Spark libraries in Scala. Practice in the language your target company uses.
What separates a passing PySpark answer from a strong one?
A passing answer produces correct output. A strong answer also mentions shuffle cost, explains why the chosen approach scales, and identifies edge cases (NULLs, skew, late-arriving data). Interviewers test whether you have built and maintained a real pipeline, not just written a correct query.