PySpark isin(): Performance Limits and the Semi Join Alternative
isin() with more than 10,000 values degrades to a BroadcastNestedLoopJoin. For small lists it is fast and simple. For large reference sets, switch to a semi join. Negation with ~ silently drops NULL rows.
Basic isin() Filter
from pyspark.sql import functions as F
# isin() is the PySpark equivalent of SQL IN.
# Interviewers expect you to know the performance boundary.
target_depts = ["Engineering", "Product", "Design"]
result = df.filter(F.col("department").isin(target_depts))
# Under the hood, Spark inlines these values into the plan.
# For short lists (< 1,000 values), this is fast and simple.

isin() checks if a column value matches any item in a provided list. For small lists, Spark inlines the values directly into the execution plan. This is efficient for filtering on known categories, statuses, or codes.
Negation with ~ (NOT IN) and the NULL Trap
# Exclude specific statuses
result = df.filter(~F.col("status").isin(["cancelled", "refunded"]))
# WARNING: if status contains NULL values, ~isin() returns NULL
# for those rows (not True). They get filtered OUT.
# This is the same three-valued logic trap as SQL NOT IN.
# Safe version: handle NULLs explicitly
result = df.filter(
    (~F.col("status").isin(["cancelled", "refunded"])) |
    F.col("status").isNull()
)

The ~ operator negates isin(). But NULLs break the logic: ~NULL evaluates to NULL, not True, so rows with NULL in the column silently disappear from the result. A strong answer mentions this trap and handles NULLs explicitly.
Performance Limit: isin() with Large Lists
# isin() with more than 10,000 values degrades performance.
# Spark rewrites it as a BroadcastNestedLoopJoin internally,
# which is O(n*m) instead of O(n).
# Bad: 50,000 collected IDs in an isin() call
valid_ids = ref_df.select("id").rdd.flatMap(lambda x: x).collect()
result = df.filter(F.col("id").isin(valid_ids)) # slow + driver memory
# Good: use a left semi join instead. Stays distributed.
result = df.join(ref_df.select("id"), on="id", how="left_semi")
# The semi join uses a hash table on the smaller side.
# BroadcastHashJoin is O(n). BroadcastNestedLoopJoin is O(n*m).

isin() with more than 10,000 values degrades to a BroadcastNestedLoopJoin internally, which is O(n*m). It also requires collecting the list to the driver, consuming driver memory. A left semi join stays distributed and uses a hash lookup at O(n). Interviewers test whether you know this boundary and can switch approaches.
Combining isin() with Other Conditions
# Chain with & (AND) and | (OR). Parentheses are required
# because Python operator precedence puts & above |.
result = df.filter(
    F.col("department").isin(["Engineering", "Product"]) &
    (F.col("salary") > 150000) &
    ~F.col("status").isin(["terminated"])
)
# For readability on complex filters, build conditions separately
is_target_dept = F.col("department").isin(["Engineering", "Product"])
is_senior = F.col("salary") > 150000
is_active = ~F.col("status").isin(["terminated"])
result = df.filter(is_target_dept & is_senior & is_active)

Always wrap individual conditions in parentheses when using & or |. Python operator precedence can produce unexpected results without them. For complex filters, assign conditions to variables for readability.
PySpark isin() FAQ
When does isin() become too slow in PySpark? Above roughly 10,000 values, when Spark rewrites the filter as a BroadcastNestedLoopJoin.
How do I negate isin() safely with NULLs? Combine ~isin() with an explicit isNull() check using |, because ~NULL evaluates to NULL and those rows are dropped.
What is the difference between isin() and a semi join? isin() inlines the values into the plan and requires collecting them to the driver; a left semi join stays distributed and uses a hash lookup.
Practice PySpark isin and Filtering Before Your Interview
DataDriven has PySpark challenges that test isin, semi joins, anti joins, and null handling patterns against real datasets.