PySpark Reference

PySpark isin(): Performance Limits and the Semi Join Alternative

isin() with very large value lists (roughly 10,000+ items) degrades: every value is inlined into the query plan, and the list must first be collected to the driver. For small lists it is fast and simple. For large reference sets, switch to a left semi join. Negation with ~ silently drops NULL rows.

Basic isin() Filter

from pyspark.sql import functions as F

# isin() is the PySpark equivalent of SQL IN.
# Interviewers expect you to know the performance boundary.
target_depts = ["Engineering", "Product", "Design"]
result = df.filter(F.col("department").isin(target_depts))

# Under the hood, Spark inlines these values into the plan.
# For short lists (< 1,000 values), this is fast and simple.

isin() checks if a column value matches any item in a provided list. For small lists, Spark inlines the values directly into the execution plan. This is efficient for filtering on known categories, statuses, or codes.

Negation with ~ (NOT IN) and the NULL Trap

# Exclude specific statuses
result = df.filter(~F.col("status").isin(["cancelled", "refunded"]))

# WARNING: if status contains NULL values, ~isin() returns NULL
# for those rows (not True). They get filtered OUT.
# This is the same three-valued logic trap as SQL NOT IN.

# Safe version: handle NULLs explicitly
result = df.filter(
    (~F.col("status").isin(["cancelled", "refunded"])) |
    F.col("status").isNull()
)

The ~ operator negates isin(). But NULLs break the logic: ~NULL evaluates to NULL, not True. Rows with NULL in the column silently disappear from the result. A strong answer mentions this trap and handles NULLs explicitly.
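The same trap can be reproduced in plain SQL. A minimal sketch using sqlite3 from the standard library (the table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "shipped"), (2, "cancelled"), (3, None)],
)

# NOT IN: status NOT IN (...) evaluates to NULL for row 3, so it vanishes.
dropped = conn.execute(
    "SELECT id FROM orders WHERE status NOT IN ('cancelled', 'refunded')"
).fetchall()

# An explicit IS NULL check keeps the NULL row, like the PySpark fix above.
kept = conn.execute(
    "SELECT id FROM orders"
    " WHERE status NOT IN ('cancelled', 'refunded') OR status IS NULL"
).fetchall()

print(dropped)  # [(1,)]        row 3 is silently gone
print(kept)     # [(1,), (3,)]  row 3 is retained
```

Whether you write it in SQL or PySpark, the fix is the same: decide explicitly what NULL should mean in your filter.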

Performance Limit: isin() with Large Lists

# isin() with more than ~10,000 values degrades performance.
# Every literal is inlined into the query plan, bloating plan size,
# analysis time, and generated code, and the list must fit on the driver.

# Bad: 50,000 collected IDs in an isin() call
valid_ids = ref_df.select("id").rdd.flatMap(lambda x: x).collect()
result = df.filter(F.col("id").isin(valid_ids))  # slow + driver memory

# Good: use a left semi join instead. Stays distributed.
result = df.join(ref_df.select("id"), on="id", how="left_semi")

# The semi join hashes the smaller side (a broadcast hash join when
# ref_df is small) and probes it once per row. Nothing is collected
# to the driver, and the work stays distributed across executors.

isin() with more than about 10,000 values degrades: every literal is inlined into the query plan, bloating plan size and analysis time, and building the list requires collecting it to the driver. A left semi join stays distributed and resolves membership with one hash lookup per row. Interviewers test whether you know this boundary and can switch approaches.
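The cost difference is easy to see in miniature. A plain-Python analogy (not Spark internals; the numbers are arbitrary): scanning a value list per row is roughly n*m work, while probing a hash set is one O(1) lookup per row.

```python
ids = list(range(5_000))
ref_list = list(range(2_500, 7_500))  # stands in for a huge isin() literal list
ref_set = set(ref_list)               # stands in for the semi join's hash table

# List membership rescans ref_list for every row: up to n*m comparisons.
slow = [i for i in ids if i in ref_list]

# Set membership is one hash probe per row: ~n total work, same answer.
fast = [i for i in ids if i in ref_set]

print(slow == fast)  # True: identical results, very different cost curves
```

Both approaches keep exactly the matching rows; only the per-row cost changes, which is why the gap only shows up at scale.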

Combining isin() with Other Conditions

# Chain with & (AND) and | (OR). Parentheses are required because
# & and | bind tighter than comparison operators like >.
result = df.filter(
    F.col("department").isin(["Engineering", "Product"]) &
    (F.col("salary") > 150000) &
    ~F.col("status").isin(["terminated"])
)

# For readability on complex filters, build conditions separately
is_target_dept = F.col("department").isin(["Engineering", "Product"])
is_senior = F.col("salary") > 150000
is_active = ~F.col("status").isin(["terminated"])

result = df.filter(is_target_dept & is_senior & is_active)

Always wrap individual conditions in parentheses when using & or |. Both operators bind tighter than comparisons like >, so an unparenthesized condition silently compares against the wrong operand. For complex filters, assign conditions to variables for readability.
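The precedence rule can be demonstrated with plain Python integers standing in for Columns (made-up numbers, purely illustrative):

```python
salary = 100_000
active = 1

# With parentheses: each comparison evaluates first, then & combines them.
with_parens = (salary > 150_000) & (active == 1)   # False & True -> False

# Without parentheses: & "steals" the 150_000, so this parses as
#   salary > (150_000 & active)  ==  salary > 0
without_parens = salary > 150_000 & active

print(with_parens, without_parens)  # False True -> opposite answers!
```

With real Columns the unparenthesized form usually raises an analysis error rather than returning wrong rows, but the parse is the same, which is why the parentheses are non-negotiable.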

PySpark isin() FAQ

When does isin() become too slow in PySpark?
isin() with more than about 10,000 values degrades sharply: every literal is inlined into the query plan, bloating plan size and analysis time, and the list must be collected to the driver. For large reference sets, switch to a left semi join, which hashes the smaller side and probes it with O(1) lookups per row.
How do I negate isin() safely with NULLs?
Use ~F.col("column").isin([...]) | F.col("column").isNull(). Without the explicit NULL check, rows with NULL values are silently excluded because ~NULL evaluates to NULL, not True.
What is the difference between isin() and a semi join?
isin() inlines a list of literal values into the query plan. A semi join compares against another DataFrame. For fewer than 1,000 values, isin() is simpler and equally fast. For larger reference sets, semi joins are more efficient because they stay distributed and use hash-based lookups.

Practice PySpark isin and Filtering Before Your Interview

DataDriven has PySpark challenges that test isin, semi joins, anti joins, and null handling patterns against real datasets.
