PySpark Reference

PySpark isin(): Performance Limits and the Semi Join Alternative

isin() with very large value lists (roughly 10,000+ items) degrades: every value is inlined into the query plan, and the list must first be collected to the driver. For small lists it is fast and simple. For large reference sets, switch to a left semi join. Negation with ~ silently drops NULL rows.

Basic isin() Filter

from pyspark.sql import functions as F

# isin() is the PySpark equivalent of SQL IN.
# Interviewers expect you to know the performance boundary.
target_depts = ["Engineering", "Product", "Design"]
result = df.filter(F.col("department").isin(target_depts))

# Under the hood, Spark inlines these values into the plan.
# For short lists (< 1,000 values), this is fast and simple.

isin() checks if a column value matches any item in a provided list. For small lists, Spark inlines the values directly into the execution plan. This is efficient for filtering on known categories, statuses, or codes.

Negation with ~ (NOT IN) and the NULL Trap

# Exclude specific statuses
result = df.filter(~F.col("status").isin(["cancelled", "refunded"]))

# WARNING: if status contains NULL values, ~isin() returns NULL
# for those rows (not True). They get filtered OUT.
# This is the same three-valued logic trap as SQL NOT IN.

# Safe version: handle NULLs explicitly
result = df.filter(
    (~F.col("status").isin(["cancelled", "refunded"])) |
    F.col("status").isNull()
)

The ~ operator negates isin(). But NULLs break the logic: ~NULL evaluates to NULL, not True. Rows with NULL in the column silently disappear from the result. A strong answer mentions this trap and handles NULLs explicitly.
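The same trap can be reproduced in plain SQL. A minimal sketch using sqlite3 from the standard library (the table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "shipped"), (2, "cancelled"), (3, None)],
)

# NOT IN: status NOT IN (...) evaluates to NULL for row 3, so it vanishes.
dropped = conn.execute(
    "SELECT id FROM orders WHERE status NOT IN ('cancelled', 'refunded')"
).fetchall()

# An explicit IS NULL check keeps the NULL row, like the PySpark fix above.
kept = conn.execute(
    "SELECT id FROM orders"
    " WHERE status NOT IN ('cancelled', 'refunded') OR status IS NULL"
).fetchall()

print(dropped)  # [(1,)]        row 3 is silently gone
print(kept)     # [(1,), (3,)]  row 3 is retained
```

Whether you write it in SQL or PySpark, the fix is the same: decide explicitly what NULL should mean in your filter.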

Performance Limit: isin() with Large Lists

# isin() with more than ~10,000 values degrades performance.
# Every literal is inlined into the query plan, bloating plan size,
# analysis time, and generated code, and the list must fit on the driver.

# Bad: 50,000 collected IDs in an isin() call
valid_ids = ref_df.select("id").rdd.flatMap(lambda x: x).collect()
result = df.filter(F.col("id").isin(valid_ids))  # slow + driver memory

# Good: use a left semi join instead. Stays distributed.
result = df.join(ref_df.select("id"), on="id", how="left_semi")

# The semi join hashes the smaller side (a broadcast hash join when
# ref_df is small) and probes it once per row. Nothing is collected
# to the driver, and the work stays distributed across executors.

isin() with more than about 10,000 values degrades: every literal is inlined into the query plan, bloating plan size and analysis time, and building the list requires collecting it to the driver. A left semi join stays distributed and resolves membership with one hash lookup per row. Interviewers test whether you know this boundary and can switch approaches.
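The cost difference is easy to see in miniature. A plain-Python analogy (not Spark internals; the numbers are arbitrary): scanning a value list per row is roughly n*m work, while probing a hash set is one O(1) lookup per row.

```python
ids = list(range(5_000))
ref_list = list(range(2_500, 7_500))  # stands in for a huge isin() literal list
ref_set = set(ref_list)               # stands in for the semi join's hash table

# List membership rescans ref_list for every row: up to n*m comparisons.
slow = [i for i in ids if i in ref_list]

# Set membership is one hash probe per row: ~n total work, same answer.
fast = [i for i in ids if i in ref_set]

print(slow == fast)  # True: identical results, very different cost curves
```

Both approaches keep exactly the matching rows; only the per-row cost changes, which is why the gap only shows up at scale.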

Combining isin() with Other Conditions

# Chain with & (AND) and | (OR). Parentheses are required because
# & and | bind tighter than comparison operators like >.
result = df.filter(
    F.col("department").isin(["Engineering", "Product"]) &
    (F.col("salary") > 150000) &
    ~F.col("status").isin(["terminated"])
)

# For readability on complex filters, build conditions separately
is_target_dept = F.col("department").isin(["Engineering", "Product"])
is_senior = F.col("salary") > 150000
is_active = ~F.col("status").isin(["terminated"])

result = df.filter(is_target_dept & is_senior & is_active)

Always wrap individual conditions in parentheses when using & or |. Both operators bind tighter than comparisons like >, so an unparenthesized condition silently compares against the wrong operand. For complex filters, assign conditions to variables for readability.
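The precedence rule can be demonstrated with plain Python integers standing in for Columns (made-up numbers, purely illustrative):

```python
salary = 100_000
active = 1

# With parentheses: each comparison evaluates first, then & combines them.
with_parens = (salary > 150_000) & (active == 1)   # False & True -> False

# Without parentheses: & "steals" the 150_000, so this parses as
#   salary > (150_000 & active)  ==  salary > 0
without_parens = salary > 150_000 & active

print(with_parens, without_parens)  # False True -> opposite answers!
```

With real Columns the unparenthesized form usually raises an analysis error rather than returning wrong rows, but the parse is the same, which is why the parentheses are non-negotiable.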

PySpark isin() FAQ

When does isin() become too slow in PySpark?
isin() with more than about 10,000 values degrades sharply: every literal is inlined into the query plan, bloating plan size and analysis time, and the list must be collected to the driver. For large reference sets, switch to a left semi join, which hashes the smaller side and probes it with O(1) lookups per row.
How do I negate isin() safely with NULLs?
Use ~F.col("column").isin([...]) | F.col("column").isNull(). Without the explicit NULL check, rows with NULL values are silently excluded because ~NULL evaluates to NULL, not True.
What is the difference between isin() and a semi join?
isin() inlines a list of literal values into the query plan. A semi join compares against another DataFrame. For fewer than 1,000 values, isin() is simpler and equally fast. For larger reference sets, semi joins are more efficient because they stay distributed and use hash-based lookups.

Practice PySpark isin and Filtering Before Your Interview

DataDriven has PySpark challenges that test isin, semi joins, anti joins, and null handling patterns against real datasets.
