Spark SQL Reference

Spark SQL Anti Join: LEFT ANTI JOIN Syntax and Performance

LEFT ANTI JOIN finds unmatched rows and short-circuits on the first match. It is faster than LEFT JOIN + IS NULL and safer than NOT IN when NULLs exist. Here is the syntax for both Spark SQL and PySpark, with the performance tradeoffs interviewers test.

LEFT ANTI JOIN Syntax in Spark SQL

-- Find orders with no matching customer record.
-- LEFT ANTI JOIN short-circuits on the first match:
-- once Spark confirms a match exists, it skips that row
-- and moves on. No need to scan further.

SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c
ON o.customer_id = c.customer_id

-- Returns: all orders where customer_id has no match in customers
-- Only columns from the left table (orders) appear in the result

LEFT ANTI JOIN returns rows from the left table where the join condition finds no match in the right table. It short-circuits on the first match: once Spark confirms a match exists for a given row, it stops scanning. This makes it faster than LEFT JOIN + IS NULL for the same logical operation.
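The semantics can be modeled in plain Python. This is an illustrative sketch of what an anti join computes, not Spark's actual implementation; the toy `orders` and `customers` data is made up for the example:

```python
# Toy model of LEFT ANTI JOIN semantics (illustration only, not Spark internals).
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},  # no matching customer
]
customers = [{"customer_id": 10}, {"customer_id": 20}]

# any() stops probing customers at the first match, mirroring the
# short-circuit behavior: a matched order is dropped immediately.
orphaned = [
    o for o in orders
    if not any(o["customer_id"] == c["customer_id"] for c in customers)
]
# orphaned == [{"order_id": 2, "customer_id": 99}]
```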

LEFT ANTI JOIN in PySpark DataFrame API

# PySpark uses how="left_anti" as the join type
orphaned_orders = orders.join(
    customers,
    on="customer_id",
    how="left_anti"
)

# The result contains ONLY columns from orders.
# No columns from customers appear because no match exists.
# This is a key difference from left_semi, which also
# returns only left columns but keeps MATCHED rows.

In PySpark, pass how="left_anti" to the join method. The result contains only columns from the left DataFrame. Interviewers sometimes ask the difference between left_anti and left_semi: anti keeps unmatched rows, semi keeps matched rows. Both return only left-side columns.
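The anti/semi contrast can be sketched in plain Python (toy data, not Spark code): the two join types apply opposite filters but both keep only left-side columns.

```python
# Toy data contrasting left_anti vs left_semi (illustration only).
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},
]
customer_ids = {10, 20}

# left_anti: keep left rows WITHOUT a match; only left columns survive
anti = [o for o in orders if o["customer_id"] not in customer_ids]

# left_semi: keep left rows WITH a match; still only left columns
semi = [o for o in orders if o["customer_id"] in customer_ids]
```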

Anti Join vs LEFT JOIN + IS NULL

-- Approach 1: LEFT ANTI JOIN (preferred in Spark)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c
ON o.customer_id = c.customer_id

-- Approach 2: LEFT JOIN + IS NULL (traditional SQL)
SELECT o.*
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL

-- Same result. Different performance.
-- LEFT ANTI JOIN short-circuits: stops scanning once a match is found.
-- LEFT JOIN must complete the full join, carry all columns,
-- then filter. More shuffle data, more memory.

Both produce the same result. The tradeoff is performance. LEFT ANTI JOIN short-circuits on the first match, so Spark can skip rows early. LEFT JOIN + IS NULL must complete the full left outer join, carry all right-side columns through the shuffle, and then filter. On a 500M-row left table joined against a 10M-row right table, the anti join moves significantly less data.
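The difference is visible even in a pure-Python sketch (toy data, not Spark internals): the left-join approach materializes right-side columns for every row before discarding the matches.

```python
# Pure-Python sketch of the two approaches (illustration only).
orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 99}]
customers = [{"customer_id": 10, "name": "Ada"}]

# Approach 1: anti join -- never materializes right-side columns
anti = [o for o in orders
        if not any(o["customer_id"] == c["customer_id"] for c in customers)]

# Approach 2: left join + filter -- builds full joined rows first,
# including right-side columns, then throws the matched rows away
joined = []
for o in orders:
    matches = [c for c in customers if c["customer_id"] == o["customer_id"]]
    if matches:
        joined.extend({**o, **c} for c in matches)
    else:
        joined.append({**o, "name": None})  # unmatched -> NULL right side
filtered = [row for row in joined if row["name"] is None]
```

Both end up keeping only order 2, but the second approach carried the `name` column for every row along the way.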

Anti Join vs NOT IN (NULL Trap)

-- NOT IN is dangerous when NULLs exist
SELECT *
FROM orders
WHERE customer_id NOT IN (SELECT customer_id FROM customers)
-- If customers.customer_id contains ANY NULL value,
-- this returns ZERO rows. SQL three-valued logic.

-- LEFT ANTI JOIN handles NULLs correctly
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c
ON o.customer_id = c.customer_id
-- Returns the correct result regardless of NULLs

NOT IN has a well-known NULL trap: if the subquery returns even one NULL, NOT IN can never evaluate to TRUE, because every comparison against NULL yields UNKNOWN, so the WHERE clause filters out every row. LEFT ANTI JOIN handles NULLs correctly. Mentioning this in an interview signals production experience with data quality issues.
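Three-valued logic can be modeled directly. The helper below is a hypothetical sketch of how SQL evaluates `x NOT IN (...)`, returning `True`, `False`, or `None` for TRUE/FALSE/UNKNOWN:

```python
# Model of SQL three-valued logic for `x NOT IN (subquery)` (illustration).
# A comparison against NULL yields UNKNOWN (None here); NOT IN is TRUE
# only when every comparison is definitely FALSE.
def sql_not_in(x, values):
    saw_unknown = False
    for v in values:
        if v is None:
            saw_unknown = True        # x = NULL -> UNKNOWN
        elif x == v:
            return False              # definite match -> NOT IN is FALSE
    return None if saw_unknown else True  # one UNKNOWN poisons the result

# Without NULLs: behaves as expected
assert sql_not_in(99, [10, 20]) is True
# With a single NULL: never TRUE, so WHERE filters out every row
assert sql_not_in(99, [10, None]) is None
```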

Anti Join on Multiple Columns

-- Find date/product combinations with no sales
SELECT d.date, p.product_id
FROM dates d
CROSS JOIN products p
LEFT ANTI JOIN sales s
ON d.date = s.sale_date AND p.product_id = s.product_id

-- PySpark equivalent
# Note: sales uses sale_date (per the SQL above), so the join
# condition must be written out explicitly rather than as a
# shared column-name list.
grid = dates.crossJoin(products)
gaps = grid.join(
    sales,
    on=(grid["date"] == sales["sale_date"])
        & (grid["product_id"] == sales["product_id"]),
    how="left_anti"
)

Anti joins work with composite keys. This pattern finds gaps in coverage: missing date/product combinations, missing region/metric pairs, or any other hole in a cross-product. In production, this is how you build data quality checks for completeness.
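The same gap-finding pattern can be sketched in plain Python with a cross product and an anti-join filter (toy dates, products, and sales invented for the example):

```python
# Pure-Python version of the coverage-gap pattern (illustration only).
from itertools import product

dates = ["2024-01-01", "2024-01-02"]
products = ["A", "B"]
sales = {("2024-01-01", "A"), ("2024-01-02", "B")}  # observed (sale_date, product_id)

# Cross product of the dimensions, anti-joined against observed sales
grid = product(dates, products)
gaps = [pair for pair in grid if pair not in sales]
# gaps == [("2024-01-01", "B"), ("2024-01-02", "A")]
```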

Spark SQL Anti Join FAQ

What does LEFT ANTI JOIN do in Spark SQL?
It returns rows from the left table that have no matching row in the right table. It is equivalent to NOT EXISTS in standard SQL. It short-circuits on the first match, so Spark stops scanning the right side as soon as it confirms a match exists for a given left row.
Is LEFT ANTI JOIN the same as EXCEPT in Spark?
No. EXCEPT compares all columns between two DataFrames and deduplicates the result. LEFT ANTI JOIN compares only the join key columns, preserves all left-side columns, and does not deduplicate. Use EXCEPT when you want set subtraction on identical schemas. Use LEFT ANTI JOIN when you want unmatched rows from a specific table.
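The deduplication difference can be shown with a pure-Python sketch (toy single-column rows, not Spark code): set subtraction collapses duplicates the way EXCEPT does, while an anti-join-style filter keeps them.

```python
# EXCEPT-like set subtraction vs anti-join-style filtering (illustration only).
left = [("a",), ("a",), ("b",)]   # note the duplicate ("a",) row
right = [("b",)]

# EXCEPT: set semantics, result is deduplicated
except_result = sorted(set(left) - set(right))

# Anti join: row-by-row filter, duplicates on the left are preserved
anti_result = [row for row in left if row not in right]
```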
Does Spark support RIGHT ANTI JOIN?
No. Spark only supports LEFT ANTI JOIN. To get a right anti join effect, swap the table order: put the table you want results from on the left side of the join.
Why is LEFT ANTI JOIN faster than LEFT JOIN + IS NULL?
LEFT ANTI JOIN short-circuits on the first match. It does not need to carry right-side columns through the shuffle or complete the full join. LEFT JOIN + IS NULL must join all matching rows, carry all columns, and then filter. The shuffle volume difference grows with the size of the right table.

Practice Spark SQL Anti Joins Before Your Interview

DataDriven has Spark SQL challenges where you write anti joins, semi joins, and other join patterns against real datasets.

Start Practicing