Data Engineering Interview Prep

PySpark Joins: Broadcast, Anti, Multi-Column

The join syntax does not determine performance. The join strategy does. BroadcastHashJoin is O(n). SortMergeJoin is O(n log n). Spark picks the strategy based on table sizes, join type, and hints. Interviewers test whether you know the difference.

autoBroadcastJoinThreshold defaults to 10MB. Shuffle write/read is the #1 performance bottleneck in 80%+ of slow Spark jobs.

PySpark Join on Multiple Columns

# Join on multiple columns
result = orders.join(
    customers,
    on=["customer_id", "region"],
    how="inner"
)

# Alternative syntax with expressions
result = orders.join(
    customers,
    (orders.customer_id == customers.customer_id) &
    (orders.region == customers.region),
    how="inner"
)

Pass a list of column names when both DataFrames share the same column names. Use the expression syntax when column names differ between tables. The list syntax automatically deduplicates the join columns in the output; the expression syntax keeps both copies (you need to drop one manually).
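The logic of a multi-column equi-join can be sketched in pure Python (hypothetical data, not PySpark): the join key is the tuple of both columns, mirroring on=["customer_id", "region"].

```python
# Pure-Python sketch of a multi-column equi-join: the composite key is
# the tuple (customer_id, region). Hypothetical rows for illustration.
orders = [
    {"customer_id": 1, "region": "EU", "amount": 100},
    {"customer_id": 2, "region": "US", "amount": 250},
]
customers = [
    {"customer_id": 1, "region": "EU", "name": "Acme"},
    {"customer_id": 2, "region": "EU", "name": "Globex"},  # region differs: no match
]

# Index the right side by the composite key
by_key = {(c["customer_id"], c["region"]): c for c in customers}

# Inner join: keep only left rows whose composite key exists on the right
joined = [
    {**o, "name": by_key[(o["customer_id"], o["region"])]["name"]}
    for o in orders
    if (o["customer_id"], o["region"]) in by_key
]
print(joined)  # only order 1 matches on both columns
```

Note that a row matches only when every key column agrees, which is why order 2 drops out despite its customer_id matching.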

PySpark Broadcast Join (Small Table Optimization)

from pyspark.sql import functions as F

# Force broadcast of the smaller table
result = events.join(
    F.broadcast(users),
    on="user_id",
    how="inner"
)

# Spark sends the entire users table to every executor
# No shuffle of the large events table
# Default threshold: 10MB (spark.sql.autoBroadcastJoinThreshold)

Broadcasting eliminates the shuffle of the large table entirely. The small table is sent to every executor and held in memory. BroadcastHashJoin is O(n) on the large side. The tradeoff: if you broadcast a table that is too large, executors OOM. Reaching for broadcast when the table is 50GB is a common interview mistake. Verify the in-memory size, not just the on-disk size, because deserialization expands compressed data.
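What BroadcastHashJoin does can be sketched in plain Python (hypothetical data): the small side becomes an in-memory hash map, and the large side is streamed through it in one pass with no shuffle.

```python
# Conceptual sketch of BroadcastHashJoin, not PySpark code: the small
# (broadcast) side is a hash map replicated to every "executor"; the
# large side is scanned once, each row costing a single hash lookup.
users = {"u1": "alice", "u2": "bob"}                          # small side
events = [("u1", "click"), ("u3", "view"), ("u2", "click")]   # large side

result = [
    (uid, action, users[uid])
    for uid, action in events
    if uid in users  # inner join: drop rows with no match
]
print(result)
```

The O(n) claim in the text is visible here: one lookup per large-side row, with the whole small side held in memory, which is also why an oversized broadcast table causes executor OOM.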

PySpark Left Anti Join (Find Missing Rows)

# Find orders with no matching customer
orphaned_orders = orders.join(
    customers,
    on="customer_id",
    how="left_anti"
)

# Equivalent SQL
spark.sql("""
    SELECT o.*
    FROM orders o
    LEFT ANTI JOIN customers c
    ON o.customer_id = c.customer_id
""")

Left anti join returns rows from the left table that have no match in the right table. It is the PySpark equivalent of WHERE NOT EXISTS (close to NOT IN, though NOT IN returns zero rows if the subquery yields any NULLs). More efficient than a left join followed by a filter for IS NULL because Spark can short-circuit the comparison.
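The semantics reduce to a membership test, sketched here in pure Python with hypothetical data:

```python
# Sketch of left anti join semantics (not PySpark): keep left rows whose
# key never appears on the right -- the WHERE NOT EXISTS pattern.
orders = [("o1", "c1"), ("o2", "c9"), ("o3", "c2")]  # (order_id, customer_id)
customer_ids = {"c1", "c2"}

orphaned = [(oid, cid) for oid, cid in orders if cid not in customer_ids]
print(orphaned)  # [("o2", "c9")]
```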

PySpark Left Semi Join (Exists Check)

# Find customers who have at least one order
active_customers = customers.join(
    orders,
    on="customer_id",
    how="left_semi"
)

# Returns only columns from the left table
# Does NOT duplicate rows if multiple orders exist

Left semi join returns rows from the left table where a match exists in the right table, but only returns left-table columns. It does not multiply rows. Use it as a performant WHERE EXISTS. In production, this is common for filtering dimension tables against fact tables before a heavier join.
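The key property, no row multiplication, comes from the fact that semi join is a membership test, not a pairing. A pure-Python sketch with hypothetical data:

```python
# Sketch of left semi join semantics (not PySpark): even though c1 has
# three orders, the c1 row appears once, and no right-side columns are
# carried into the output.
customers = [("c1", "Acme"), ("c2", "Globex"), ("c3", "Initech")]
order_customer_ids = ["c1", "c1", "c1", "c3"]  # c1 has three orders

ordering_ids = set(order_customer_ids)  # dedupe before the membership test
active = [(cid, name) for cid, name in customers if cid in ordering_ids]
print(active)  # c1 appears exactly once
```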

Handle Duplicate Columns After PySpark Join

# Problem: both tables have 'name' column
# After join, you get ambiguous 'name' references

# Solution 1: Rename before joining
customers_renamed = customers.withColumnRenamed("name", "customer_name")
result = orders.join(customers_renamed, on="customer_id")

# Solution 2: Use aliases and select specific columns
result = orders.alias("o").join(
    customers.alias("c"),
    F.col("o.customer_id") == F.col("c.customer_id")
).select("o.*", F.col("c.name").alias("customer_name"))

# Solution 3: Drop after join
result = orders.join(customers, on="customer_id").drop(customers.name)

Duplicate column names after joins cause ambiguous reference errors downstream. The cleanest approach is to rename before the join. If you use the list syntax for the join condition (on=['customer_id']), the join key column is automatically deduplicated. Interviewers watch for whether you handle this proactively or discover it at runtime.

PySpark Cross Join (Cartesian Product)

# Every combination of date and product
date_product_grid = dates.crossJoin(products)

# Useful for generating scaffolds, then left-joining actuals
# WARNING: row count = left_rows * right_rows
# 1000 dates * 5000 products = 5 million rows

Cross joins produce the Cartesian product. Spark refuses an implicit Cartesian product unless you call .crossJoin() explicitly or set spark.sql.crossJoin.enabled=true (the config defaults to true from Spark 3.0, but an explicit .crossJoin() still documents intent). Use sparingly. In production, cross joins appear when building date/dimension scaffolds for gap-filling queries.
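The scaffold pattern mentioned above can be sketched with itertools (hypothetical data): cross join the dimensions, then left-join the sparse actuals so missing combinations surface explicitly.

```python
from itertools import product

# Sketch of the gap-filling scaffold (not PySpark): build every
# (date, product) combination, then attach actuals where they exist.
dates = ["2024-01-01", "2024-01-02"]
products = ["A", "B"]
actuals = {("2024-01-01", "A"): 10}  # sparse fact data

grid = list(product(dates, products))          # 2 * 2 = 4 rows
filled = [(d, p, actuals.get((d, p))) for d, p in grid]
print(len(grid), filled)  # gaps show up as None
```

The row-count arithmetic from the warning applies directly: len(grid) is always len(dates) * len(products).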

How Spark Chooses a Join Strategy

Spark evaluates table statistics, join type (inner, left, anti), and any broadcast hints you provide. AQE can override the initial choice at runtime if the actual data sizes differ from catalog estimates. Data skew triggers AQE skew join optimization when a partition exceeds 5x the median size AND 256MB.
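The skew test can be expressed directly from the two default knobs (spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5 and skewedPartitionThresholdInBytes = 256MB); the partition sizes below are hypothetical:

```python
import statistics

# Sketch of AQE's skew check under the default configuration: a
# partition is skewed only if it exceeds BOTH 5x the median partition
# size AND 256 MB.
FACTOR = 5
THRESHOLD_BYTES = 256 * 1024**2

def is_skewed(size_bytes, all_sizes):
    median = statistics.median(all_sizes)
    return size_bytes > FACTOR * median and size_bytes > THRESHOLD_BYTES

# Nine 100 MB partitions and one 900 MB partition (hypothetical)
sizes = [100 * 1024**2] * 9 + [900 * 1024**2]
flags = [is_skewed(s, sizes) for s in sizes]
print(flags)  # only the 900 MB partition trips both conditions
```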

BroadcastHashJoin

When: One side fits in executor memory. Default threshold: 10MB (spark.sql.autoBroadcastJoinThreshold).

Cost: O(n) scan of the large side. No shuffle. Fastest join strategy by a wide margin.

SortMergeJoin

When: Both sides are large. Default strategy for equi-joins when neither side is small enough to broadcast.

Cost: O(n log n). Both sides shuffled into spark.sql.shuffle.partitions (default 200) and sorted by join key.
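After both sides are sorted, the merge itself is a linear two-pointer pass, sketched here in pure Python (hypothetical data; real SortMergeJoin also buffers duplicate keys, which this unique-key sketch skips):

```python
# Sketch of the merge phase of SortMergeJoin (not PySpark): both inputs
# are sorted by join key; two pointers advance in lockstep. Assumes
# unique keys for simplicity.
left = [(1, "a"), (2, "b"), (4, "c")]    # sorted by key
right = [(2, "x"), (3, "y"), (4, "z")]   # sorted by key

i = j = 0
merged = []
while i < len(left) and j < len(right):
    lk, rk = left[i][0], right[j][0]
    if lk == rk:
        merged.append((lk, left[i][1], right[j][1]))
        i += 1
        j += 1
    elif lk < rk:
        i += 1   # left key too small: advance left pointer
    else:
        j += 1   # right key too small: advance right pointer
print(merged)
```

The O(n log n) cost comes from the sort; the merge pass shown here is O(n).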

ShuffleHashJoin

When: One side is medium-sized. Avoids the sort step of SortMergeJoin.

Cost: Both sides shuffled. The smaller side builds a hash table per partition. Faster than SortMergeJoin when the hash table fits in memory.

BroadcastNestedLoopJoin

When: Non-equi joins (theta joins) with a small table. Fallback for conditions Spark cannot hash or sort-merge.

Cost: Slowest. Compares every row pair. Avoid on large tables.
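A non-equi condition such as a range match cannot be hashed or merged, which forces the every-pair comparison. A pure-Python sketch with hypothetical data:

```python
# Sketch of a nested-loop theta join (not PySpark): a range condition
# forces comparing every (event, tier) pair -- O(n * m), which is why
# BroadcastNestedLoopJoin is the slowest strategy.
events = [("e1", 15), ("e2", 42)]              # (event_id, value)
tiers = [("low", 0, 20), ("high", 20, 100)]    # (tier, lo, hi)

tiered = [
    (eid, tier)
    for eid, value in events
    for tier, lo, hi in tiers
    if lo <= value < hi   # non-equi condition: no hash key exists
]
print(tiered)
```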

Common PySpark Join Mistakes in Interviews

Reaching for broadcast when the table is 50GB. Verify in-memory size, not just file size on disk.

Using a left join + WHERE IS NULL instead of left_anti. The anti join lets Spark short-circuit without materializing null columns.

Ignoring AQE skew join optimization. If a partition exceeds 5x the median partition size and 256MB, AQE splits it automatically (AQE is enabled by default from Spark 3.2).

Not aliasing DataFrames before joining, then hitting ambiguous column errors downstream in a 15-step pipeline.

Forgetting that spark.sql.shuffle.partitions defaults to 200. After a join on a 500GB table, 200 partitions means 2.5GB per partition, likely causing executor OOM.
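The partition arithmetic in the last point is worth doing explicitly; the back-of-envelope version, with a hypothetical 128 MB target size:

```python
# Back-of-envelope shuffle partition sizing: 500 GB across the default
# 200 partitions is 2.5 GB each. Targeting ~128 MB per partition
# (a common rule of thumb, assumed here) suggests ~4000 partitions.
table_bytes = 500 * 1024**3          # 500 GB of shuffled data
default_parts = 200

per_partition_gb = table_bytes / default_parts / 1024**3
print(per_partition_gb)              # 2.5

target_mb = 128
needed_parts = table_bytes / (target_mb * 1024**2)
print(int(needed_parts))             # 4000
```

In practice you would set spark.sql.shuffle.partitions accordingly, or let AQE coalesce partitions down from a deliberately high initial count.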

PySpark Joins FAQ

How do I join PySpark DataFrames on multiple columns?
Pass a list of column names: df1.join(df2, on=['col_a', 'col_b'], how='inner'). This works when both DataFrames share the same column names. For different names, use the expression syntax with & to combine conditions.
What is a left anti join in PySpark?
A left anti join returns rows from the left DataFrame that have no matching row in the right DataFrame. It is the PySpark equivalent of SQL's WHERE NOT EXISTS. Use how='left_anti' in the join call.
How do I avoid duplicate columns after a PySpark join?
Use the list syntax for the join condition (on=['key']) which auto-deduplicates the key column. For other shared columns, rename before the join with .withColumnRenamed() or use aliases and explicit .select().
When should I use a broadcast join in PySpark?
When one side of the join fits in executor memory. The default autoBroadcastJoinThreshold is 10MB. Broadcasting eliminates the shuffle of the large table. Force it with F.broadcast(small_df). If the small table exceeds executor memory after deserialization, executors will OOM.
How does AQE affect join strategy selection?
Adaptive Query Execution (Spark 3.0+) re-optimizes joins at runtime. If one side of a SortMergeJoin turns out to be smaller than the broadcast threshold after filtering, AQE switches to a BroadcastHashJoin. It also handles skew joins: when a partition exceeds 5x the median size AND 256MB, AQE splits it automatically.

Practice Spark Join Questions

DataDriven runs PySpark and Scala Spark code in your browser. Write joins against real datasets, see the execution plan, and run mock interviews where an AI interviewer probes your join strategy choices.

Start Practicing