Spark · Interview concept

Spark Joins: Broadcast, Shuffle, and Skew

Spark joins look like SQL joins but execute as distributed shuffles. Choosing broadcast vs shuffle, handling skewed keys, and reading the physical plan are the interview-relevant skills.


Overview

What Spark Joins is and why it matters

A Spark join reads two DataFrames, shuffles their rows so matching keys land on the same executor, and produces combined rows. The shuffle is the expensive part: Spark serializes rows, writes them to local disk, sends them across the network, and re-partitions them by the join key.

If one side is small enough to fit in each executor's memory, Spark can broadcast it instead. Broadcast joins skip the shuffle entirely and are 10-100x faster. Knowing when broadcast applies and how to force or prevent it is the core of Spark join mastery.
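The shuffle mechanics above can be sketched in plain Python: a hash partitioner assigns every row with a given join key to the same partition index, which is why both sides must be redistributed before a shuffle join can merge them. This is an illustration only, with toy data; Spark's real partitioner uses Murmur3 on the JVM.

```python
# Sketch of hash partitioning, the mechanism behind the shuffle:
# every row with the same join key maps to the same partition index,
# so matching rows from both sides are guaranteed to meet there.
NUM_PARTITIONS = 4

def partition_for(key, n=NUM_PARTITIONS):
    # Stand-in for Spark's partitioner (the real one uses Murmur3)
    return hash(key) % n

orders = [("US", 100), ("DE", 200), ("US", 300)]

# "Shuffle": group rows by their partition index
partitions = {}
for key, amount in orders:
    partitions.setdefault(partition_for(key), []).append((key, amount))

# All "US" rows landed in exactly one partition
us_partitions = {partition_for(k) for k, _ in orders if k == "US"}
print(len(us_partitions))  # 1
```

A broadcast join avoids this entire step by copying the small side to every executor, so no row of the large side has to move.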

Syntax

Spark Joins reference

df1.join(df2, cond, 'inner')
    Standard inner join on a boolean condition. 'inner' is the default.
'left' / 'right' / 'outer'
    Outer join variants. 'outer' is the SQL FULL OUTER JOIN.
'left_anti'
    Returns rows from the left side that have no match on the right. The SQL NOT EXISTS pattern.
'left_semi'
    Returns rows from the left side that have at least one match on the right, without right columns.
broadcast(df)
    Hint to the planner: broadcast this side instead of shuffling. Fails if the side exceeds spark.sql.autoBroadcastJoinThreshold.
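The semi and anti variants in the reference above reduce to set membership on the join key. A plain-Python sketch of their semantics, using hypothetical toy data rather than the Spark API:

```python
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]  # (customer_id, name)
orders = [(101, 1), (102, 1), (103, 3)]                # (order_id, customer_id)

order_keys = {cust_id for _, cust_id in orders}

# left_semi: left rows with at least one match, left columns only
semi = [c for c in customers if c[0] in order_keys]

# left_anti: left rows with no match -- the NOT EXISTS pattern
anti = [c for c in customers if c[0] not in order_keys]

print(semi)  # [(1, 'Ada'), (3, 'Edsger')]
print(anti)  # [(2, 'Grace')]
```

Note that neither variant ever emits right-side columns, and neither duplicates a left row no matter how many matches exist on the right.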

Patterns

How Spark Joins shows up in practice

Broadcast join for small dimension tables

When one side is small (under ~10-50 MB), broadcasting skips the shuffle. Spark auto-broadcasts below the threshold, but explicit broadcast() makes intent clear and guarantees the plan.

orders.join(broadcast(countries),
            orders.country == countries.code,
            "inner")

Interview note: spark.sql.autoBroadcastJoinThreshold defaults to 10 MB. Interviewers will ask how Spark decides between broadcast and shuffle; the answer is this threshold plus any explicit join hint.

Sort-merge join (the shuffle default)

When both sides are large, Spark uses sort-merge: shuffle by key, sort each partition, then merge. This is the default join strategy for big-on-big joins.

# Both sides large, no broadcast hint → sort-merge
orders.join(events, "user_id", "inner")

Interview note: The physical plan shows SortMergeJoin. Look for two Exchange nodes above it (one per side); those are the shuffles.

Left anti join (NOT EXISTS pattern)

Find rows on the left with no match on the right. Faster than a LEFT JOIN followed by WHERE right.col IS NULL because Spark short-circuits as soon as it finds any match.

# Customers who never placed an order
customers.join(orders,
               customers.customer_id == orders.customer_id,
               "left_anti")

Handling skewed keys with salting

If one key has disproportionately many rows (one customer placed 80% of orders), the partition holding that key becomes a straggler. Salting distributes the hot key across multiple partitions.

from pyspark.sql.functions import col, lit, rand, floor, concat

# Salt the hot side
N_SALTS = 10
orders_salted = orders.withColumn(
    "salt", floor(rand() * N_SALTS)
).withColumn(
    "salted_key", concat(col("customer_id"), lit("_"), col("salt"))
)

# Explode the small side across every salt value
customers_exploded = (
    customers.crossJoin(spark.range(N_SALTS).withColumnRenamed("id", "salt"))
    .withColumn("salted_key", concat(col("customer_id"), lit("_"), col("salt")))
)

result = orders_salted.join(customers_exploded, "salted_key", "inner")

Interview note: Salting is the canonical senior-level answer to skew. Mention Adaptive Query Execution (AQE) auto skew handling as the modern alternative in Spark 3+.

Bucketed join

If both tables are bucketed on the same key with the same number of buckets, Spark can skip the shuffle on repeated joins. Useful for warehouse tables queried many times.

# Pre-bucket once
orders.write.bucketBy(200, "customer_id").saveAsTable("orders_bucketed")
customers.write.bucketBy(200, "customer_id").saveAsTable("customers_bucketed")

# All future joins on customer_id are shuffle-free
spark.table("orders_bucketed").join(
    spark.table("customers_bucketed"),
    "customer_id"
)

Interview questions

Spark Joins interview questions

Q1. Explain the difference between a broadcast join and a sort-merge join in Spark.

What they test: Whether you understand the shuffle and its cost. A mid-level candidate names both; a senior explains when each applies and why.

Approach: Broadcast ships the smaller DataFrame to every executor, skipping the shuffle. It applies when one side fits under spark.sql.autoBroadcastJoinThreshold. Sort-merge is the default for big-on-big: shuffle both sides by the join key, sort, then merge. Broadcast is 10-100x faster when it applies.

Q2. You have a join between a 2 TB orders table and a 500-row countries table. The job takes 2 hours. What do you check?

What they test: Ability to reason about why auto-broadcast did not kick in. 500 rows should be trivially broadcastable.

Approach: 1) Confirm the physical plan shows BroadcastHashJoin, not SortMergeJoin. 2) If not, the countries table's estimated size may exceed the threshold (compressed on-disk size and in-memory size can differ substantially). 3) Force it with broadcast(countries). 4) Check whether AQE is enabled and could have rescued it dynamically.

Q3. How do you detect and fix a skewed join?

What they test: A senior-level skill. Requires both detection (Spark UI, partition time distribution) and remediation (salting, AQE, splitting hot keys).

Approach: Detect: the Spark UI shows one or two tasks running 10x longer than their peers, and the partition size histogram shows a long tail. Fix: enable AQE skew join handling (spark.sql.adaptive.skewJoin.enabled=true), salt the hot key, or handle the top-N hot keys separately with a broadcast and union with the rest.
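The detection step reduces to arithmetic on per-task durations: flag a stage as skewed when its slowest task runs far longer than the median. A rough sketch; the ratio threshold here is illustrative, not a Spark default:

```python
import statistics

def is_skewed(task_durations_s, ratio=5.0):
    # Heuristic: compare the slowest task to the median, mirroring
    # what you eyeball in the Spark UI's task-time distribution
    median = statistics.median(task_durations_s)
    return max(task_durations_s) > ratio * median

balanced = [30, 31, 29, 32, 30]
skewed = [30, 31, 29, 32, 600]  # one straggler partition

print(is_skewed(balanced))  # False
print(is_skewed(skewed))    # True
```

AQE applies a similar comparison per partition (a size factor against the median plus an absolute size threshold) before deciding to split a skewed partition.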

Q4. What's the difference between left_anti and NOT EXISTS?

What they test: SQL-to-Spark translation fluency. Also whether you know semi/anti joins are more efficient than LEFT JOIN + IS NULL.

Approach: Semantically identical: left_anti returns rows from the left with no match on the right. Under the hood, Spark can short-circuit as soon as any match is found, unlike a LEFT JOIN, which joins everything and then filters. On skewed joins, this matters.

Q5. Your join produces duplicate rows. How do you diagnose it?

What they test: Debugging instincts. The cause is almost always a many-to-many relationship on the join key.

Approach: Run df.groupBy(join_key).count().filter('count > 1').show() on both sides. The side with duplicates is producing the fan-out. Fix by deduplicating upstream, aggregating before joining, or using a semi join if you only need existence.
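The fan-out is pure multiplication: if a key appears m times on the left and n times on the right, an inner join emits m × n rows for that key. A toy check in plain Python (hypothetical keys, not Spark API):

```python
from collections import Counter

left_keys = ["a", "a", "b"]        # key "a" duplicated: m = 2
right_keys = ["a", "a", "a", "b"]  # key "a" duplicated: n = 3

lc, rc = Counter(left_keys), Counter(right_keys)

# Inner-join row count without materializing the join:
# sum of m * n over the keys present on both sides
rows = sum(lc[k] * rc[k] for k in lc.keys() & rc.keys())
print(rows)  # 2*3 + 1*1 = 7
```

This same arithmetic is why the groupBy-count check above pinpoints the offending side: any key with count > 1 on both sides multiplies.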

Q6. When would you use bucketing instead of partitioning for joins?

What they test: Data warehouse design knowledge. Partitioning is for pruning; bucketing is for shuffle elimination.

Approach: Bucketing pre-shuffles the data by a specific key. If both join sides are bucketed on the same key with the same bucket count, the shuffle step is eliminated. Use bucketing for tables joined repeatedly on the same key. Partitioning, by contrast, optimizes read pruning by filter predicate.
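Why equal bucket counts matter: each side assigns a row to bucket hash(key) % num_buckets at write time, so with matching counts, bucket i of one table only ever needs bucket i of the other. With mismatched counts that alignment breaks, and Spark must shuffle after all. A sketch where Python's hash stands in for Spark's bucketing hash:

```python
def bucket(key, n):
    # Stand-in for Spark's bucketing function: hash(key) % num_buckets
    return hash(key) % n

keys = list(range(100))

# Same bucket count on both sides: every key lands in the same bucket id
same_counts = all(bucket(k, 8) == bucket(k, 8) for k in keys)

# Different counts (8 vs 16): most keys land in different bucket ids,
# so bucket-to-bucket joining is impossible without a shuffle
diff_counts = all(bucket(k, 8) == bucket(k, 16) for k in keys)

print(same_counts, diff_counts)  # True False
```

The invariant looks trivial written this way, but it is exactly the guarantee Spark exploits: both writers committed to the same deterministic key-to-bucket function.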

Common mistakes

What breaks in practice

Joining on a nullable key without filtering

NULLs do not match NULLs in SQL or Spark joins. If your join key has NULLs, those rows silently disappear in an inner join or survive as unmatched rows in an outer join. Decide explicitly how to handle them.

Wrong

# Silently drops rows where customer_id is NULL on either side
orders.join(customers, "customer_id")

Right

# Make NULL handling explicit
orders.filter(orders.customer_id.isNotNull()) \
      .join(customers, "customer_id")
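A plain-Python sketch of why the NULL rows vanish: an inner join only emits rows whose keys compare equal, and in SQL semantics NULL never equals anything, including NULL. Toy data below; Python's None stands in for SQL NULL, so the non-match is enforced explicitly.

```python
orders = [(None, 10), ("c1", 20), ("c2", 30)]   # (customer_id, amount)
customers = [("c1", "Ada"), (None, "Ghost")]     # (customer_id, name)

def inner_join(left, right):
    out = []
    for lk, lv in left:
        for rk, rv in right:
            # SQL semantics: NULL = NULL is not true, so NULL keys never match
            if lk is not None and lk == rk:
                out.append((lk, lv, rv))
    return out

print(inner_join(orders, customers))  # [('c1', 20, 'Ada')]
```

Both NULL-keyed rows drop out with no warning, which is exactly the silent data loss the filter above makes visible.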

Broadcasting a DataFrame that's too large

broadcast() forces a plan change. If the DataFrame exceeds the driver's memory, the job OOMs on the driver before any work happens.

Wrong

# countries is actually 5 GB after joins. OOMs the driver
orders.join(broadcast(countries), "code")

Right

# Check size first: broadcast only when the side is genuinely small
n_rows = countries.count()
small_enough = n_rows < 100_000  # adjust the bound for your row width

if small_enough:
    result = orders.join(broadcast(countries), "code")
else:
    result = orders.join(countries, "code")

Using .join(df2) without an explicit condition

Spark accepts .join(df2) with no condition and produces a Cartesian product. On big tables this generates trillions of rows and hangs the cluster.

Wrong

# Cartesian product, may never finish
orders.join(customers).count()

Right

orders.join(customers,
            orders.customer_id == customers.id,
            "inner")

FAQ

Common questions

What is a broadcast join in Spark?
A broadcast join ships the smaller DataFrame to every executor, letting each executor join locally without shuffling the larger side. It applies when one side fits under spark.sql.autoBroadcastJoinThreshold (10 MB by default).
What is the default join type in Spark?
inner. If you omit the how parameter, Spark performs an inner join, returning only rows where the join condition matches on both sides.
How do I detect a skewed join?
Check the Spark UI stages page. A skewed join shows one or two tasks running far longer than peers, with a partition size histogram heavily right-tailed. Look for the Exchange node in the physical plan feeding into the SortMergeJoin.
What is the difference between left_semi and inner join?
left_semi returns rows from the left side that have at least one match on the right, without including right-side columns. inner returns all matching row combinations from both sides. Semi is faster when you only need existence.
Can I force Spark to use a sort-merge join?
Yes. Use the SHUFFLE_MERGE hint: df1.hint('shuffle_merge').join(df2, ...). You can also disable auto-broadcast by setting spark.sql.autoBroadcastJoinThreshold = -1.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.