Spark · Interview concept

Spark Joins: Broadcast, Shuffle, and Skew

Spark joins look like SQL joins but execute as distributed shuffles. Choosing broadcast vs shuffle, handling skewed keys, and reading the physical plan are the interview-relevant skills.


Overview

What Spark Joins is and why it matters

A Spark join reads two DataFrames, shuffles their rows so matching keys land on the same executor, and produces combined rows. The shuffle is the expensive part: Spark serializes rows, writes them to local disk, sends them across the network, and re-partitions them by the join key.

If one side is small enough to fit in each executor's memory, Spark can broadcast it instead. Broadcast joins skip the shuffle entirely and are 10-100x faster. Knowing when broadcast applies and how to force or prevent it is the core of Spark join mastery.
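The shuffle mechanics above can be sketched in plain Python: a hash partitioner assigns every row with a given join key to the same partition index, which is why both sides must be redistributed before a shuffle join can merge them. This is an illustration only, with toy data; Spark's real partitioner uses Murmur3 on the JVM.

```python
# Sketch of hash partitioning, the mechanism behind the shuffle:
# every row with the same join key maps to the same partition index,
# so matching rows from both sides are guaranteed to meet there.
NUM_PARTITIONS = 4

def partition_for(key, n=NUM_PARTITIONS):
    # Stand-in for Spark's partitioner (the real one uses Murmur3)
    return hash(key) % n

orders = [("US", 100), ("DE", 200), ("US", 300)]

# "Shuffle": group rows by their partition index
partitions = {}
for key, amount in orders:
    partitions.setdefault(partition_for(key), []).append((key, amount))

# All "US" rows landed in exactly one partition
us_partitions = {partition_for(k) for k, _ in orders if k == "US"}
print(len(us_partitions))  # 1
```

A broadcast join avoids this entire step by copying the small side to every executor, so no row of the large side has to move.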

Syntax

Spark Joins reference

df1.join(df2, cond, 'inner')
    Standard inner join on a boolean condition. 'inner' is the default.
'left' / 'right' / 'outer'
    Outer join variants. 'outer' is the SQL FULL OUTER JOIN.
'left_anti'
    Returns rows from the left side that have no match on the right. The SQL NOT EXISTS pattern.
'left_semi'
    Returns rows from the left side that have at least one match on the right, without right columns.
broadcast(df)
    Hint to the planner: broadcast this side instead of shuffling. Fails if the side exceeds spark.sql.autoBroadcastJoinThreshold.
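The semi and anti variants in the reference above reduce to set membership on the join key. A plain-Python sketch of their semantics, using hypothetical toy data rather than the Spark API:

```python
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]  # (customer_id, name)
orders = [(101, 1), (102, 1), (103, 3)]                # (order_id, customer_id)

order_keys = {cust_id for _, cust_id in orders}

# left_semi: left rows with at least one match, left columns only
semi = [c for c in customers if c[0] in order_keys]

# left_anti: left rows with no match -- the NOT EXISTS pattern
anti = [c for c in customers if c[0] not in order_keys]

print(semi)  # [(1, 'Ada'), (3, 'Edsger')]
print(anti)  # [(2, 'Grace')]
```

Note that neither variant ever emits right-side columns, and neither duplicates a left row no matter how many matches exist on the right.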

Patterns

How Spark Joins shows up in practice

Broadcast join for small dimension tables

When one side is small (under ~10-50 MB), broadcasting skips the shuffle. Spark auto-broadcasts below the threshold, but explicit broadcast() makes intent clear and guarantees the plan.

orders.join(broadcast(countries),
            orders.country == countries.code,
            "inner")

Interview note: spark.sql.autoBroadcastJoinThreshold defaults to 10 MB. Interviewers will ask how Spark decides between broadcast and shuffle; the answer is this threshold plus any explicit join hint.

Sort-merge join (the shuffle default)

When both sides are large, Spark uses sort-merge: shuffle by key, sort each partition, then merge. This is the default join strategy for big-on-big joins.

# Both sides large, no broadcast hint → sort-merge
orders.join(events, "user_id", "inner")

Interview note: The physical plan shows SortMergeJoin. Look for two Exchange nodes above it (one per side); those are the shuffles.

Left anti join (NOT EXISTS pattern)

Find rows on the left with no match on the right. Faster than a LEFT JOIN followed by WHERE right.col IS NULL because Spark short-circuits as soon as it finds any match.

# Customers who never placed an order
customers.join(orders,
               customers.customer_id == orders.customer_id,
               "left_anti")

Handling skewed keys with salting

If one key has disproportionately many rows (one customer placed 80% of orders), the partition holding that key becomes a straggler. Salting distributes the hot key across multiple partitions.

from pyspark.sql.functions import col, lit, rand, floor, concat

# Salt the hot side
N_SALTS = 10
orders_salted = orders.withColumn(
    "salt", floor(rand() * N_SALTS)
).withColumn(
    "salted_key", concat(col("customer_id"), lit("_"), col("salt"))
)

# Explode the small side across every salt value
customers_exploded = (
    customers.crossJoin(spark.range(N_SALTS).withColumnRenamed("id", "salt"))
    .withColumn("salted_key", concat(col("customer_id"), lit("_"), col("salt")))
)

result = orders_salted.join(customers_exploded, "salted_key", "inner")

Interview note: Salting is the canonical senior-level answer to skew. Mention Adaptive Query Execution (AQE) auto skew handling as the modern alternative in Spark 3+.

Bucketed join

If both tables are bucketed on the same key with the same number of buckets, Spark can skip the shuffle on repeated joins. Useful for warehouse tables queried many times.

# Pre-bucket once
orders.write.bucketBy(200, "customer_id").saveAsTable("orders_bucketed")
customers.write.bucketBy(200, "customer_id").saveAsTable("customers_bucketed")

# All future joins on customer_id are shuffle-free
spark.table("orders_bucketed").join(
    spark.table("customers_bucketed"),
    "customer_id"
)

Interview questions

Spark Joins interview questions

Q1. Explain the difference between a broadcast join and a sort-merge join in Spark.

What they test: Whether you understand the shuffle and its cost. A mid-level candidate names both; a senior explains when each applies and why.

Approach: Broadcast ships the smaller DataFrame to every executor, skipping the shuffle. It applies when one side fits under spark.sql.autoBroadcastJoinThreshold. Sort-merge is the default for big-on-big: shuffle both sides by the join key, sort, then merge. Broadcast is 10-100x faster when it applies.

Q2. You have a join between a 2 TB orders table and a 500-row countries table. The job takes 2 hours. What do you check?

What they test: Ability to reason about why auto-broadcast did not kick in. 500 rows should be trivially broadcastable.

Approach: 1) Confirm the physical plan shows BroadcastHashJoin, not SortMergeJoin. 2) If not, the countries table's estimated size may exceed the threshold (compressed on-disk size and in-memory size can differ substantially). 3) Force it with broadcast(countries). 4) Check whether AQE is enabled and could have rescued it dynamically.

Q3. How do you detect and fix a skewed join?

What they test: A senior-level skill. Requires both detection (Spark UI, partition time distribution) and remediation (salting, AQE, splitting hot keys).

Approach: Detect: the Spark UI shows one or two tasks running 10x longer than their peers, and the partition size histogram shows a long tail. Fix: enable AQE skew join handling (spark.sql.adaptive.skewJoin.enabled=true), salt the hot key, or handle the top-N hot keys separately with a broadcast and union with the rest.
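The detection step reduces to arithmetic on per-task durations: flag a stage as skewed when its slowest task runs far longer than the median. A rough sketch; the ratio threshold here is illustrative, not a Spark default:

```python
import statistics

def is_skewed(task_durations_s, ratio=5.0):
    # Heuristic: compare the slowest task to the median, mirroring
    # what you eyeball in the Spark UI's task-time distribution
    median = statistics.median(task_durations_s)
    return max(task_durations_s) > ratio * median

balanced = [30, 31, 29, 32, 30]
skewed = [30, 31, 29, 32, 600]  # one straggler partition

print(is_skewed(balanced))  # False
print(is_skewed(skewed))    # True
```

AQE applies a similar comparison per partition (a size factor against the median plus an absolute size threshold) before deciding to split a skewed partition.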

Q4. What's the difference between left_anti and NOT EXISTS?

What they test: SQL-to-Spark translation fluency. Also whether you know semi/anti joins are more efficient than LEFT JOIN + IS NULL.

Approach: Semantically identical: left_anti returns rows from the left with no match on the right. Under the hood, Spark can short-circuit as soon as any match is found, unlike a LEFT JOIN, which joins everything and then filters. On skewed joins, this matters.

Q5. Your join produces duplicate rows. How do you diagnose it?

What they test: Debugging instincts. The cause is almost always a many-to-many relationship on the join key.

Approach: Run df.groupBy(join_key).count().filter('count > 1').show() on both sides. The side with duplicates is producing the fan-out. Fix by deduplicating upstream, aggregating before joining, or using a semi join if you only need existence.
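The fan-out is pure multiplication: if a key appears m times on the left and n times on the right, an inner join emits m × n rows for that key. A toy check in plain Python (hypothetical keys, not Spark API):

```python
from collections import Counter

left_keys = ["a", "a", "b"]        # key "a" duplicated: m = 2
right_keys = ["a", "a", "a", "b"]  # key "a" duplicated: n = 3

lc, rc = Counter(left_keys), Counter(right_keys)

# Inner-join row count without materializing the join:
# sum of m * n over the keys present on both sides
rows = sum(lc[k] * rc[k] for k in lc.keys() & rc.keys())
print(rows)  # 2*3 + 1*1 = 7
```

This same arithmetic is why the groupBy-count check above pinpoints the offending side: any key with count > 1 on both sides multiplies.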

Q6. When would you use bucketing instead of partitioning for joins?

What they test: Data warehouse design knowledge. Partitioning is for pruning; bucketing is for shuffle elimination.

Approach: Bucketing pre-shuffles the data by a specific key. If both join sides are bucketed on the same key with the same bucket count, the shuffle step is eliminated. Use bucketing for tables joined repeatedly on the same key. Partitioning, by contrast, optimizes read pruning by filter predicate.
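Why equal bucket counts matter: each side assigns a row to bucket hash(key) % num_buckets at write time, so with matching counts, bucket i of one table only ever needs bucket i of the other. With mismatched counts that alignment breaks, and Spark must shuffle after all. A sketch where Python's hash stands in for Spark's bucketing hash:

```python
def bucket(key, n):
    # Stand-in for Spark's bucketing function: hash(key) % num_buckets
    return hash(key) % n

keys = list(range(100))

# Same bucket count on both sides: every key lands in the same bucket id
same_counts = all(bucket(k, 8) == bucket(k, 8) for k in keys)

# Different counts (8 vs 16): most keys land in different bucket ids,
# so bucket-to-bucket joining is impossible without a shuffle
diff_counts = all(bucket(k, 8) == bucket(k, 16) for k in keys)

print(same_counts, diff_counts)  # True False
```

The invariant looks trivial written this way, but it is exactly the guarantee Spark exploits: both writers committed to the same deterministic key-to-bucket function.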

Common mistakes

What breaks in practice

Joining on a nullable key without filtering

NULLs do not match NULLs in SQL or Spark joins. If your join key has NULLs, those rows silently disappear in an inner join or survive as unmatched rows in an outer join. Decide explicitly how to handle them.

Wrong

# Silently drops rows where customer_id is NULL on either side
orders.join(customers, "customer_id")

Right

# Make NULL handling explicit
orders.filter(orders.customer_id.isNotNull()) \
      .join(customers, "customer_id")
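A plain-Python sketch of why the NULL rows vanish: an inner join only emits rows whose keys compare equal, and in SQL semantics NULL never equals anything, including NULL. Toy data below; Python's None stands in for SQL NULL, so the non-match is enforced explicitly.

```python
orders = [(None, 10), ("c1", 20), ("c2", 30)]   # (customer_id, amount)
customers = [("c1", "Ada"), (None, "Ghost")]     # (customer_id, name)

def inner_join(left, right):
    out = []
    for lk, lv in left:
        for rk, rv in right:
            # SQL semantics: NULL = NULL is not true, so NULL keys never match
            if lk is not None and lk == rk:
                out.append((lk, lv, rv))
    return out

print(inner_join(orders, customers))  # [('c1', 20, 'Ada')]
```

Both NULL-keyed rows drop out with no warning, which is exactly the silent data loss the filter above makes visible.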

Broadcasting a DataFrame that's too large

broadcast() forces a plan change. If the DataFrame exceeds the driver's memory, the job OOMs on the driver before any work happens.

Wrong

# countries is actually 5 GB after joins. OOMs the driver
orders.join(broadcast(countries), "code")

Right

# Check size first: broadcast only when the side is genuinely small
n_rows = countries.count()
small_enough = n_rows < 100_000  # adjust the bound for your row width

if small_enough:
    result = orders.join(broadcast(countries), "code")
else:
    result = orders.join(countries, "code")

Using .join(df2) without an explicit condition

Spark accepts .join(df2) with no condition and produces a Cartesian product. On big tables this generates trillions of rows and hangs the cluster.

Wrong

# Cartesian product, may never finish
orders.join(customers).count()

Right

orders.join(customers,
            orders.customer_id == customers.id,
            "inner")

FAQ

Common questions

What is a broadcast join in Spark?
A broadcast join ships the smaller DataFrame to every executor, letting each executor join locally without shuffling the larger side. It applies when one side fits under spark.sql.autoBroadcastJoinThreshold (10 MB by default).
What is the default join type in Spark?
inner. If you omit the how parameter, Spark performs an inner join, returning only rows where the join condition matches on both sides.
How do I detect a skewed join?
Check the Spark UI stages page. A skewed join shows one or two tasks running far longer than peers, with a partition size histogram heavily right-tailed. Look for the Exchange node in the physical plan feeding into the SortMergeJoin.
What is the difference between left_semi and inner join?
left_semi returns rows from the left side that have at least one match on the right, without including right-side columns. inner returns all matching row combinations from both sides. Semi is faster when you only need existence.
Can I force Spark to use a sort-merge join?
Yes. Use the SHUFFLE_MERGE hint: df1.hint('shuffle_merge').join(df2, ...). You can also disable auto-broadcast by setting spark.sql.autoBroadcastJoinThreshold = -1.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.