PySpark Drop Duplicates: What Interviewers Actually Test
dropDuplicates() triggers a full shuffle. distinct() hashes every column in the row. Window deduplication is the only method that gives you deterministic control over which row survives. Interviewers test whether you know the tradeoffs between all three.
dropDuplicates() on Specific Columns
# Interviewers test: do you know which row survives?
# Answer: it is NOT deterministic. The surviving row is arbitrary.
deduped = df.dropDuplicates(["customer_id", "order_date"])
# dropDuplicates hashes only the columns you specify.
# This triggers a full shuffle (wide transformation).
# Shuffle cost scales with the number of columns you carry,
# not the number of columns in the dedup key.
dropDuplicates triggers a full shuffle. Spark redistributes every row by the specified columns so identical keys land on the same partition. The row that survives is arbitrary. If your interviewer asks "which record do you keep?", the answer is: you do not control it without a window function.
distinct() for Full-Row Deduplication
# distinct() hashes ALL columns in the DataFrame.
# On a 40-column table, that means hashing 40 fields per row.
deduped = df.distinct()
# Equivalent to dropDuplicates with no arguments:
deduped = df.dropDuplicates()
# If you only care about 3 columns being unique,
# distinct() wastes work hashing the other 37.
# Use dropDuplicates(["col1", "col2", "col3"]) instead.
distinct() hashes all N columns. On wide tables this is expensive. The tradeoff: distinct guarantees full-row uniqueness, while dropDuplicates(subset) only guarantees uniqueness on the subset columns. Interviewers test whether you understand that difference and pick the right one.
Window Function Deduplication (Deterministic Row Selection)
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Interviewers expect this pattern for "keep the latest record"
w = Window.partitionBy("customer_id").orderBy(F.desc("updated_at"))
deduped = (
df
.withColumn("rn", F.row_number().over(w))
.filter(F.col("rn") == 1)
.drop("rn")
)
# Why row_number and not rank?
# rank() assigns the same number to ties, so you might keep 2+ rows.
# row_number() always picks exactly one row per partition.
This is the only dedup method that gives you deterministic control over which row survives. You choose the orderBy column. Interviewers look for: partitionBy on the dedup key, orderBy on the tiebreaker, row_number (not rank), and filter to rn == 1. If you use rank instead of row_number, you can still get duplicates when ties exist.
Drop Duplicate Columns After a Join
# After a join, both tables contribute their join key column.
# This creates ambiguous references.
result = orders.join(customers, orders.id == customers.id)
# result now has two "id" columns
# Fix 1: Use list syntax. Spark keeps only one copy of the key.
result = orders.join(customers, on=["id"], how="inner")
# Fix 2: Drop the duplicate explicitly.
result = orders.join(customers, orders.id == customers.id) \
.drop(customers.id)
# Fix 3: Alias and select.
result = orders.alias("o").join(customers.alias("c"), "id") \
.select("o.*", F.col("c.name").alias("customer_name"))
Duplicate column names after joins break downstream operations like .select("id"). The list syntax on=["id"] is the cleanest fix because Spark automatically deduplicates the join key. A strong interview answer mentions all three approaches and explains when each is appropriate.
PySpark Drop Duplicates FAQ
What is the difference between dropDuplicates and distinct in PySpark?
distinct() deduplicates on all columns; dropDuplicates() accepts a subset of columns and only guarantees uniqueness on that subset. With no arguments, dropDuplicates() is equivalent to distinct().
Is dropDuplicates deterministic in PySpark?
No. When multiple rows share the dedup key, the surviving row is arbitrary. Use a window function with row_number() if you need deterministic control over which row is kept.
Does dropDuplicates trigger a shuffle?
Yes. It is a wide transformation: Spark redistributes rows by the dedup key so identical keys land on the same partition.
Practice PySpark Dedup Patterns Before Your Interview
DataDriven has PySpark challenges that test dropDuplicates, window dedup, and join deduplication against real datasets.
Start Practicing