PySpark Drop Duplicates: What Interviewers Actually Test
dropDuplicates() triggers a full shuffle. distinct() hashes every column in the row. Window deduplication is the only method that gives you deterministic control over which row survives. Interviewers test whether you know the tradeoffs between all three.
dropDuplicates() on Specific Columns
# Interviewers test: do you know which row survives?
# Answer: it is NOT deterministic. The surviving row is arbitrary.
deduped = df.dropDuplicates(["customer_id", "order_date"])
# dropDuplicates hashes only the columns you specify.
# This triggers a full shuffle (wide transformation).
# Shuffle cost scales with the number of columns you carry,
# not the number of columns in the dedup key.
dropDuplicates triggers a full shuffle. Spark redistributes every row by the specified columns so identical keys land on the same partition. The row that survives is arbitrary. If your interviewer asks "which record do you keep?", the answer is: you do not control it without a window function.
distinct() for Full-Row Deduplication
# distinct() hashes ALL columns in the DataFrame.
# On a 40-column table, that means hashing 40 fields per row.
deduped = df.distinct()
# Equivalent to dropDuplicates with no arguments:
deduped = df.dropDuplicates()
# If you only care about 3 columns being unique,
# distinct() wastes work hashing the other 37.
# Use dropDuplicates(["col1", "col2", "col3"]) instead.
distinct() hashes all N columns. On wide tables this is expensive. The tradeoff: distinct guarantees full-row uniqueness, while dropDuplicates(subset) only guarantees uniqueness on the subset columns. Interviewers test whether you understand that difference and pick the right one.
Window Function Deduplication (Deterministic Row Selection)
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Interviewers expect this pattern for "keep the latest record"
w = Window.partitionBy("customer_id").orderBy(F.desc("updated_at"))
deduped = (
df
.withColumn("rn", F.row_number().over(w))
.filter(F.col("rn") == 1)
.drop("rn")
)
# Why row_number and not rank?
# rank() assigns the same number to ties, so you might keep 2+ rows.
# row_number() always picks exactly one row per partition.
This is the only dedup method that gives you deterministic control over which row survives. You choose the orderBy column. Interviewers look for: partitionBy on the dedup key, orderBy on the tiebreaker, row_number (not rank), and filter to rn == 1. If you use rank instead of row_number, you can still get duplicates when ties exist.
Drop Duplicate Columns After a Join
# After a join, both tables contribute their join key column.
# This creates ambiguous references.
result = orders.join(customers, orders.id == customers.id)
# result now has two "id" columns
# Fix 1: Use list syntax. Spark keeps only one copy of the key.
result = orders.join(customers, on=["id"], how="inner")
# Fix 2: Drop the duplicate explicitly.
result = orders.join(customers, orders.id == customers.id) \
.drop(customers.id)
# Fix 3: Alias and select.
result = orders.alias("o").join(customers.alias("c"), "id") \
.select("o.*", F.col("c.name").alias("customer_name"))
Duplicate column names after joins break downstream operations like .select("id"). The list syntax on=["id"] is the cleanest fix because Spark automatically deduplicates the join key. A strong interview answer mentions all three approaches and explains when each is appropriate.
PySpark Drop Duplicates FAQ
What is the difference between dropDuplicates and distinct in PySpark?
distinct() deduplicates on all columns; dropDuplicates() accepts a subset of columns and only guarantees uniqueness on that subset. With no arguments, dropDuplicates() is equivalent to distinct().
Is dropDuplicates deterministic in PySpark?
No. When multiple rows share the dedup key, the surviving row is arbitrary. Use a window function with row_number() if you need deterministic control over which row is kept.
Does dropDuplicates trigger a shuffle?
Yes. It is a wide transformation: Spark redistributes rows by the dedup key so identical keys land on the same partition.
Practice PySpark Dedup Patterns Before Your Interview
DataDriven has PySpark challenges that test dropDuplicates, window dedup, and join deduplication against real datasets.
Start Practicing