Deduplication: Beginner
Deduplication is the single most common task in real data engineering work. Every CDC stream eventually emits the same row twice during a failover. Every event tracking system has clients that retry on network failures and emit duplicate events. Every CSV import has rows that someone copy-pasted by mistake. Every interview asks some version of 'find or remove the duplicates.' The patterns are small, the failure modes are subtle, and the difference between 'exact duplicate' and 'duplicate by key' is the most common source of bugs. This lesson teaches you to recognize the variant, pick the right tool, and pick the right copy when duplicates exist.
Distinguish exact duplicates from duplicates by key
Reach for DISTINCT for full-row duplicates and ROW_NUMBER + filter for keyed duplicates
Pick the right copy with a deterministic tiebreaker
Detect duplicates as a separate query from removing them
"Remove Duplicates" Requires a Definition
Recognize that deduplication questions require you to define what constitutes a duplicate (exact, fuzzy, time-window).
Here is the question every entry-level interview asks in some form: 'this orders table has duplicate rows; keep one row per (customer_id, order_date) and pick the most recent one.' The candidate who writes SELECT DISTINCT * gets the wrong answer because the rows are not byte-for-byte identical; they differ in created_at or some downstream-only column. The candidate who writes a GROUP BY collapses the row count but loses the per-row data. The candidate who reaches for ROW_NUMBER OVER (PARTITION BY customer_id, order_date ORDER BY created_at DESC) and filters to rn = 1 keeps exactly one row per key, picking the most recent by the tiebreaker column.
The two questions to ask before writing SQL:
- ▸Are the rows byte-for-byte identical, or duplicate by key?
- ▸When duplicates exist, which copy should we keep?
- ▸First question picks the tool (DISTINCT vs ROW_NUMBER)
- ▸Second question picks the tiebreaker (ORDER BY)
Two kinds of duplicate: exact vs by key
An exact duplicate is a row where every column has the same value as another row. Two rows with the same customer_id, same order_date, same amount, same created_at: byte-for-byte identical. DISTINCT removes these cleanly. A duplicate by key is a row where some subset of columns has the same value but other columns differ. Two rows with the same customer_id and order_date, but different created_at and different amount (because one is a correction). DISTINCT cannot help here because the rows are not identical; you have to pick which copy to keep based on a business rule. Almost every interview deduplication question is the second case.
Deduplication powers every data-platform invariant. CDC streams (the same row gets emitted multiple times during database failovers). Event logs (clients retry on network failures and emit duplicate events with the same event_id). Slowly-changing dimensions (multiple versions of an entity over time; pick the latest). Imported data (CSV files where someone copy-pasted a region of rows). Joins on non-unique keys (a one-to-many fan-out that should have been one-to-one). Every data engineer has shipped a dedup bug; the pattern recurs because the right tool depends on which kind of duplicate is in the data.
•Weak opening
- Reaches for DISTINCT without checking whether the rows are identical
- Returns the same row count as the input because the rows differ in some column
- Has to be corrected when the interviewer asks 'why didn't dedup work?'
- Spends time recovering from the wrong tool choice
✓Strong opening
- Asks 'are the duplicates byte-for-byte identical, or just duplicate by key?'
- Picks DISTINCT for the first case; ROW_NUMBER + filter for the second
- Names the tiebreaker column when ROW_NUMBER is the right tool
- Has the right query in twelve lines without backtracking
DISTINCT vs GROUP BY for Whole Rows
Write the standard pattern: ROW_NUMBER() OVER (PARTITION BY dedup_key ORDER BY tiebreaker) = 1.
The simplest case: the rows are byte-for-byte identical, and you want one copy of each. SELECT DISTINCT is the one-line answer. It is the right tool when every column has the same value across duplicate rows, which happens with CSV imports where someone duplicated a region, with UNION ALL queries that should have been UNION, and with simple lookup tables. Anything more complex needs the next section's tool.
| 1 | /* Remove exact duplicates from an imported CSV */ |
| 2 | SELECT DISTINCT |
| 3 | customer_id, |
| 4 | email, |
| 5 | signup_date, |
| 6 | segment |
| 7 | FROM imported_customers |
DISTINCT considers all columns in the SELECT list together. Two rows are duplicates only if every column matches. The result has one row per unique combination of (customer_id, email, signup_date, segment). The row count drops if there were duplicates; the row count stays the same if every row was already unique.
DISTINCT vs UNION vs GROUP BY (the three forms that do the same thing)
Three forms produce the same exact-duplicate removal on a single source. SELECT DISTINCT col1, col2 FROM t. SELECT col1, col2 FROM t GROUP BY col1, col2 (every column in the group-by, no aggregate). SELECT col1, col2 FROM t UNION SELECT col1, col2 FROM t (UNION without ALL deduplicates). All three return the same row count. DISTINCT is the most readable; the other two are valid but read as if they had a different purpose. State this when the interviewer probes: 'I'm using DISTINCT here because it reads as deduplication; GROUP BY would read as aggregation, which is not what I'm doing.'
- the readable form. Names the intent.
- valid but reads as aggregation, not dedup.
- UNION without ALL deduplicates; rarely the right reach for single-source dedup.
NULL has a quirk in DISTINCT: two NULL values are treated as duplicates of each other, even though NULL = NULL is UNKNOWN in three-valued logic. DISTINCT recognizes NULL as a value for deduplication purposes; rows where customer_id is NULL collapse together. This is what you usually want, but state it explicitly: 'NULL values in this column are treated as duplicates of each other by DISTINCT, which is the convenient default but worth knowing if NULL semantics matter for your data.'
•Use DISTINCT when
- The rows are byte-for-byte identical across all SELECT columns
- The dedup is a one-time cleanup (imported CSV, ad-hoc query)
- No tiebreaker is needed; any copy is as good as any other
- Readability matters; DISTINCT names the intent
✓DISTINCT is the wrong reach when
- Rows are duplicates by key but differ in some non-key column
- You need to pick a specific copy (latest, highest version, most complete)
- The dedup is part of a larger query that needs to preserve the picked row's other columns
- The data has timestamps or version markers that should drive the choice
The first follow-up: count the dupes before removing
Before dedup, count what the dedup will remove. A query that loses 10% of rows to DISTINCT is suspicious; the duplicates might be a signal of a real upstream bug rather than benign. Run two queries: the source count, and the DISTINCT count. The difference is the number of dropped rows. If the difference is unexpected, investigate before deduplicating.
| 1 | /* Count duplicates before removing them */ |
| 2 | SELECT |
| 3 | COUNT(*) AS total_rows, |
| 4 | COUNT(DISTINCT |
| 5 | (customer_id, email, signup_date) |
| 6 | ) AS unique_rows, |
| 7 | COUNT(*) - COUNT(DISTINCT |
| 8 | (customer_id, email, signup_date) |
| 9 | ) AS duplicate_count |
| 10 | FROM imported_customers |
COUNT(DISTINCT ...) returns the number of unique combinations of the listed columns. The difference is the duplicate count. State this when the interviewer asks how you would verify a dedup: 'I would count the source rows, count the unique-key rows, and compare. If the difference is unexpected, I investigate before deduplicating.' That investigation-first habit reads as production-experience.
TIP
DISTINCT is the right tool for one specific case: full-row exact duplicates. Reaching for it on duplicates-by-key is the most common dedup mistake. Always ask: 'are the rows byte-for-byte identical, or do they differ in some column?' before picking DISTINCT. If they differ, the next section's tool is what you want.
Identifying the Duplicate Key
Know when each deduplication method applies: DISTINCT for exact rows, GROUP BY for aggregation, ROW_NUMBER for keeping specific rows.
The common case: rows are duplicates by some key (customer_id, order_date) but differ in other columns (created_at, amount, status). DISTINCT does not help. The canonical tool is ROW_NUMBER OVER (PARTITION BY key ORDER BY tiebreaker) filtered to rn = 1. The partition defines the key; the order defines which copy is picked first; the filter keeps only the picked copy. This is the pattern that handles 80% of real-world dedup questions.
| 1 | /* Keep one row per (customer_id, order_date), picking the most recent by created_at */ |
| 2 | WITH ranked AS ( |
| 3 | SELECT |
| 4 | order_id, |
| 5 | customer_id, |
| 6 | order_date, |
| 7 | amount, |
| 8 | status, |
| 9 | created_at, |
| 10 | ROW_NUMBER() OVER ( |
| 11 | PARTITION BY customer_id, order_date |
| 12 | ORDER BY created_at DESC |
| 13 | ) AS rn |
| 14 | FROM orders |
| 15 | ) |
| 16 | |
| 17 | SELECT |
| 18 | order_id, |
| 19 | customer_id, |
| 20 | order_date, |
| 21 | amount, |
| 22 | status, |
| 23 | created_at |
| 24 | FROM ranked |
| 25 | WHERE rn = 1 |
PARTITION BY customer_id, order_date defines the dedup key: rows with the same (customer_id, order_date) are duplicates of each other. ORDER BY created_at DESC picks the tiebreaker: among duplicates, the row with the most recent created_at gets rn = 1, the next most recent gets rn = 2, and so on. The outer WHERE rn = 1 keeps exactly one row per (customer_id, order_date) group: the most recent. The pattern generalizes to any key and any tiebreaker; change the PARTITION BY and ORDER BY clauses to match the question.
Why a CTE is mandatory here
You cannot filter on the rn column in the same SELECT that defines it; SQL evaluates WHERE before window functions. The rn column does not exist yet when WHERE runs. You have to compute rn first in a CTE or subquery, then filter on it in an outer query. The classic failure: putting WHERE rn = 1 in the same SELECT as the ROW_NUMBER expression. The engine throws a syntax error or a logical error depending on the engine; either way the candidate has to rewrite. State this when you write the query: 'I'm wrapping the ROW_NUMBER in a CTE because the outer WHERE cannot filter on a window function output in the same SELECT.'
Why the CTE is mandatory for ROW_NUMBER + filter:
- ▸SQL evaluates WHERE before window functions
- ▸The rn column does not exist when WHERE runs
- ▸Wrap in a CTE; filter rn = 1 in the outer SELECT
- ▸On Snowflake / BigQuery, use QUALIFY to skip the CTE wrapper
Walk through a small example
Suppose orders has three rows with (customer_id=1, order_date=2024-01-01): one created at 09:00 with amount=$100, one at 10:00 with $105, one at 11:00 with $110. The PARTITION BY groups them into one partition. The ORDER BY created_at DESC ranks them: rn=1 is the 11:00 row, rn=2 is the 10:00 row, rn=3 is the 09:00 row. The filter rn = 1 keeps only the 11:00 row. The other rows are dropped. The result has one row per (customer_id, order_date), with the most recent created_at.
| customer_id | order_date | created_at | amount | rn | Kept? |
|---|
| 1 | 2024-01-01 | 09:00 | 100 | 3 | no |
| 1 | 2024-01-01 | 10:00 | 105 | 2 | no |
| 1 | 2024-01-01 | 11:00 | 110 | 1 | YES |
DISTINCT ON, the Postgres shortcut
Postgres has a syntactic shortcut for this pattern: DISTINCT ON (key) ... ORDER BY key, tiebreaker. The engine partitions by the DISTINCT ON columns and keeps the first row per partition based on the ORDER BY. The query is shorter but only works on Postgres and a few derivatives. On every other engine the ROW_NUMBER pattern is the right reach. State the dialect difference when writing: 'on Postgres I would use DISTINCT ON for this; everywhere else the ROW_NUMBER pattern is portable.'
| 1 | SELECT DISTINCT |
| 2 | ON(customer_id, order_date) order_id, |
| 3 | customer_id, |
| 4 | order_date, |
| 5 | amount, |
| 6 | status, |
| 7 | created_at |
| 8 | FROM orders |
| 9 | ORDER BY customer_id, order_date, created_at DESC ; |
The QUALIFY clause shortcut
Snowflake, BigQuery, and Databricks support QUALIFY, which filters on window-function output without a CTE wrapper. The query becomes one statement instead of two.
| 1 | /* Snowflake / BigQuery: filter window output without a CTE */ |
| 2 | SELECT |
| 3 | order_id, |
| 4 | customer_id, |
| 5 | order_date, |
| 6 | amount, |
| 7 | status, |
| 8 | created_at |
| 9 | FROM orders |
| 10 | QUALIFY ROW_NUMBER() OVER ( |
| 11 | PARTITION BY customer_id, order_date |
| 12 | ORDER BY created_at DESC |
| 13 | ) = 1 |
QUALIFY is to window functions what HAVING is to aggregates: a filter that runs after the function has been computed. It is the cleaner form on engines that support it. State the engine support when you write it: 'QUALIFY here because I'm on Snowflake; on Postgres I would use the CTE form.'
The ROW_NUMBER pattern is the workhorse for dedup. Memorize the shape; you will write it in every other interview that involves dirty data. Practice writing it in twelve lines without referencing anything; if you stumble on the syntax, the interviewer is reading you as inexperienced even if the technique is correct.
Keeping One Row Per Key
Handle near-duplicates: events within N seconds, case-insensitive matching, phonetic similarity.
ROW_NUMBER + filter only works if you can tell the engine which copy to pick. The tiebreaker in the ORDER BY is the business decision. Different tiebreakers produce different answers. The candidate at this level is being scored on whether they ask the tiebreaker question before writing the query, and whether they pick a deterministic tiebreaker that produces the same result on every run.
The tiebreaker is a business decision
'Keep the most recent row' is the default for many dedup questions, but other tiebreakers come up: keep the row with the highest version number, keep the row with the most complete data (fewest NULLs), keep the row that came from the highest-trust source, keep the row with the largest amount, keep the row with the most recent status of 'active.' Each tiebreaker is a different business rule, and the right one depends on what the consumer of the deduplicated table needs. Ask before writing: 'when there are duplicates, what's the rule for picking the survivor?' The answer drives the ORDER BY.
•Tiebreaker patterns
- ORDER BY created_at DESC: latest row wins (most common)
- ORDER BY version DESC: highest version wins (SCD type 2 dims)
- ORDER BY source_priority ASC, created_at DESC: trust-then-latest
- ORDER BY CASE WHEN status='active' THEN 0 ELSE 1 END, created_at DESC: active-then-latest
✓Each rule applies when
- Latest wins: CDC streams, event retries, simple dedup
- Highest version: explicit versioning is in the schema
- Trust-then-latest: multiple upstream sources of varying quality
- Active-then-latest: status-aware dedup; prefer in-state rows
Deterministic tiebreaking
The ORDER BY column has to be unique across the rows in the partition, or the engine picks arbitrarily between ties. If two duplicate rows have the same created_at to the second, the engine returns either one at random; the same query against the same data may return different rows on different runs. The fix is a secondary tiebreaker: ORDER BY created_at DESC, order_id ASC. order_id is unique per row, which guarantees determinism. State this rule explicitly: 'I'd add order_id as a secondary tiebreaker so the result is deterministic on tied created_at.'
| 1 | /* Deterministic dedup with primary and secondary tiebreakers */ |
| 2 | WITH ranked AS ( |
| 3 | SELECT |
| 4 | order_id, |
| 5 | customer_id, |
| 6 | order_date, |
| 7 | amount, |
| 8 | created_at, |
| 9 | ROW_NUMBER() OVER ( |
| 10 | PARTITION BY customer_id, order_date |
| 11 | ORDER BY created_at DESC, order_id /* primary, then secondary */ |
| 12 | ) AS rn |
| 13 | FROM orders |
| 14 | ) |
| 15 | |
| 16 | SELECT |
| 17 | order_id, |
| 18 | customer_id, |
| 19 | order_date, |
| 20 | amount, |
| 21 | created_at |
| 22 | FROM ranked |
| 23 | WHERE rn = 1 |
The 'most complete row' tiebreaker
A common variant: when two duplicates exist, keep the row with the fewest NULL values (most complete data). This is harder to express in a single ORDER BY because 'completeness' is a count of non-NULLs across columns. The pattern uses a computed column.
| 1 | /* Pick the most-complete row per key */ |
| 2 | WITH ranked AS ( |
| 3 | SELECT |
| 4 | customer_id, |
| 5 | email, |
| 6 | phone, |
| 7 | address, |
| 8 | created_at, |
| 9 | ( |
| 10 | CASE |
| 11 | WHEN email IS NOT NULL THEN 1 |
| 12 | ELSE 0 |
| 13 | END + CASE |
| 14 | WHEN phone IS NOT NULL THEN 1 |
| 15 | ELSE 0 |
| 16 | END + CASE |
| 17 | WHEN address IS NOT NULL THEN 1 |
| 18 | ELSE 0 |
| 19 | END |
| 20 | ) AS completeness_score, /* Count non-NULL columns; higher means more complete */ |
| 21 | ROW_NUMBER() OVER ( |
| 22 | PARTITION BY customer_id |
| 23 | ORDER BY completeness_score DESC, created_at DESC |
| 24 | ) AS rn |
| 25 | FROM customers |
| 26 | ) |
| 27 | |
| 28 | SELECT |
| 29 | customer_id, |
| 30 | email, |
| 31 | phone, |
| 32 | address, |
| 33 | created_at |
| 34 | FROM ranked |
| 35 | WHERE rn = 1 |
The completeness_score column counts how many of the optional columns are non-NULL. The ORDER BY ranks rows by completeness first, then by created_at as a tiebreaker. Result: the row with the most filled-in data wins; ties on completeness fall back to recency. This pattern is common in customer-master deduplication where multiple upstream sources contribute partial information.
The 'merge instead of pick' alternative
Sometimes the right answer is not to pick one row but to merge values across duplicates. Pick the most recent created_at, the highest amount, the longest email, and the most complete address, even though those come from different duplicate rows. This is the canonical record pattern, expressed with aggregates: GROUP BY the dedup key and MAX/MIN/FIRST_VALUE the other columns.
| 1 | /* Merge duplicates into a canonical record (not pick one row) */ |
| 2 | SELECT |
| 3 | customer_id, |
| 4 | MAX(created_at) AS last_created_at, |
| 5 | MAX(amount) AS max_amount, |
| 6 | MAX( |
| 7 | CASE |
| 8 | WHEN email IS NOT NULL THEN email |
| 9 | END |
| 10 | ) AS email, |
| 11 | COUNT(*) AS source_row_count |
| 12 | FROM orders |
| 13 | GROUP BY customer_id |
This is not deduplication in the strict sense; it is consolidation. Each output row is a synthesis of all duplicate rows for the key. State the distinction when the interviewer asks: 'pick-one and merge-into-canonical are different contracts. Pick-one returns rows from the source; merge-into-canonical may produce rows that did not exist in the source.' That distinction matters when the consumer of the deduplicated table expects 'real' rows versus 'synthesized' rows.
The tiebreaker conversation is what the interviewer is reading you for at the entry level. The SQL pattern (ROW_NUMBER + filter) is mechanical; the choice of tiebreaker is the design call. Ask the question; pick the tiebreaker; defend the choice with the consumer's question. That sequence is the move.
Why DISTINCT Alone Is Often Wrong
Discuss idempotent deduplication in ETL: dedup-on-write vs dedup-on-read, MERGE semantics, and exactly-once guarantees.
Past correctness, the interviewer wants to know whether you treat dedup as a one-step operation or as a multi-step investigation. Detecting duplicates is a different query from removing them. Counting duplicates is a different query from either. Each is useful at a different point in the workflow; mixing them up loses the diagnostic signal that drives the decision.
Detection: finding the duplicates
Before deduplicating, find the duplicates. The query: GROUP BY the dedup key, HAVING COUNT(*) > 1. The result is the list of keys that have more than one row, with the count of how many.
| 1 | /* Find duplicate (customer_id, order_date) pairs */ |
| 2 | SELECT |
| 3 | customer_id, |
| 4 | order_date, |
| 5 | COUNT(*) AS duplicate_count |
| 6 | FROM orders |
| 7 | GROUP BY customer_id, order_date |
| 8 | HAVING COUNT(*) > 1 |
| 9 | ORDER BY duplicate_count DESC, customer_id, order_date |
This query produces the diagnostic view: which keys have duplicates, how many. If the consumer of this query is a human investigating data quality, this is the right output. If you find ten thousand duplicates concentrated on one customer_id, that is a different problem (probably an upstream join bug) than if you find scattered duplicates across many customers (probably client retries). The detection query is the investigation; the dedup query is the cleanup.
The most common dedup gotchas
Three gotchas show up in real data. First: case sensitivity. 'alice@example.com' and 'Alice@Example.com' might be the same email logically but different strings to the database. Normalize before comparing: LOWER(email) for case-insensitive dedup. Second: whitespace. Trailing spaces, tab characters, byte-order marks. TRIM before comparing. Third: NULL handling. NULL in the dedup key means the row will never match another row in that partition (because NULL != NULL in three-valued logic). If you want NULLs to be treated as equal for dedup purposes, COALESCE them to a sentinel value before partitioning.
Case: 'alice@x.com' vs 'Alice@X.com' → LOWERWhitespace: trailing spaces, BOMs → TRIMNULLs: NULL != NULL in partitions → COALESCE to sentinelEach normalization happens before PARTITION BYOtherwise dirty variants do not dedupe together
| 1 | /* Normalize before deduplicating */ |
| 2 | WITH normalized AS ( |
| 3 | SELECT |
| 4 | customer_id, |
| 5 | LOWER(TRIM(email)) AS email_normalized, |
| 6 | TRIM(phone) AS phone_normalized, |
| 7 | COALESCE(address, '') AS address_normalized, |
| 8 | created_at |
| 9 | FROM customers |
| 10 | ), |
| 11 | ranked AS ( |
| 12 | SELECT |
| 13 | *, |
| 14 | ROW_NUMBER() OVER ( |
| 15 | PARTITION BY customer_id, email_normalized |
| 16 | ORDER BY created_at DESC |
| 17 | ) AS rn |
| 18 | FROM normalized |
| 19 | ) |
| 20 | |
| 21 | SELECT |
| 22 | * |
| 23 | FROM ranked |
| 24 | WHERE rn = 1 |
Dedup as an idempotent transform
A good dedup query is idempotent: running it on already-deduplicated input produces the same output. ROW_NUMBER + filter has this property; running it twice does not change the result. DISTINCT has this property. GROUP BY has this property. Idempotency matters for pipelines: if a dedup step runs as part of a refresh, running the refresh again should not change the output (assuming no new upstream data). State idempotency as a property of the chosen tool: 'this dedup pattern is idempotent; rerunning produces the same result.'
At Instacart in 2021, the order-fact pipeline shipped a dedup bug that took six weeks to find. The query used SELECT DISTINCT on a join result that had a deterministic-looking but non-unique column drift (one of the dimension joins fanned out occasionally because the dim table had a soft-delete row that should have been filtered). The DISTINCT collapsed visible duplicates but missed the case where the duplicates differed in a created_at column the consumer cared about. The fix was to switch from DISTINCT to ROW_NUMBER + filter with an explicit tiebreaker, and to add a detection-then-removal pattern so the dedup step also reported any unexpected dup counts to the team. The runbook line was 'any analytical dedup in this codebase uses ROW_NUMBER + filter with a named tiebreaker, and emits a metric for unexpected dup counts. DISTINCT is for one-time CSV cleanups only.' The candidate who names this distinction unprompted reads as someone who has been on the wrong side of a silent-DISTINCT bug.
| Situation | Phrasing that flatlines | Phrasing that lands |
|---|
| You see 'remove duplicates' | "SELECT DISTINCT." | "Two questions first: are the rows byte-for-byte identical, or duplicate by key? And which copy do we keep when there are duplicates?" |
| The interviewer says 'keep the most recent' | "ORDER BY created_at DESC." | "ROW_NUMBER over (PARTITION BY key ORDER BY created_at DESC, tiebreaker_id ASC), filtered to rn = 1. The secondary tiebreaker makes the result deterministic on tied timestamps." |
| The data has NULLs in the key | "I'll filter them out." | "NULL never equals NULL in three-valued logic, so NULLs in the key partition each NULL into its own group. If they should dedupe together, COALESCE to a sentinel before PARTITION BY." |
| The interviewer asks 'how do you verify' | "I'll trust the query." | "GROUP BY the key with HAVING COUNT(*) > 1 to find the duplicate keys before removal. Compare the source row count to the deduped count; an unexpected drop is a signal to investigate before deduplicating." |
| The data is dirty (case, whitespace) | "I'll dedupe as-is." | "Normalize first: LOWER, TRIM, COALESCE. Two strings that differ only by case or whitespace should dedupe together; comparing as-is misses them." |
Close with a four-sentence wrap. 'Deduplication is two questions: are the rows byte-for-byte identical, or duplicate by some key? And which copy do we keep when duplicates exist? For exact duplicates I use DISTINCT; for duplicates by key I use ROW_NUMBER OVER (PARTITION BY key ORDER BY tiebreaker) filtered to rn = 1. I add a secondary tiebreaker on a unique column so the result is deterministic on tied primary tiebreakers, and I normalize the data (LOWER, TRIM, COALESCE) before the dedup if dirty inputs are possible. Detection is a separate query: GROUP BY the key HAVING COUNT(*) > 1; running detection before removal surfaces upstream bugs before the cleanup hides them.' Four sentences. Pattern recognition, tool choice, determinism, detection. The shape generalizes to every dedup question.
❯❯❯PUTTING IT ALL TOGETHER
> You are in a data engineering phone screen at a logistics company. The interviewer asks: 'This orders table has duplicate rows from an upstream CDC stream. Keep one row per (customer_id, order_date), picking the most recent one.'
You say: 'Two questions first. Are the rows byte-for-byte identical, or duplicate by key? And which copy should I pick when there are duplicates?' The interviewer says: 'Duplicate by key; differ in created_at and amount. Pick the most recent by created_at.'
You say: 'Then ROW_NUMBER OVER (PARTITION BY customer_id, order_date ORDER BY created_at DESC, order_id ASC) filtered to rn = 1. The secondary tiebreaker on order_id makes the result deterministic.'
You write the CTE with the ROW_NUMBER and the outer filter; six lines of body.
You trace through a small example: three rows with (customer_id=1, order_date=2024-01-01) created at 09:00, 10:00, 11:00. rn=1 is the 11:00 row; rn=2 and rn=3 are dropped. Result has the most recent.
Follow-up: 'What if I see duplicates in the input?' You say: 'I'd run a detection query first: GROUP BY (customer_id, order_date) HAVING COUNT(*) > 1. If the duplicate counts look unexpected, that's a signal of an upstream bug, and I'd investigate before deduplicating.'
Closing: 'For CDC streams the canonical pattern is ROW_NUMBER + filter, not DISTINCT. The duplicates differ in non-key columns (created_at, amount), and DISTINCT does not generalize to picking one row per key.'