Deduplication: Intermediate

Past the ROW_NUMBER + filter pattern, deduplication becomes a pipeline design question. The interviewer assumes you can write the canonical dedup query; what they want to see is whether you have shipped idempotent dedup at scale, handled CDC streams whose source emits the same row twice during failover, and built the metrics that surface dedup-related data quality issues before consumers do. The query is a few lines either way. The design decisions around it determine whether the pipeline is robust against the real conditions production data engineering throws at it.
list
Architect idempotent dedup using MERGE or INSERT ... ON CONFLICT at write time
chart
Handle CDC streams where the same row arrives multiple times due to upstream failover
branch
Build dedup metrics that surface unexpected duplicate volume as data quality alerts
code
Choose between write-time and read-time dedup based on the workload's read-write ratio

ROW_NUMBER Deduplication

Recognize that deduplication questions require you to define what constitutes a duplicate (exact, fuzzy, time-window).

The question that recurs in interviews past the basics: 'design a pipeline that ingests a CDC stream of orders, where the upstream system emits the same order row multiple times during database failovers, and produces a clean orders fact table the analytics layer reads from.' The candidate who writes a ROW_NUMBER + filter query in a one-shot SELECT and stops there is producing a query that works on a snapshot. The candidate who designs the MERGE pattern with idempotent dedup keys, emits a metric for unexpected duplicate volume, and explains why this approach survives multiple ingestion runs without producing different results is designing the pipeline.

Three design decisions this lesson covers

First: write-time dedup. MERGE or INSERT ... ON CONFLICT patterns that dedupe as rows land in the target, so the target is always in a deduped state. Idempotency is the key property: rerunning the ingestion produces the same target rows, not duplicate target rows. Second: CDC-aware dedup. CDC streams emit the same row multiple times during failovers, retries, and reconnections; the dedup discipline catches the duplicates at the ingestion boundary and emits a metric so the platform team sees the volume. Third: data quality metrics. The dedup operation produces a count of removed duplicates; that count is itself a metric, alerted on when it drifts beyond historical norms.
  • MERGE or INSERT ... ON CONFLICT patterns that dedupe as rows land in the target. Idempotency by construction.
  • include the upstream LSN in the dedup key. Failover replays share the LSN; legitimate updates have new ones.
  • emit duplicate count per run; alert on drift. The metric surfaces upstream issues before consumers see them.
You are being tested on production-grade dedup when you hear:
  • "the CDC stream sends the same row twice during failovers"
  • "how do we make this dedup idempotent across pipeline reruns?"
  • "the duplicate count drifted last week; what do we do?"
  • "write-time dedup or read-time dedup, which one for this use case?"
  • "the consumer sees the same order twice in the dashboard"

Why these decisions matter in production

Real pipelines run repeatedly. A reingestion job that produces a different deduped result on rerun is a pipeline that violates idempotency, which means rollbacks are unpredictable and recovery from partial failures requires manual intervention. A CDC stream that delivers duplicates the platform never measures is a stream whose data quality issues surface as consumer-facing bugs rather than upstream alerts. The dedup operation is a hinge point: design it well and the downstream consumers see clean data; design it poorly and every failure mode upstream becomes a customer-visible bug.

Choosing Which Duplicate to Keep (ORDER BY)

Write the standard pattern: ROW_NUMBER() OVER (PARTITION BY dedup_key ORDER BY tiebreaker) = 1.

Write-time dedup means the deduplication happens as rows land in the target table, not when consumers query it. The target is always in a deduped state. The pattern is MERGE on engines that support it (Snowflake, BigQuery, SQL Server, Oracle, Postgres 15+) or INSERT ... ON CONFLICT on Postgres-family engines, with the dedup key matching the source's natural key.

The MERGE pattern for idempotent dedup

1MERGE INTO orders_target target
2USING (
3 SELECT order_id, customer_id, order_date, amount, status, created_at, ROW_NUMBER() OVER (PARTITION BY order_id
4 ORDER BY created_at DESC, source_offset DESC) AS rn
5 FROM orders_staging
6) source
7ON target.order_id = source.order_id
8WHEN MATCHED AND source.rn = 1 AND source.created_at > target.created_at THEN UPDATE SET amount = source.amount, status = source.status, created_at = source.created_at
9WHEN NOT MATCHED AND source.rn = 1 THEN INSERT (order_id, customer_id, order_date, amount, status, created_at
10) VALUES (source.order_id, source.customer_id, source.order_date, source.amount, source.status, source.created_at
11)
12/* MERGE pattern: idempotent dedup of incoming orders */

Reading the pattern

The USING clause includes a ROW_NUMBER that picks the most recent row per order_id. The MERGE clause then UPDATES the target if a row exists and the source is newer, or INSERTS if it does not. The 'AND source.rn = 1' filter on each branch ensures only the picked row enters the MERGE. Idempotency holds because rerunning produces the same set of source-rn=1 rows, which produce the same MERGE actions; the target reaches the same state.
Why ROW_NUMBER in the USING clause is mandatory:
  • Source can have multiple rows per dedup key (CDC duplicates)
  • MERGE with duplicate source keys produces engine-specific behavior
  • ROW_NUMBER + WHERE rn=1 ensures one source row per target
  • Idempotency holds: same source → same MERGE actions → same target state

Why ROW_NUMBER in the USING clause

If the source has multiple rows per order_id (CDC stream emitting duplicates), a MERGE without dedup tries to apply multiple updates to the same target row, which engines treat as an error or as last-write-wins depending on the dialect. The ROW_NUMBER picks one row per key inside the USING clause, so the MERGE sees exactly one source row per target. State this when writing: 'the ROW_NUMBER in USING is mandatory; without it, the MERGE behavior on duplicate source keys is engine-specific and unsafe to rely on.'

INSERT ... ON CONFLICT for Postgres-family

1INSERT INTO orders_target(order_id, customer_id, order_date, amount, status, created_at)
2SELECT DISTINCT
3 ON(order_id) order_id,
4 customer_id,
5 order_date,
6 amount,
7 status,
8 created_at
9FROM orders_staging
10ORDER BY order_id, created_at DESC, source_offset DESC
11 ON CONFLICT(order_id) DO UPDATE SET amount = EXCLUDED.amount, status = EXCLUDED.status, created_at = EXCLUDED.created_at
12WHERE EXCLUDED.created_at > orders_target.created_at ;
DISTINCT ON picks the most recent row per order_id in the SELECT. ON CONFLICT routes to the UPDATE branch when a target row exists, with the WHERE clause preventing older source rows from overwriting newer target rows. Idempotency holds for the same reasons as the MERGE pattern.
Reach for MERGE when
  • The engine supports it (Snowflake, BigQuery, SQL Server, Oracle, Postgres 15+)
  • The dedup logic needs separate UPDATE and INSERT actions
  • The team standardizes on MERGE syntax across pipelines
  • Multi-statement transactions are acceptable
Reach for INSERT ... ON CONFLICT when
  • Engine is Postgres or compatible
  • The dedup logic is mostly upsert (insert or update; no complex branching)
  • The team standardizes on Postgres-family syntax
  • The query needs to be portable across CockroachDB or similar

The WHERE clause that protects newer data

The WHERE EXCLUDED.created_at > orders_target.created_at in the ON CONFLICT clause (or the equivalent in MERGE) is essential. Without it, a delayed CDC message with an older timestamp could overwrite a newer target row with stale data. The clause guarantees the target row only updates if the incoming row is genuinely newer; older messages are silently dropped. State this when writing: 'the timestamp comparison in the UPDATE branch is the protection against out-of-order CDC delivery; it ensures the target reflects the latest state per key.'

DISTINCT vs GROUP BY vs ROW_NUMBER

Know when each deduplication method applies: DISTINCT for exact rows, GROUP BY for aggregation, ROW_NUMBER for keeping specific rows.

CDC streams are the canonical case for production dedup. Debezium, AWS DMS, Fivetran, Stitch, and Snowflake Streams all emit row-level changes from a source database to a downstream system. During failovers, retries, and reconnections, the same logical change can be emitted multiple times with different ingestion timestamps. The dedup pattern has to handle this without dropping legitimate updates and without keeping spurious duplicates.

The CDC duplicate scenario

Imagine the upstream MySQL primary fails over to a replica. The CDC connector reconnects and replays the last batch of changes it had not confirmed. The same order_update event for order_id=5000 arrives twice: once before the failover (created_at = 10:00:00) and once after (created_at = 10:00:00, but with a different ingestion_timestamp = 10:00:30). Both rows look legitimate. Without dedup, both writes apply to the target, which is benign for last-write-wins UPDATEs but breaks the rule that the platform can replay any CDC stream without producing different results.
Why CDC duplicates are a property of distributed systems:
  • Connector reconnects after failover; replays uncommitted batch
  • Same logical change emitted twice with different ingestion timestamps
  • Both look legitimate without LSN-aware dedup
  • Without dedup, replaying the stream produces different target state

The dedup key in CDC

1/* CDC stream dedup: pick one row per (source_key, source_lsn) */
2WITH ranked AS (
3 SELECT
4 source_key,
5 source_lsn, /* log sequence number from the upstream */
6 payload,
7 ingestion_timestamp,
8 ROW_NUMBER() OVER (
9 PARTITION BY source_key, source_lsn
10 ORDER BY ingestion_timestamp
11 ) AS rn
12 FROM cdc_stream
13)
14
15SELECT
16 source_key,
17 source_lsn,
18 payload,
19 ingestion_timestamp
20FROM ranked
21WHERE rn = 1
The dedup key for CDC is (source_key, source_lsn): the natural key plus the upstream's log sequence number. The LSN is unique per upstream commit, so duplicates from connector replay have the same LSN. The ROW_NUMBER + filter keeps one row per (source_key, source_lsn), picking the earliest ingestion (which is also the legitimate one; the duplicate is from the connector replay). State this when designing CDC dedup: 'the dedup key includes the upstream LSN, not just the natural key; without LSN, legitimate updates to the same row look like duplicates.'

LSN-aware MERGE

1MERGE INTO orders_target target
2USING (
3 SELECT *
4 FROM (SELECT payload : order_id :: BIGINT AS order_id, payload : status :: TEXT AS status, source_lsn, ingestion_timestamp, ROW_NUMBER() OVER (PARTITION BY payload : order_id
5 ORDER BY source_lsn DESC) AS rn
6 FROM cdc_stream
7 WHERE event_type IN ('insert', 'update'))
8 WHERE rn = 1
9) source
10ON target.order_id = source.order_id
11WHEN MATCHED AND source.source_lsn > target.last_processed_lsn THEN UPDATE SET status = source.status, last_processed_lsn = source.source_lsn
12WHEN NOT MATCHED THEN INSERT (order_id, status, last_processed_lsn
13) VALUES (source.order_id, source.status, source.source_lsn
14)
15/* MERGE with LSN-aware idempotency */
The target table includes a last_processed_lsn column; the UPDATE branch only fires when the incoming LSN is greater, which means the target only advances. Replaying the same LSN twice produces no-op MERGEs because the WHERE condition fails. Idempotency is now formally guaranteed: any subset of the CDC stream that includes the latest LSN per key produces the same target state.

Deletes in CDC dedup

CDC streams emit delete events too. The dedup pattern has to choose: hard-delete the target row, or soft-delete by setting a deleted_at column. Hard-deletes lose history; soft-deletes preserve it for audit but require every downstream query to filter on deleted_at IS NULL. Production patterns usually soft-delete, with a separate cleanup job hard-deleting rows past a retention window. State this when designing: 'soft-delete via deleted_at column for audit; hard cleanup after the retention window. The downstream queries assume deleted_at IS NULL as a default filter.'
At Square in 2022, the merchant data platform team rebuilt their CDC ingestion after a months-long debugging effort traced a recurring downstream-data-drift bug to LSN-unaware dedup. The original pipeline used (merchant_id, ingestion_timestamp) as the dedup key. Every connector failover replayed events with new ingestion_timestamps, so the dedup considered them distinct rows and applied each. Downstream merchants saw their order counts drift up over time, occasionally by 1-2% per quarter. The fix was to switch the dedup key to (merchant_id, source_lsn) and add the last_processed_lsn column on the target. The bug disappeared because connector replays had the same LSN as the original event; the MERGE became a no-op for replays and only fired for new LSNs. The team's runbook now includes 'every CDC dedup pipeline in this codebase uses LSN-aware dedup; ingestion_timestamp is not sufficient as a uniqueness signal.' The story is not a postmortem in the usual sense; the fix was found through deliberate metric analysis (the duplicate-count metric revealed the failover pattern), not through a customer-visible outage.
Hard-delete on CDC delete event
  • DELETE the target row immediately
  • Loses history; cannot audit what was deleted when
  • Simpler downstream queries (no deleted_at filter)
  • Right when the source treats deletes as 'never existed'
Soft-delete on CDC delete event
  • UPDATE the target row's deleted_at column
  • Preserves history; full audit trail
  • Downstream queries filter deleted_at IS NULL
  • Right when audit matters; separate cleanup handles physical deletion

Partial-Key and Multi-Column Duplicates

Handle near-duplicates: events within N seconds, case-insensitive matching, phonetic similarity.

The dedup operation produces a count of removed duplicates per run. That count is itself a metric. A pipeline that silently dedupes is a pipeline whose data quality issues are invisible; a pipeline that emits the duplicate count and alerts on drift catches upstream regressions before consumers do. This section covers the metric, the alert, and the operational pattern.

The duplicate-count metric

1WITH staging_count AS(SELECT COUNT(*) AS total_rows FROM orders_staging), dedup_count AS(SELECT COUNT(*) AS unique_rows FROM(SELECT DISTINCT ON(order_id, source_lsn) * FROM orders_staging ORDER BY order_id, source_lsn, ingestion_timestamp ASC) t) INSERT INTO pipeline_metrics(pipeline_name, run_ts, total_rows, unique_rows, duplicate_count)
2SELECT
3 'orders_ingest',
4 CURRENT_TIMESTAMP,
5 s.total_rows,
6 d.unique_rows,
7 s.total_rows - d.unique_rows
8FROM staging_count s, dedup_count d ;
Two CTEs compute the source row count and the deduped row count. The INSERT writes a single row into a pipeline_metrics table with all three numbers. Over time, the metric table contains the duplicate count per pipeline run; a dashboard or alert reads from it. The cost is a few extra queries per run; the benefit is observability into the dedup's behavior.

The alert pattern

Alert on drift: 'the duplicate count this run is more than 2x the rolling 30-day median.' A stable CDC stream produces a roughly stable duplicate volume (most failovers happen for similar durations and affect similar batch sizes). A sudden jump suggests an upstream issue: longer-than-usual outage, connector misconfiguration, replay storm. The alert routes to the platform team, who investigates the upstream cause. Without the alert, the same condition silently inflates the staging table and the pipeline absorbs the cost without surfacing the signal.
1/* Alert query: duplicate count drift */
2WITH recent_runs AS (
3 SELECT
4 run_ts,
5 duplicate_count,
6 MEDIAN(
7 duplicate_count
8 ) OVER (
9 ORDER BY run_ts
10 ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
11 ) AS rolling_30day_median
12 FROM pipeline_metrics
13 WHERE pipeline_name = 'orders_ingest'
14)
15
16SELECT
17 run_ts,
18 duplicate_count,
19 rolling_30day_median
20FROM recent_runs
21WHERE duplicate_count > 2 * rolling_30day_median
22AND rolling_30day_median > 0
23AND run_ts > CURRENT_TIMESTAMP - INTERVAL '24 hours'

Per-key duplicate distribution

Beyond the aggregate count, the distribution of duplicates per key surfaces different upstream issues. A few keys with hundreds of duplicates each (a CDC connector stuck in a loop on one row) is different from many keys with 2-3 duplicates each (normal failover replay). The distribution can be a secondary metric: max duplicates per key, p99 duplicates per key, count of keys with more than N duplicates. Each surfaces a different failure mode.

Dedup metrics as a contract

Publishing dedup metrics turns the dedup operation into a contract with downstream consumers. The consumer knows the historical duplicate rate; the metric drift surfaces upstream issues; the platform team has documented evidence of pipeline health. The metric is the deliverable, alongside the deduped data.

Deterministic Tiebreaks for Stable Output

Discuss idempotent deduplication in ETL: dedup-on-write vs dedup-on-read, MERGE semantics, and exactly-once guarantees.

The last design decision: write-time dedup vs read-time dedup. Write-time stores the deduped target; every consumer reads the deduped data. Read-time stores all rows including duplicates, and each consumer dedupes in their query. Each has a cost; each is right in a different workload. Picking based on read-write ratio is the design conversation.

Write-time dedup

Pros: every consumer reads clean data; no per-query dedup logic; storage is bounded by unique rows. Cons: the ingestion job is responsible for dedup correctness; bugs in the ingestion dedup affect every downstream; rerunning ingestion against a different dedup logic requires recomputing the target. Right for: high read-to-write ratios, multiple downstream consumers, stable dedup logic.

Read-time dedup

Pros: source-of-truth retains all rows including duplicates; consumers can implement different dedup logic per use case (some consumers might want all rows including duplicates for audit); rerunning with different logic is a query change, not a pipeline rerun. Cons: every consumer has to know the dedup pattern; per-query cost is paid per read; storage scales with raw event count, not unique rows. Right for: low read-to-write ratios, audit-heavy workloads, exploratory analytics where dedup logic evolves.
Write-time dedup wins when
  • Many consumers read the same data; one dedup is amortized across them
  • Dedup logic is stable; the discipline is established
  • Storage cost matters; deduped target is much smaller than raw event stream
  • Read latency matters; consumers cannot afford per-query dedup
Read-time dedup wins when
  • Source-of-truth must retain all rows for audit or replay
  • Different consumers want different dedup logic (some include duplicates)
  • Dedup logic is evolving; pipeline rerun cost outweighs per-query cost
  • Storage is cheap; deduped target wouldn't save significant cost

The hybrid pattern

Most production data platforms run a hybrid. The raw event stream is preserved unchanged (read-time semantics; can be replayed). A deduped target table is maintained by an ingestion job for the common-case readers (write-time semantics; fast reads). Specialized consumers query the raw events when they need audit trail or custom dedup; everyone else reads the deduped target. This gives both properties: source-of-truth retention and fast reads. State this when designing: 'raw events are the source of truth; a deduped table is the read surface; consumers pick which one fits their use case.'
Hybrid topology: raw + deduped together
  • Raw event stream preserved unchanged (read-time semantics; can be replayed)
  • Deduped target maintained by an ingestion job (write-time semantics; fast reads)
  • Common-case readers query the deduped target
  • Audit and custom-dedup consumers query the raw stream
  • Both properties: retention and read speed

Why this conversation matters at scale

Dedup at scale is not a SELECT-with-ROW_NUMBER decision; it is a workload-design decision that affects storage cost, read latency, audit capability, and the discipline required from each team that touches the pipeline. The interview tests this not because the SQL is complex, but because the design decisions around it are what determine whether the data platform serves consumers well or generates persistent data quality issues. The candidate who walks the read-write ratio question is the candidate who has owned the pipeline; the candidate who jumps to ROW_NUMBER on every dedup question has solved the SQL but not the design.
Workload-design decision affecting storage, latency, auditDiscipline required from each team that touches the pipelineRead-write ratio drives write-time vs read-timeAudit requirements drive raw retention vs deduped-onlyHybrid is the production default
PUTTING IT ALL TOGETHER

> You are in a data engineering interview at a logistics company. The interviewer asks: 'Design a pipeline that ingests a CDC stream of orders, where the upstream system emits the same order row multiple times during database failovers, and produces a clean orders fact table the analytics layer reads from.'

You frame the layers: raw event stream preserved unchanged; ingestion job MERGEs into the deduped target; analytics reads the target.
The dedup key is (order_id, source_lsn). LSN is the upstream's log sequence number; duplicates from connector replay share the same LSN. ROW_NUMBER picks rn=1 inside the USING clause so the MERGE sees one source row per target.
The MERGE has a timestamp comparison (source_lsn > target.last_processed_lsn) in the UPDATE branch. Out-of-order CDC delivery does not overwrite newer target data.
Follow-up: 'How do you know the dedup is working?' You say: 'Emit a duplicate-count metric per pipeline run. Alert when the count drifts beyond 2x the rolling 30-day median. The metric surfaces upstream issues before consumers see them.'
Follow-up: 'Hard-delete or soft-delete on CDC delete events?' You say: 'Soft-delete by default; deleted_at column preserves audit trail. A separate cleanup job hard-deletes past the retention window. Downstream queries filter deleted_at IS NULL.'
Follow-up: 'Why write-time dedup here?' You say: 'High read-to-write ratio; multiple consumers; stable dedup logic. Read-time would force every consumer to dedupe, paying the per-query cost N times. Hybrid retains the raw events for audit and replay.'
Closing: 'Production dedup is design, not syntax. MERGE for idempotency, LSN in the key for CDC correctness, metrics for observability, soft-delete for audit, hybrid topology for both retention and read speed.'

Real data has duplicates; the interview tests whether you can define "duplicate"

Category
SQL
Difficulty
intermediate
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: ROW_NUMBER Deduplication, Choosing Which Duplicate to Keep (ORDER BY), DISTINCT vs GROUP BY vs ROW_NUMBER, Partial-Key and Multi-Column Duplicates, Deterministic Tiebreaks for Stable Output

Lesson Sections

  1. ROW_NUMBER Deduplication (concepts: sqlDistinct)

    Three design decisions this lesson covers First: write-time dedup. MERGE or INSERT ... ON CONFLICT patterns that dedupe as rows land in the target, so the target is always in a deduped state. Idempotency is the key property: rerunning the ingestion produces the same target rows, not duplicate target rows. Second: CDC-aware dedup. CDC streams emit the same row multiple times during failovers, retries, and reconnections; the dedup discipline catches the duplicates at the ingestion boundary and emi

  2. Choosing Which Duplicate to Keep (ORDER BY) (concepts: sqlWindowDedup)

    Write-time dedup means the deduplication happens as rows land in the target table, not when consumers query it. The target is always in a deduped state. The pattern is MERGE on engines that support it (Snowflake, BigQuery, SQL Server, Oracle, Postgres 15+) or INSERT ... ON CONFLICT on Postgres-family engines, with the dedup key matching the source's natural key. The MERGE pattern for idempotent dedup Reading the pattern The USING clause includes a ROW_NUMBER that picks the most recent row per or

  3. DISTINCT vs GROUP BY vs ROW_NUMBER (concepts: sqlWindowDedup)

    CDC streams are the canonical case for production dedup. Debezium, AWS DMS, Fivetran, Stitch, and Snowflake Streams all emit row-level changes from a source database to a downstream system. During failovers, retries, and reconnections, the same logical change can be emitted multiple times with different ingestion timestamps. The dedup pattern has to handle this without dropping legitimate updates and without keeping spurious duplicates. The CDC duplicate scenario Imagine the upstream MySQL prima

  4. Partial-Key and Multi-Column Duplicates (concepts: sqlWindowDedup)

    The dedup operation produces a count of removed duplicates per run. That count is itself a metric. A pipeline that silently dedupes is a pipeline whose data quality issues are invisible; a pipeline that emits the duplicate count and alerts on drift catches upstream regressions before consumers do. This section covers the metric, the alert, and the operational pattern. The duplicate-count metric Two CTEs compute the source row count and the deduped row count. The INSERT writes a single row into a

  5. Deterministic Tiebreaks for Stable Output (concepts: sqlWindowDedup)

    The last design decision: write-time dedup vs read-time dedup. Write-time stores the deduped target; every consumer reads the deduped data. Read-time stores all rows including duplicates, and each consumer dedupes in their query. Each has a cost; each is right in a different workload. Picking based on read-write ratio is the design conversation. Write-time dedup Pros: every consumer reads clean data; no per-query dedup logic; storage is bounded by unique rows. Cons: the ingestion job is responsi