
Dedup at Scale

Concepts: Deduplication

What They Want to Hear: 'At 1 billion rows, ROW_NUMBER() with a well-chosen partition key works fine. At 10 billion, a MERGE/UPSERT pattern is more efficient: stage the incoming delta, then merge it against the target so only affected rows are touched. At 100 billion+, or when you need fuzzy matching, MinHash LSH (locality-sensitive hashing) cuts the comparison space from O(n^2) to near-linear by hashing similar records into the same buckets and comparing only within a bucket.'
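To make the MinHash LSH claim concrete, here is a minimal pure-Python sketch of the bucketing idea. The shingle size, the band/row counts, and the seeded-MD5 hash family are illustrative assumptions, not anything the answer above prescribes; production systems typically use a tuned library rather than hand-rolled hashing.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    # Character k-shingles: the set representation we estimate similarity over.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=32):
    # For each seeded hash function, keep the minimum hash value over the set.
    # Two sets agree on a given min-hash with probability equal to their
    # Jaccard similarity, so the signature is a compact similarity sketch.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def lsh_buckets(records, bands=8, rows=4):
    # Split each signature into bands; records sharing any whole band land in
    # the same bucket. Only bucket-mates are ever compared, which is what
    # reduces the O(n^2) all-pairs comparison to near-linear work.
    buckets = defaultdict(list)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(rec_id)
    return buckets

def candidate_pairs(records, bands=8, rows=4):
    # Emit each within-bucket pair once; these are the only pairs a
    # downstream exact or fuzzy comparison needs to examine.
    pairs = set()
    for ids in lsh_buckets(records, bands, rows).values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

With 8 bands of 4 rows, records with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^4)^8, so near-duplicates are very likely to become candidates while dissimilar records almost never are.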