How Do You Handle Duplicates?
Probabilistic Dedup at Scale

At the intermediate level, dedup is ROW_NUMBER over a table that fits in memory. The next question the interviewer asks: 'How do you deduplicate 10 billion events per day when holding all keys in memory is impossible?' This is where you need to talk about probabilistic data structures. The trap is saying 'just use a bigger cluster.' The senior signal is naming Bloom filters and explaining the false-positive tradeoff.

A Bloom filter answers one question: 'Have I seen this key before?' It answers 'definitely not' or 'probably yes' - it never produces a false negative. For dedup, that asymmetry is the whole point: if the Bloom filter says a key is new, it is guaranteed new. If it says duplicate, there is a small, configurable chance of a false positive, meaning a genuinely new event gets dropped. The interviewer wants to hear you explain this tradeoff and say what false-positive rate is acceptable for the use case.
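To make the tradeoff concrete, here is a minimal sketch of Bloom-filter dedup using only the Python standard library. The class name, sizing, and salted-SHA-256 hashing scheme are illustrative choices, not a specific library's API; production systems would use an optimized implementation (and typically partition or rotate filters across time windows). The sizing uses the standard formulas m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions for n expected items at false-positive rate p.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: a bit array probed by k salted hashes (sketch)."""

    def __init__(self, expected_items, fp_rate):
        # Standard sizing: m bits and k hashes for the target false-positive rate.
        self.m = max(1, int(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # All k bits set -> 'probably yes'; any bit clear -> 'definitely not'.
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(key))


def dedup(stream, expected_items, fp_rate=0.01):
    """Yield keys the filter says are definitely new.

    A false positive drops a genuinely new event - that is the configured risk.
    """
    bf = BloomFilter(expected_items, fp_rate)
    for key in stream:
        if not bf.might_contain(key):  # 'definitely not seen' -> guaranteed new
            bf.add(key)
            yield key
```

At 10 billion keys per day and a 1% false-positive rate, the same formulas give roughly 12 GB of bits - tight but feasible on one machine, which is exactly the kind of back-of-envelope sizing the interviewer is probing for.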