Loading section...
When Counter Fails: Scale, Cardinality, and Streaming
Concepts: pyCounterLimits, pyHyperLogLog, pyApproxCounting
Counter is a hashtable. Hashtables need to hold every distinct key in memory. For a dataset with 100 distinct error codes, Counter is perfect. For a dataset with 10 billion distinct IP addresses, Counter will exhaust your machine's RAM before you finish reading the first partition. Knowing when Counter fails, and what to use instead, is the move that separates data engineers who can talk about algorithms from data engineers who have actually run those algorithms on real data. This is the question that comes at the end of the top-K problem: 'What if the data does not fit in memory?' The Failure Modes of Counter The Alternative: Approximate Counting with HyperLogLog HyperLogLog estimates the count of distinct elements (cardinality) using O(log log n) memory — that is kilobytes for datasets w