Loading section...
K-Way Merge: Scaling Two Pointers
Concepts: pyKWayMerge, pyHeapMerge
Two-pointer merge works for two sorted inputs. What about K sorted inputs? This is the K-way merge, and it is one of the most practically important algorithms in data engineering. Merging K sorted Kafka partitions. Combining K sorted output files from a Spark stage. The external sort merge step. All of these are K-way merge. The naive approach: merge two at a time, K-1 times. Each merge is O(n), and you do K-1 of them, giving O(n*K). The optimal approach: use a min-heap with K entries, one per input. Pop the smallest, advance that input's pointer, push the new value. Each pop and push is O(log K), and you do N total operations (N = total elements across all inputs). Time: O(N log K). This is dramatically better when K is large. The heap maintains exactly K elements at all times (one per in