Loading section...

K-Way Merge: The Core of External Sort

Concepts: pyKWayMerge, pyExternalSort, pyMergeIterators

K-way merge is the algorithm that makes Spark shuffle work, that powers the merge phase of MapReduce, and that you implement every time you merge sorted partitions in a data lake. The problem: given K sorted lists (or sorted file chunks), merge them into one sorted output without loading everything into memory. A heap of size K lets you do this in O(N log K) where N is the total number of elements. You only ever have K elements in the heap at once — one from each list. The K-Way Merge Pattern The trick is storing (value, list_index, element_index) tuples in the heap. When you pop the minimum value, you know exactly which list it came from (list_index) and which position to pull the next element from (element_index + 1). You push the next element from that same list. The heap always contain