
External Sort and K-Way Merge: Design at Scale

Concepts: pyExternalSort, pyKWayMergeDesign, pyMergePass
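
As a preview of the two phases this section covers, here is a minimal sketch of an external sort over a text file. It is an illustration under stated assumptions, not a reference implementation: the function names (`write_sorted_runs`, `k_way_merge`, `external_sort`) and the `chunk_lines` parameter are hypothetical, every input line is assumed to end with a newline and to compare as a plain string, and the number of runs is assumed small enough that all of them can be open at once, so one merge pass suffices.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(input_path, chunk_lines, tmp_dir):
    """Sort phase: sort chunk_lines lines at a time in memory, one run file per chunk."""
    run_paths = []
    with open(input_path) as src:
        while True:
            chunk = list(islice(src, chunk_lines))  # at most chunk_lines lines in memory
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    return run_paths


def k_way_merge(run_paths, output_path):
    """Merge phase: a min-heap holds the current head line of each run."""
    files = [open(p) for p in run_paths]
    try:
        heap = []
        for idx, f in enumerate(files):
            line = f.readline()
            if line:
                heap.append((line, idx))  # idx breaks ties and names the source run
        heapq.heapify(heap)
        with open(output_path, "w") as out:
            while heap:
                line, idx = heap[0]  # smallest head across all runs
                out.write(line)
                nxt = files[idx].readline()
                if nxt:
                    # Replace the emitted line with the next line from the same run.
                    heapq.heapreplace(heap, (nxt, idx))
                else:
                    heapq.heappop(heap)  # this run is exhausted
    finally:
        for f in files:
            f.close()


def external_sort(input_path, output_path, chunk_lines=100_000):
    """Run both phases; assumes all runs can be merged in a single pass."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs = write_sorted_runs(input_path, chunk_lines, tmp_dir)
        k_way_merge(runs, output_path)
```

The heap never holds more than one line per run, so merge-phase memory grows with the number of runs K rather than with the file size. The standard library's `heapq.merge` does the same job; the explicit heap is spelled out here only to make the mechanism visible.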

External sort is one of the most important algorithms in data engineering, yet few engineers can fully explain it on demand. It is how ORDER BY works in Postgres when the sort spills to disk. It is how Spark merges spilled shuffle data. It is how Hadoop merges sorted partitions ahead of the reduce step. The core is a two-phase algorithm: a sort phase that produces sorted chunks (runs), and a merge phase that performs a K-way merge over all of them. The heap is the central data structure of the merge phase. At the staff level, you are expected to reason about chunk sizing, I/O buffer management, and the number of merge passes.

Phase 1: Sort Phase (Producing Sorted Runs)

Phase 2: Merge Phase (K-Way Heap Merge)

Design Considerations: Chunk Size and Merge Passes

The magic number for a single-pass external sort is: if the file fits in chunk_size * K memory where K