
© 2026 DataDriven


Heap & Top-K: Intermediate

One heap is a queue. Two heaps is a median machine.

Category: Python
Difficulty: Intermediate
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: K-Way Merge: The Core of External Sort, Sliding Window Median: Two Heaps, Task Scheduler: Greedy + Heap, Dijkstra's Shortest Path: Heaps Power Graph Algorithms, Custom Comparators: Heap with Complex Objects

Lesson Sections

  1. K-Way Merge: The Core of External Sort (concepts: pyKWayMerge, pyExternalSort, pyMergeIterators)

    K-way merge is the algorithm that makes Spark shuffle work, that powers the merge phase of MapReduce, and that you implement every time you merge sorted partitions in a data lake. The problem: given K sorted lists (or sorted file chunks), merge them into one sorted output without loading everything into memory. A heap of size K lets you do this in O(N log K), where N is the total number of elements. You only ever hold K elements in the heap at once, one from each list.
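    A minimal sketch of the pattern using Python's heapq, assuming in-memory lists stand in for sorted file chunks (the function name and example data are illustrative):

```python
import heapq

def k_way_merge(sorted_lists):
    """Merge K sorted lists into one sorted list in O(N log K)."""
    heap = []
    # Seed the heap with the first element of each non-empty list.
    # Tuples are (value, list_index, element_index); list_index breaks ties.
    for i, lst in enumerate(sorted_lists):
        if lst:
            heapq.heappush(heap, (lst[0], i, 0))
    result = []
    while heap:
        value, i, j = heapq.heappop(heap)
        result.append(value)
        # Replace the popped element with the next one from the same list,
        # so the heap never holds more than K entries.
        if j + 1 < len(sorted_lists[i]):
            heapq.heappush(heap, (sorted_lists[i][j + 1], i, j + 1))
    return result

print(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

    For real file chunks you would push `(value, chunk_id, iterator)` pairs and pull from iterators instead of indexing; the heap discipline is the same.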

  2. Sliding Window Median: Two Heaps (concepts: pyTwoHeapMedian, pySlidingWindowMedian, pyLazyDeletion)

    The two-heap median pattern is the one that gets used as a staff-level filter. If you are interviewing for an L5+ data engineer or staff role, this problem or a variant of it is likely to appear. The pattern: maintain two heaps, a max-heap for the lower half of the numbers seen and a min-heap for the upper half. The median is always derivable from the roots of these two heaps. It is simple in concept and full of edge cases in implementation, which is exactly why interviewers love it.
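    A minimal streaming-median sketch of the two-heap idea (class and method names are illustrative; Python's heapq is a min-heap, so the lower half is stored negated):

```python
import heapq

class MedianFinder:
    def __init__(self):
        self.lo = []  # max-heap via negation: lower half of numbers
        self.hi = []  # min-heap: upper half of numbers

    def add(self, num):
        # Push onto the lower half, move its max to the upper half,
        # then rebalance so len(lo) is always len(hi) or len(hi) + 1.
        heapq.heappush(self.lo, -num)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return float(-self.lo[0])          # odd count: root of lo
        return (-self.lo[0] + self.hi[0]) / 2  # even count: mean of roots
```

    The sliding-window variant adds removal, typically via lazy deletion: keep a count of evicted values and discard stale heap roots on pop instead of searching the heap.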

  3. Task Scheduler: Greedy + Heap (concepts: pyTaskScheduler, pyGreedyHeap, pyPipelineScheduling)

    LeetCode 621: Task Scheduler. Given a list of CPU tasks with a cooldown period n, find the minimum time to execute all tasks. The greedy insight: always execute the most frequent remaining task to minimize idle time. A max-heap of frequencies implements this greedy choice. This is not just a LeetCode problem: pipeline job schedulers, Airflow task prioritization, and Kafka consumer group coordination all use variants of this greedy heap scheduling.
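    A sketch of the greedy heap simulation, assuming the LeetCode 621 interface (counts are negated because heapq is a min-heap; a deque holds tasks waiting out their cooldown):

```python
import heapq
from collections import Counter, deque

def least_interval(tasks, n):
    """Minimum ticks to run all tasks with >= n ticks between repeats."""
    heap = [-c for c in Counter(tasks).values()]  # max-heap of remaining counts
    heapq.heapify(heap)
    cooling = deque()  # (ready_tick, remaining_count), in ready order
    time = 0
    while heap or cooling:
        time += 1
        if heap:
            # Greedy choice: run one unit of the most frequent task.
            count = heapq.heappop(heap) + 1  # counts are negative
            if count < 0:
                cooling.append((time + n, count))
        # else: idle tick while everything is cooling down
        if cooling and cooling[0][0] == time:
            heapq.heappush(heap, cooling.popleft()[1])
    return time

print(least_interval(["A", "A", "A", "B", "B", "B"], 2))  # 8, e.g. A B _ A B _ A B
```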

  4. Dijkstra's Shortest Path: Heaps Power Graph Algorithms (concepts: pyDijkstra, pyGraphHeap, pyPipelineDAG)

    Dijkstra's algorithm is the most important graph algorithm that uses a heap, and it comes up in DE interviews more than most people expect. Not because you are expected to build a routing engine, but because DE systems have dependency graphs: pipeline DAGs, query plans, data lineage graphs. Finding the shortest path through a dependency graph (e.g., minimum-latency data flow path, optimal query plan route) is a real problem. Dijkstra uses a min-heap to always process the cheapest next step.
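    A minimal Dijkstra sketch with lazy deletion of stale heap entries (the adjacency-dict shape and node names are illustrative):

```python
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, weight), ...]} with non-negative weights.
    Returns shortest distances from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]  # (distance, node): heap always pops cheapest next
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry superseded by a shorter path: skip it
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Hypothetical pipeline DAG with edge weights as stage latencies:
g = {"ingest": [("clean", 2), ("raw", 5)], "clean": [("raw", 1)], "raw": []}
print(dijkstra(g, "ingest"))  # {'ingest': 0, 'clean': 2, 'raw': 3}
```

    The skip-if-stale check is the same lazy-deletion trick used in the sliding-window median: never remove from the middle of a heap, just ignore outdated roots.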

  5. Custom Comparators: Heap with Complex Objects (concepts: pyCustomComparator, pyCmpToKey, pyDataclassLt)

    Default heapq works perfectly with numbers and strings. The moment you want to heap-sort custom objects, such as a Pipeline object, a QueryResult, or an Employee, you need to tell Python how to compare them. There are three approaches: implement __lt__ on the dataclass, use functools.cmp_to_key for complex comparison logic, or wrap everything in (priority_value, counter, object) tuples. Each has tradeoffs. Knowing all three, and knowing which to reach for in an interview, is what separates thorough candidates from the rest.
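    A compact sketch of all three approaches (the Job class and payload dicts are illustrative; note that cmp_to_key's wrapper exposing the original value via .obj is a CPython detail):

```python
import heapq
import itertools
from dataclasses import dataclass, field
from functools import cmp_to_key

# Approach 1: comparison on the dataclass itself.
# order=True generates __lt__ from fields in declaration order;
# compare=False keeps non-priority fields out of the comparison.
@dataclass(order=True)
class Job:
    priority: int
    name: str = field(compare=False)

jobs = [Job(3, "backfill"), Job(1, "ingest"), Job(2, "transform")]
heapq.heapify(jobs)
print(heapq.heappop(jobs).name)  # ingest

# Approach 2: (priority, counter, object) tuples.
# The monotonic counter breaks priority ties, so the payload objects
# themselves are never compared (and need no __lt__).
tie = itertools.count()
heap = []
for prio, payload in [(2, {"name": "transform"}), (1, {"name": "ingest"})]:
    heapq.heappush(heap, (prio, next(tie), payload))
print(heapq.heappop(heap)[2]["name"])  # ingest

# Approach 3: functools.cmp_to_key for complex comparison logic.
# Wrap each element in a key object before pushing; the wrapped value
# is reachable on the key's .obj attribute.
def by_priority(a, b):
    return a["priority"] - b["priority"]

key = cmp_to_key(by_priority)
heap2 = [key({"priority": 3}), key({"priority": 1}), key({"priority": 2})]
heapq.heapify(heap2)
print(heapq.heappop(heap2).obj["priority"])  # 1
```

    The tuple form is usually the fastest to write in an interview; the dataclass form reads best in production code.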

Related

  • All Lessons
  • Practice Problems
  • Mock Interview Practice
  • Daily Challenges