# The Median Keeper

> The middle value keeps moving as new data arrives.

Canonical URL: <https://datadriven.io/problems/the_median_keeper>

Domain: Python · Difficulty: hard · Seniority: L5

## Problem

Given a stream of integers, return a list of the running medians (average of two middles for even count). Use two heaps (max-heap + min-heap) for O(log n) insertion.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a classic **two-heap problem** that tests data structure selection for streaming computations. It reveals whether candidates understand that maintaining sorted order is too expensive for streaming, and that two heaps can provide O(log n) insertion with O(1) median retrieval.

> **Trick to Solving**
>
> Split the stream into two halves using a max-heap (for the lower half) and a min-heap (for the upper half). The median is always at the top of one or both heaps. Python only has a min-heap, so negate values for the max-heap.

---

### Break down the requirements

#### Step 1: Maintain two heaps: max-heap for lower half, min-heap for upper half

The max-heap holds elements less than or equal to the median. The min-heap holds elements greater. Python's `heapq` is a min-heap, so negate values for the max-heap.

#### Step 2: Insert each value into the correct heap

Push to the max-heap first, then rebalance by moving the max-heap's top to the min-heap if needed.

#### Step 3: Rebalance after each insertion

The heaps should differ in size by at most 1. If the min-heap grows larger, move its top back to the max-heap.

#### Step 4: Compute the median after each insertion

If sizes are equal, average the two tops. If unequal, the median is the top of the larger heap.

---

### The solution

**Two-heap streaming median tracker**

```python
import heapq
def running_medians(stream):
    lo = []
    hi = []
    result = []
    for num in stream:
        heapq.heappush(lo, -num)
        heapq.heappush(hi, -heapq.heappop(lo))
        if len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        if len(lo) > len(hi):
            median = -lo[0]
        elif len(lo) == len(hi):
            median = (-lo[0] + hi[0]) / 2
        result.append(median)
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n log n) total. Each insertion involves up to 3 heap operations, each O(log n).
> 
> **Space:** O(n) for the two heaps combined.

> **Interviewers Watch For**
>
> Can you explain why negation simulates a max-heap? Python only provides `heapq` (min-heap), so storing `-x` means the smallest negated value corresponds to the largest original value.

> **Common Pitfall**
>
> Pushing to the wrong heap first. The algorithm depends on always pushing to `lo` first, then moving to `hi`, then rebalancing. Changing this order breaks the invariant.

---

## Common follow-up questions

- What if you needed to support removing elements from the stream? _(Tests lazy deletion with a hash map tracking removed elements.)_
- How would you handle this if the stream had billions of elements? _(Tests approximate median algorithms like t-digest or reservoir sampling.)_
- What if medians needed to be computed over a sliding window? _(Tests combining the two-heap approach with window expiration logic.)_
- Why not use a balanced BST instead of two heaps? _(Tests trade-off analysis: BSTs give O(log n) median too but with higher constant factors in Python.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_median_keeper)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.