# The Stream Joiner

> Events don't wait for each other. This does.

Canonical URL: <https://datadriven.io/problems/the_stream_joiner>

Domain: Python · Difficulty: hard · Seniority: L5

## Problem

Given two sorted lists of event dicts (each with 'timestamp' and 'value') and a numeric tolerance, for each event in stream_a find the nearest event in stream_b whose |timestamp_a - timestamp_b| <= tolerance. If such a match exists, output {'a_value', 'b_value', 'gap'}. Events without a match are skipped. Return the output list in input-order of stream_a. If multiple stream_b events tie for nearest, choose the earliest.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **two-pointer technique on sorted data** combined with **tolerance-based matching**, a pattern used in time-series correlation, sensor fusion, and event stream joins. It probes whether a candidate can efficiently pair events across two sorted streams.

> **Trick to Solving**
>
> Since both streams are sorted by timestamp, use a pointer for stream B that advances as you process events from stream A. For each event in A, scan B from the current pointer position to find the closest match within the tolerance window.

---

### Break down the requirements

#### Step 1: Iterate through stream A events

For each event in A, find the best match in B within the allowed time difference.

#### Step 2: Use a pointer into B to avoid rescanning

Since both streams are sorted, once you move past an element in B that is too early for A[i], it is also too early for all subsequent A events.

#### Step 3: Track the closest match within the tolerance

Among all B events within the window, pick the one with the smallest absolute time difference.

#### Step 4: Return matched pairs

Collect pairs of matched events. Unmatched A events can be skipped or marked.

---

### The solution

**Two-pointer scan with tolerance window**

```python
def join_streams(stream_a: list, stream_b: list, tolerance: float) -> list:
    matches = []
    b_start = 0
    for a_event in stream_a:
        a_time = a_event["timestamp"]
        best_match = None
        best_diff = float("inf")
        for j in range(b_start, len(stream_b)):
            b_time = stream_b[j]["timestamp"]
            diff = abs(a_time - b_time)
            if b_time < a_time - tolerance:
                b_start = j + 1
                continue
            if b_time > a_time + tolerance:
                break
            if diff < best_diff:
                best_diff = diff
                best_match = stream_b[j]
        if best_match is not None:
            matches.append((a_event, best_match))
    return matches
```

> **Time and Space Complexity**
>
> **Time:** O(a + b) amortized, where a and b are the lengths of the two streams. The `b_start` pointer only advances forward, so each element in B is visited at most twice.
> 
> **Space:** O(min(a, b)) for the matches list.

> **Interviewers Watch For**
>
> The advancing `b_start` pointer is the key optimization. Without it, the algorithm degrades to O(a * b). Strong candidates explain why this works with sorted data.

> **Common Pitfall**
>
> Allowing a B event to match multiple A events when it should only match the closest one. Depending on requirements, you may need to track which B events are already consumed.

---

## Common follow-up questions

- What if one B event should only be matched to one A event? _(Tests greedy vs. optimal matching and potentially Hungarian algorithm for global optimum.)_
- How would you handle streams arriving out of order? _(Tests buffering and re-sorting within a watermark window.)_
- What if tolerance varies per event type? _(Tests parameterizing the window lookup per event.)_
- How would this work in a distributed system with multiple stream partitions? _(Tests partitioning strategy and cross-partition join semantics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_stream_joiner)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.