# Between the Pings

> The fleet never stops talking. Find where each vehicle paused.

Canonical URL: <https://datadriven.io/problems/robotaxi-ping-sessions>

Domain: Python · Difficulty: medium · Seniority: mid

## Problem

A robotaxi fleet streams location pings onto a message bus, and you're handed one batch of them in no particular order. For each vehicle, a session is a run of its activity where consecutive pings are no more than `gap` seconds apart; report how many sessions each vehicle recorded.

## Worked solution and explanation

### Why this problem exists in real interviews

Strip off the fleet-telemetry costume and this is per-key sessionization: bucket an interleaved event stream by vehicle, put each vehicle's events in time order, and count the runs separated by a gap larger than the threshold. Anyone can write the dict-of-lists. The knife that separates candidates is remembering the batch is unordered. The pings come off the bus interleaved and out of sequence, so if you scan them as they arrive you invent gaps that were never there and split one session into three. Sort each bucket first, and the gap test becomes trivial.

---

### Break down the requirements

#### Step 1: Separate the stream into one timeline per vehicle

One pass over the batch, appending each ping's ts to a list keyed by its vehicle_id. setdefault(vehicle_id, []).append(ts) keeps it to a single line and creates the bucket on first sight. You are pulling the interleaved stream apart into one independent timeline per vehicle.

#### Step 2: Sort each timeline before you talk about gaps

The batch is unordered, so before the word 'consecutive' means anything you must sort each bucket ascending. Skip this and every difference you compute is against an arbitrary neighbor in insertion order, not the true previous event in time, and the counts are quietly wrong.

#### Step 3: Count gap-separated runs

Walk the sorted timestamps; each time the difference from the previous one exceeds gap, a new session begins. Sessions equal 1 plus the number of such boundaries, so a vehicle with a single ping is exactly one session and a loop starting at index 1 handles it for free. A difference of exactly gap stays in the same session, because the rule is 'no more than gap'.

---

### The solution

**Bucket by vehicle, sort, scan the gaps**

```python
def count_sessions(pings, gap):
    by_vehicle = {}
    for p in pings:
        by_vehicle.setdefault(p["vehicle_id"], []).append(p["ts"])
    sessions = {}
    for vehicle_id, times in by_vehicle.items():
        times.sort()
        count = 1
        for i in range(1, len(times)):
            if times[i] - times[i - 1] > gap:
                count += 1
        sessions[vehicle_id] = count
    return sessions
```

> **Complexity**
>
> Time is O(n log n): one O(n) pass to bucket, then sorting the buckets, whose per-vehicle costs sum to no more than sorting all n timestamps once. Space is O(n) for the buckets plus O(v) for the result, where v is the number of vehicles. A fleet's daily ping batch is millions of rows; the sort dominates and is unavoidable, because a gap test is inherently ordinal.

> **Interviewers Watch For**
>
> Whether you sort before scanning without being told the data is unordered. Strong candidates say out loud 'these come off a bus, I cannot assume time order' and sort defensively. They also pin the boundary: does a gap of exactly `gap` seconds split a session or not? Naming that ambiguity, then choosing a rule, is a seniority tell.

> **Common Pitfall**
>
> Scanning the raw batch as if it were already sorted, or already contiguous per vehicle. On real interleaved input that fabricates gaps and over-counts sessions. The runner-up mistake is deduping timestamps with a set: it silently discards legitimate repeated pings and, worse, throws away the ordering the entire gap test depends on.

---

## Common follow-up questions

- The daily batch no longer fits in memory. How do you count sessions when you can only stream each vehicle's pings once? _(Tests external sort or a merge over pre-sorted per-vehicle shards, and whether the candidate can keep only running state per vehicle.)_
- Instead of a count, return the start and end timestamp of each vehicle's longest session. _(Tests tracking run boundaries during the scan rather than merely counting them.)_
- Two different vehicles occasionally emit a ping with the same timestamp. Does anything in your approach change? _(Tests that bucketing by key isolates vehicles, so timestamps shared across vehicles never interact.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/robotaxi-ping-sessions)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.