# Between the Clicks

> Clicks arrive scrambled. Find where one visit ends and the next begins.

Canonical URL: <https://datadriven.io/problems/between-the-clicks>

Domain: Python · Difficulty: medium · Seniority: mid

## Problem

Our ad-click pipeline ingests events under at-least-once delivery, so a single user's clicks can arrive out of order. Each event has a `user` and a `ts` in seconds; reconstruct each user's sessions, where a session is a stretch of that user's activity with no more than `gap` seconds between one click and the next. For each session, report the user, its first and last timestamp, and how many clicks fell inside it.

## Worked solution and explanation

### Why this problem exists in real interviews

This is sessionization wearing an ad-tech costume. The skill being probed is whether you can carve one key's time-ordered events into sessions by an inter-event gap. Anyone can bucket the events by user. The thing that separates candidates is noticing the word 'out of order': at-least-once delivery means arrival order is NOT time order, so before you can reason about gaps at all you have to restore each user's timeline. Skip that and a click that showed up late lands next to the wrong neighbor, so it reads as a brand-new session or blows a hole in an existing one. Your session count comes out wrong and so does every boundary.

> **Trick to solving**
>
> The whole thing collapses to one global sort. Sort the events by (user, ts) so every user's clicks sit together in ascending time, then make a single forward pass. Because the events are now both grouped and time-ordered, the only thing you ever compare is the gap from the current click to the open session's last click: within gap and same user, extend it; otherwise open a new session.

---

### Break down the requirements

#### Step 1: Sort once, by user then time

sorted(events, key=lambda e: (e['user'], e['ts'])) does both jobs the problem hinges on in a single step: the user component co-locates each person's clicks, and the ts component restores time order inside each user's run. This is the line a weaker answer skips when it just counts per user, throwing away the ordering the rest of the problem lives on.

#### Step 2: Walk the ordered stream, extending or splitting

Keep a running list of session records. For each event, look at the last record: if it belongs to the SAME user and this timestamp is within gap of that record's end, fold the click in (push end forward, bump clicks). Otherwise append a fresh session starting here. The boundary is inclusive: a difference exactly equal to gap stays in the same session.

#### Step 3: Why there is no final flush

Because you mutate the open session in place as each click arrives, the record is already complete the moment its last click is folded in. There is no dangling session to append after the loop, and the user-equality guard in the comparison is what stops one user's trailing click from fusing with the next user's first click when their raw timestamps happen to sit close together.

---

### The solution

**One global sort, then a single gap scan**

```python
def sessionize(events, gap):
    ordered = sorted(events, key=lambda e: (e["user"], e["ts"]))
    sessions = []
    for e in ordered:
        user, ts = e["user"], e["ts"]
        current = sessions[-1] if sessions else None
        if current and current["user"] == user and ts - current["end"] <= gap:
            current["end"] = ts
            current["clicks"] += 1
        else:
            sessions.append({"user": user, "start": ts, "end": ts, "clicks": 1})
    return sessions
```

> **Complexity**
>
> The single sort dominates at O(n log n); the forward scan is one O(n) pass and space is O(n) for the output records. Sorting by (user, ts) once is exactly as cheap as grouping into a dict and sorting each bucket, because the per-user sort costs sum to the same n log n, but it is one line and one loop instead of two. At ad-click scale this is the per-partition step of a Spark/Beam job after the stream is keyed by user, and the same sort-then-scan runs inside each reducer.

**Walk in arrival order (wrong)**

Bucket by user and count, or walk events as they were delivered. A late click at ts 5000 that arrives first, before 1000 and 1500, makes the very first comparison a 3500s jump and splits u1's single early visit incorrectly. The time dimension is gone or scrambled.

**Sort by (user, ts), then scan (right)**

Sorting puts u1 in order 1000, 1500, 5000. Now 1500-1000=500 keeps them together and 5000-1500=3500 opens a new session. Two sessions, correct boundaries, regardless of the order the events were delivered in.

> **Interviewers watch for**
>
> The candidate who says out loud 'arrival order is not event-time order, so I have to sort before I can talk about gaps.' That single sentence is the seniority tell here. The other one is treating the gap boundary deliberately: stating whether a difference exactly equal to gap stays in the session, rather than guessing.

> **Common pitfall**
>
> Dropping the user-equality check in the comparison. Once events are globally sorted, the last session of one user sits right before the first click of the next user. If you only test the gap and not whether the user still matches, two different users whose timestamps happen to fall within gap get fused into one session. The 'current["user"] == user' guard is what keeps each user's run sealed off.

---

## Common follow-up questions

- The stream never ends and events keep arriving late. How do you decide a session is finally closed and safe to emit? _(Pushes toward watermarks and allowed-lateness: hold a session open until the watermark passes its last timestamp plus gap, then emit. Tests real streaming late-data handling.)_
- Duplicates from at-least-once delivery mean the same click can appear twice. How would you make clicks count each physical click once? _(Tests idempotency: dedup on a click_id (or user plus ts) before counting, and the trade-off between exact dedup and approximate structures like a Bloom filter at scale.)_
- A billion clicks a day will not fit in one process. How does this run distributed? _(Tests that the work keys by user, so each user's events land on one reducer and the exact same sort-then-scan runs per partition; only the shuffle is added.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/between-the-clicks)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.