# Fastest Page View to Click

> How fast from view to click?

Canonical URL: <https://datadriven.io/problems/fastest_page_view_to_click>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Events are logged with event_type values of 'page_view' and 'button_click'. Find the smallest time gap between any page view and the next button click in the global event stream. Show the user who viewed the page, the page view time, the click time, and the gap in seconds. Return only the single fastest transition.

## Worked solution and explanation

### The mental model

Behavioral analytics teams ask this to see whether you instinctively reach for window functions over self-joins. Anyone can write `event_data a JOIN event_data b ON b.ts > a.ts`, but it's O(N^2) and never finishes on 400M rows. The signal is reaching for `LEAD`, and whether you ask 'per user or global stream?' before writing, because the prompt is ambiguous.

---

### The three traps

#### Step 1: Global vs per-user ordering

The expected query writes `LEAD(...) OVER (ORDER BY event_timestamp)` with no `PARTITION BY`. That orders events globally across every user in the table. A `page_view` by Alice at 12:00:00.100 paired with a `button_click` by Bob at 12:00:00.103 will register as a 3 ms transition, even though Bob never saw Alice's page. The prompt's phrase *global event stream* takes this literally, but no real product team wants cross-user attribution. Flag this out loud in the interview before you write the query. The per-user fix is one token: `LEAD(...) OVER (PARTITION BY user_id ORDER BY event_timestamp)`.

#### Step 2: Pre-filter event types before the LEAD

Without `WHERE event_type IN ('page_view', 'button_click')` in the inner CTE, `LEAD` hands you whatever next row exists (a `scroll`, a `route_navigation`). Your outer filter then drops nearly every pair, including real view-to-click pairs separated by an unrelated event. Filtering inside the CTE makes "next event" mean "next view or click", which is what you actually want. Skip this step and you need a window frame or correlated subquery, and the query slows hard.

#### Step 3: Convert JULIANDAY differences to seconds

`(JULIANDAY(t2) - JULIANDAY(t1)) * 86400` is the SQLite idiom for elapsed seconds between two timestamps. Drop the `* 86400` and you get a fraction of a day, so a 2-second gap reads as `0.0000231`. The other failure mode is subsecond precision: if one side stores millisecond precision and the other rounds to whole seconds, JULIANDAY math can quantize the gap below 1 ms in ways that misorder near-ties. For interview purposes, just remember the `* 86400` multiplier and call out the precision assumption.

---

### The solution

**LEAD over a filtered event stream**

```sql
WITH events_ordered AS (
  SELECT user_id, event_type, event_timestamp,
    LEAD(event_type) OVER (ORDER BY event_timestamp) AS next_type,
    LEAD(event_timestamp) OVER (ORDER BY event_timestamp) AS next_ts,
    LEAD(user_id) OVER (ORDER BY event_timestamp) AS next_user
  FROM event_data
  WHERE event_type IN ('page_view', 'button_click')
),
gaps AS (
  SELECT user_id,
    event_timestamp AS page_view_time,
    next_ts AS click_time,
    next_user AS click_user,
    (JULIANDAY(next_ts) - JULIANDAY(event_timestamp)) * 86400 AS gap_seconds
  FROM events_ordered
  WHERE event_type = 'page_view' AND next_type = 'button_click'
)
SELECT user_id, page_view_time, click_time, gap_seconds
FROM gaps
ORDER BY gap_seconds ASC
LIMIT 1
```

> **Where the 400M rows hurt**
>
> `event_data` is 400M rows partitioned by `event_timestamp`. The `IN (...)` filter trims to maybe 80M, but you still sort and window globally. The `ORDER BY event_timestamp` is the bottleneck: the planner must materialize the full sort before `LEAD` fires, and partition pruning does not help for an all-time query. Production fix: `OVER (PARTITION BY user_id ORDER BY event_timestamp)` lets the engine parallelize across users instead of one giant sort.

> **Clarifying questions before you write code**
>
> Surface these before writing. **(a)** Per-user or cross-user transitions? Almost always per-user. **(b)** Within-session or across-session? A click 5 days after a `page_view` is the literal next click but is unrelated; most teams add a max-gap filter. **(c)** Multiple consecutive `page_views` before a click, do they all pair, or only the most recent? `LEAD` naturally takes the most recent: each view looks at the very next event, and the outer `WHERE` drops view-after-view pairs.

> **Do not reach for a self-join**
>
> The instinct is `event_data pv JOIN event_data bc ON bc.event_timestamp > pv.event_timestamp WHERE pv.event_type = 'page_view' AND bc.event_type = 'button_click'`, then `MIN(gap)`. This is O(N^2) in disguise: every `page_view` joins to every later `button_click` before the aggregate filters down to one row. On 80M filtered rows the intermediate Cartesian product is astronomical and the query simply never finishes. `LEAD` is O(N log N) because it sorts once and reads the immediate neighbor. The self-join version times out; the `LEAD` version finishes in minutes.

> **When you need the next event of a specific type**
>
> `LEAD` only sees the literally next row, which is why pre-filtering matters. If you cannot pre-filter (you need to keep other event types around for context), use a windowed conditional min: `MIN(CASE WHEN event_type = 'button_click' THEN event_timestamp END) OVER (PARTITION BY user_id ORDER BY event_timestamp ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)`. Heavier than `LEAD`, but it gives you the next click for this user regardless of what events sit between.

---

## Common follow-up questions

- The query orders events globally across all users. A `page_view` by Alice followed in wall-clock order by a `button_click` from Bob counts as a pair. Is that what the product team actually wants, and what would the right semantic be? _(Tests whether the candidate distinguishes literal prompt wording from sensible product semantics, and whether they raise the ambiguity before writing code.)_
- How would you modify the window to compute per-user transitions instead, and what does that change about the query plan on 400M rows? _(Tests fluency with `PARTITION BY` and awareness that per-user partitioning often unlocks parallelism and shorter sort runs.)_
- What if a `page_view` is followed by another `page_view`, then a `button_click`? Does the first `page_view` get paired with the eventual click in this query? _(Tests understanding of how `LEAD` interacts with the outer `WHERE` filter and whether the candidate sees that only the most recent `page_view` before a click survives.)_
- Why is `LEAD` here strictly faster than a self-join, both in big-O terms and in practice on 400M rows? _(Tests whether the candidate can reason about Cartesian explosion versus a single sorted pass, and connects asymptotic complexity to real query timeouts.)_
- Production now wants the median view-to-click gap, not the minimum. How does the bottom of the query change, and what changes about cost? _(Tests whether the candidate can swap the aggregation (PERCENTILE_CONT or NTILE-based approximations) and reason about whether the same `gaps` CTE is reusable.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/fastest_page_view_to_click)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.