# Session Page View Distance

> Page view distance per session.

Canonical URL: <https://datadriven.io/problems/session_page_view_distance>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

The UX research team is studying engagement depth within sessions. For each session, identify the first and last page view by view_id, then compute the distance as the last view's dur_ms minus the first view's dur_ms. Discard any session that contains only a single view. Return the average distance across all qualifying sessions. Sessions are linked to page views through user_id.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets ROW_NUMBER for sequential numbering across the `page_views` and `user_sessions` tables. You need to work with columns like `view_id`, `user_id`, and `dur_ms` to satisfy the requirements.

> **Trick to Solving**
>
> When the prompt asks for pairs or combinations within the same table, a self-join is the pattern.
> 
> 1. Alias the table twice (e.g., `a` and `b`)
> 2. Join on the shared attribute (region, user, etc.)
> 3. Add `a.id < b.id` to avoid duplicate and self-pairs

---

### Break down the requirements

#### Step 1: Structure the logic with 2 CTEs

Break the problem into named stages: `page_ranked`, `session_dist`. Each CTE isolates one transformation, making the query readable and debuggable.

#### Step 2: Self-join the table to pair rows

Join `page_views` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 3: Apply the range filter

The WHERE clause restricts rows to the target range. Applying this filter early reduces the volume flowing into downstream operations.

---

### The solution

**Row-number for session page view distance**

```sql
WITH page_ranked AS (
  SELECT pv.user_id, us.session_id, pv.dur_ms, pv.view_id,
         ROW_NUMBER() OVER (PARTITION BY us.session_id ORDER BY pv.view_id ASC) AS rn_asc,
         ROW_NUMBER() OVER (PARTITION BY us.session_id ORDER BY pv.view_id DESC) AS rn_desc,
         COUNT(*) OVER (PARTITION BY us.session_id) AS view_count
  FROM page_views pv
  INNER JOIN user_sessions us ON pv.user_id = us.user_id
),
session_dist AS (
  SELECT f.session_id,
         l.dur_ms - f.dur_ms AS distance
  FROM page_ranked f
  INNER JOIN page_ranked l ON f.session_id = l.session_id
  WHERE f.rn_asc = 1 AND l.rn_desc = 1
  AND f.view_count > 1
)
SELECT AVG(distance) AS avg_distance FROM session_dist
```

> **Cost Analysis**
>
> With ~680M rows, the window function runs on the reduced set after filtering and grouping; the join cost depends on the smaller table's cardinality; CTEs materialize intermediate results, which can be beneficial or costly depending on the engine. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you decompose the problem into named, testable stages rather than nesting everything; how you prevent duplicate pairs and ensure the join condition is correct; whether you use a subquery or self-join, and can explain the tradeoffs.

> **Common Pitfall**
>
> A self-join without an inequality condition (`a.id < b.id`) produces duplicate mirrored pairs and a self-pair row for each entry.

---

## Common follow-up questions

- The `referrer` column in `page_views` has a 35% null rate. How does your query handle rows where `referrer` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `page_views.referrer` and understands how aggregates skip NULL values.)_
- If `page_views` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `dur_ms`? _(Tests ability to identify performance hotspots related to `page_views.dur_ms` at scale.)_
- Would materializing the CTE as a temp table improve performance when joining `page_views` and `user_sessions`? _(Tests understanding of CTE materialization behavior across different database engines.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/session_page_view_distance)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.