# The Session Stitcher

> Page views without sessions are just noise.

Canonical URL: <https://datadriven.io/problems/the_session_stitcher>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Our event pipeline doesn't track sessions natively. Given a stream of page-view events with user_id and timestamp, define a session as a sequence where no two consecutive events are more than 30 minutes apart. Assign a session_id to each event, then compute the average session duration and average pages per session.

## Worked solution and explanation

### Why this problem exists in real interviews

Product analytics teams stitch raw event streams into sessions every day, and a 30-minute inactivity boundary is the canonical heuristic. Interviewers use this prompt to see whether you can chain LAG, a gap flag, and a running SUM into session ids without a self-join.

---

### Break down the requirements

#### Step 1: Compute the gap to the previous view per user

LAG(viewed_at) OVER (PARTITION BY user_id ORDER BY viewed_at) gives prev_time. PARTITION BY user_id is critical so users do not bleed into each other's sessions.

#### Step 2: Flag a new session when gap exceeds 30 minutes

(julianday(viewed_at) - julianday(prev_time)) * 24 * 60 converts the day-fraction into minutes. The flag is 1 when prev_time IS NULL (very first view) OR the minute gap is greater than 30. Use julianday in SQLite, not EXTRACT or DATEDIFF.

#### Step 3: Build session_id with running SUM of the flag

SUM(new_session) OVER (PARTITION BY user_id ORDER BY viewed_at) increments the session id every time a new session starts. Then GROUP BY (user_id, session_id) and compute MIN, MAX, and COUNT to get duration and page count.

---

### The solution

**LAG, gap flag, running SUM session id, then session-level aggregates**

```sql
WITH with_gap AS (
  SELECT user_id, viewed_at, LAG(viewed_at) OVER (PARTITION BY user_id ORDER BY viewed_at) AS prev_time
  FROM page_views
),
with_flag AS (
  SELECT user_id, viewed_at,
    CASE WHEN prev_time IS NULL OR (julianday(viewed_at) - julianday(prev_time)) * 24 * 60 > 30 THEN 1 ELSE 0 END AS new_session
  FROM with_gap
),
with_session AS (
  SELECT user_id, viewed_at,
    SUM(new_session) OVER (PARTITION BY user_id ORDER BY viewed_at) AS session_id
  FROM with_flag
),
sessions AS (
  SELECT user_id, session_id, MIN(viewed_at) AS s_start, MAX(viewed_at) AS s_end, COUNT(*) AS page_count
  FROM with_session GROUP BY user_id, session_id
)
SELECT AVG((julianday(s_end) - julianday(s_start)) * 1440) AS avg_session_duration,
       AVG(page_count) AS avg_pages_per_session
FROM sessions
```

> **Cost Analysis**
>
> page_views has 500M rows partitioned across 10M users (zipf, so a few users have huge fan-out). The two windows both need (user_id, viewed_at) order; a covering index on (user_id, viewed_at) lets the planner stream rows pre-sorted. Memory is bounded per partition because the windows are stateful only within one user.

> **Interviewers Watch For**
>
> Did you PARTITION BY user_id on both windows, use julianday math (not strftime subtraction), and convert to minutes correctly (multiply by 1440 or by 24 * 60)? Candidates who write julianday(...) - julianday(...) and forget the conversion produce sessions measured in days.

> **Common Pitfall**
>
> Forgetting PARTITION BY on the running SUM means the session id increments globally; a quiet user followed by a chatty one would inherit the chatty user's session count. The PARTITION BY user_id on the SUM window is what keeps sessions per-user.

---

## Common follow-up questions

- How would you make the gap threshold configurable? _(Replace the literal 30 with a parameter. In productionized analytics, 30 minutes is convention but mobile vs desktop often use different thresholds.)_
- What if you needed session_id to be globally unique, not per-user? _(Concatenate user_id || '-' || session_id in the final SELECT, or use ROW_NUMBER over the distinct (user_id, session_id) tuples.)_
- How would you make this incremental for a streaming pipeline? _(Persist the last seen viewed_at per user; on new events, compare to that timestamp to decide whether to extend the open session or start a new one. The batch SQL is the offline equivalent.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_session_stitcher)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.