# Content Session Counts

> Session metrics, content item by item.

Canonical URL: <https://datadriven.io/problems/content_session_counts>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The content team wants to understand which pages drive the most session engagement. For each page URL, count how many unique sessions viewed it (deduplicated per page). Show all pages with their unique session count, sorted from most viewed to least.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests JOIN with COUNT(DISTINCT) for deduplication. The challenge is associating page views with sessions and counting unique sessions per page, which requires understanding the relationship between views, users, and sessions.

---

### Break down the requirements

#### Step 1: Join page views to sessions

Join `page_views` to `user_sessions` on `user_id` to associate each view with a session.

#### Step 2: Deduplicate per page

`COUNT(DISTINCT session_id)` per `page_url` gives unique sessions, not total views.

#### Step 3: Sort by session count

`ORDER BY unique_sessions DESC` surfaces the most-viewed pages first.

---

### The solution

**Session-level deduplication per page**

```sql
SELECT
    pv.page_url,
    COUNT(DISTINCT us.session_id) AS unique_sessions
FROM page_views pv
JOIN user_sessions us ON pv.user_id = us.user_id
GROUP BY pv.page_url
ORDER BY unique_sessions DESC
```

> **Cost Analysis**
>
> Hash join of 400M page views to 80M sessions on user_id. The COUNT(DISTINCT) requires maintaining hash sets per page_url group. With high-cardinality pages, this can be memory-intensive.

> **Interviewers Watch For**
>
> Whether the join condition correctly maps views to sessions. A simple user_id join may over-count if a user has multiple sessions. Ideally, you would also filter by time overlap, but the schema may not support it.

> **Common Pitfall**
>
> Joining only on user_id without considering session timing means a page view could be attributed to any of the user's sessions. If the schema lacks a session_id on page_views, this is the best available approximation.

---

## Common follow-up questions

- What if page_views had a session_id column? _(Direct join on session_id would be more accurate and eliminate the user_id ambiguity.)_
- How would you also show the average time spent per page? _(Add AVG(dur_ms) to the SELECT.)_
- What if you needed sessionized page views based on 30-minute inactivity gaps? _(Tests session window computation using LAG and cumulative sums.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/content_session_counts)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.