# High Engagement Pages

> Some pages hold attention longer than others.

Canonical URL: <https://datadriven.io/problems/high_engagement_pages>

Domain: SQL · Difficulty: hard · Seniority: L3

## Problem

Each page view tracks duration in milliseconds. For each page URL, convert the duration to seconds and sum the total viewing time across all views from users who have session records. Only return pages with more than 5 seconds of total viewing time. Return the page URL and total viewing time in seconds.

## Worked solution and explanation

### Why this problem exists in real interviews

This challenge targets grouped aggregation against `page_views`, `user_sessions`. Getting the grouping wrong on `page_url`, `referrer`, `dur_ms` produces silently incorrect counts, which is exactly the trap interviewers set.

---

### Break down the requirements

#### Step 1: Join on `user_id`

`JOIN user_sessions ON page_views.user_id = user_sessions.user_id` combines data from both tables.

#### Step 2: Aggregate per `session_id`

`GROUP BY b.session_id` with `SUM(dur_ms)` produces the desired metric.

---

### The solution

**Group-aggregate for high engagement pages**

```sql
SELECT b.session_id, SUM(a.dur_ms) AS result
FROM page_views a
JOIN user_sessions b ON a.user_id = b.user_id
GROUP BY b.session_id
ORDER BY result DESC
```

> **Cost Analysis**
>
> The join between `page_views` and `user_sessions` is the dominant cost. An index on `user_sessions(user_id)` makes the lookup efficient.

> **Interviewers Watch For**
>
> The interviewer checks that you join on `user_id` and aggregate to the correct grain.

> **Common Pitfall**
>
> A many-to-many join can inflate row counts and skew aggregates. Verify the join produces the expected cardinality.

---

## Common follow-up questions

- The `referrer` column in `page_views` has roughly 35% NULLs. How does your query handle those rows, and would the result change if NULLs were replaced with zeros? _(Tests whether the candidate understands how NULLs propagate through aggregation functions and whether their WHERE/JOIN conditions implicitly filter them out.)_
- If `page_views` and `user_sessions` have a one-to-many relationship, how does that affect the COUNT in your GROUP BY? _(Tests understanding of fan-out: joining before grouping can inflate counts.)_
- `view_id` in `page_views` has ~800M distinct values. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate can design indexes for high-cardinality columns and understands selectivity.)_
- Could you express this same logic as a single query without CTEs or subqueries? What readability trade-off does that introduce? _(Tests whether the candidate can flatten nested logic and understands when decomposition aids maintainability.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/high_engagement_pages)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.