# Top Content by Watch Time

> Some content holds attention. Others get skipped.

Canonical URL: <https://datadriven.io/problems/top_content_by_watch_time>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Only consider views where watch duration is recorded. Rank all content by total watch time across all viewers and return the top 3 tiers. If multiple items share the same total, include all of them. Show the content ID, title, and total watch time.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes ranking-function fluency under explicit tie semantics. 'Top 3 tiers including ties' rules out `LIMIT 3` and `ROW_NUMBER`, and the interviewer wants you to say so before writing. Picking `DENSE_RANK` over `RANK` matters too: regular `RANK` skips numbers after ties, so `rnk <= 3` returns a different set than DENSE_RANK.

---

### Break down the requirements

#### Step 1: Filter null durations

Add `WHERE cv.watch_seconds IS NOT NULL`. SQLite skips nulls in `SUM`, but stating the filter explicitly tells the interviewer you noticed `watch_seconds` is nullable and that you do not want phantom views inflating `view_count`.

#### Step 2: Aggregate per content

First CTE: `INNER JOIN content_items` to `content_views` on `content_id`, then `GROUP BY ci.content_id, ci.title` and `SUM(cv.watch_seconds)`. Carry `title` in the group key so you can return it without a second join.

#### Step 3: Rank with ties preserved

Second CTE: `DENSE_RANK() OVER (ORDER BY lifetime_value DESC)`. Dense, not regular: regular rank skips numbers after ties, so `rnk <= 3` would still cut off the third tier if rank 2 had two rows.

#### Step 4: Filter to top 3 tiers

Outer `SELECT ... WHERE rnk <= 3`. This is the only place the `<= 3` shows up. Do not also add `LIMIT 3`, you will reintroduce the bug you just avoided.

---

### The solution

**TOP CONTENT BY WATCH TIME**

```sql
WITH content_stats AS (
  SELECT
    ci.content_id,
    ci.title,
    SUM(cv.watch_seconds) AS lifetime_value,
    COUNT(*) AS view_count
  FROM content_items ci
  INNER JOIN content_views cv ON ci.content_id = cv.content_id
  WHERE cv.watch_seconds IS NOT NULL
  GROUP BY ci.content_id, ci.title
  HAVING view_count >= 1
),
ranked AS (
  SELECT
    content_id,
    title,
    lifetime_value,
    DENSE_RANK() OVER (ORDER BY lifetime_value DESC) AS rnk
  FROM content_stats
)
SELECT content_id, title, lifetime_value
FROM ranked
WHERE rnk <= 3
```

> **Cost Analysis**
>
> 1B-row `content_views` is the fact, 500K `content_items` is the dim. Planner builds a hash on `content_items.content_id` and probes from `content_views`. The aggregate is a hash group by on `content_id` (500K groups), then a sort for `DENSE_RANK`. If `content_views` is partitioned on `viewed_at`, no pruning helps here, the question is lifetime.

> **Interviewers Watch For**
>
> Say out loud: `does top 3 mean 3 rows or 3 distinct totals?` The prompt resolves it (3 tiers including ties), but asking proves you spotted the ambiguity. Then call out that `LIMIT 3` and `ROW_NUMBER` both fail this spec, and `DENSE_RANK` is the fix.

> **Common Pitfall**
>
> Writing `RANK()` instead of `DENSE_RANK()`. With two items tied at rank 1, `RANK` jumps to rank 3 for the next item. `rnk <= 3` then returns the two ties plus the 3rd item, skipping the rank-2 slot entirely. `DENSE_RANK` gives 1, 1, 2, so `rnk <= 3` covers three actual tiers. If you wanted exactly the top 3 distinct totals, only `DENSE_RANK` does that.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you change this to top 3 per `content_type`? _(Probes whether you add `PARTITION BY ci.content_type` to the `DENSE_RANK` window and carry `content_type` through the first CTE.)_
- What if `watch_seconds` can exceed `duration_seconds` from replays, and we want capped watch time? _(Tests `LEAST(cv.watch_seconds, ci.duration_seconds)` inside the `SUM`, surfacing comfort with row-level expressions inside aggregates.)_
- How would you do this without a window function? _(Looks for a self-join or scalar subquery against the third-highest distinct `lifetime_value`. Forces you to articulate why `DENSE_RANK` is cleaner.)_
- The query is too slow. What do you change first? _(Expects you to push the `IS NOT NULL` filter and aggregate down before the join, or pre-aggregate `content_views` by `content_id` in a daily rollup table.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_content_by_watch_time)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.