# Average Watch Time by Format

> Which content format keeps viewers watching the longest

Canonical URL: <https://datadriven.io/problems/average_watch_time_by_format>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The content team is deciding which formats to invest in for next quarter. For each content type, show the average watch time across all views so they can see which formats hold attention.

## Worked solution and explanation

### Why this problem exists in real interviews

Averaging a fact-table metric grouped by a dimension attribute is the single most common shape of interview SQL. Every analytics team lives on this pattern: average session length by platform, average order value by channel, average watch time by format. Interviewers use it to check whether you pick the right join (INNER vs LEFT), pull the metric from the correct table, and group at the right grain. Candidates trip when they grab a column from the wrong table (duration vs watch_seconds) or pick LEFT JOIN when the prompt says to ignore unmatched rows.

---

### Break down the requirements

#### Step 1: Join content_items to content_views on content_id

The dimension attribute (content_type) lives on content_items. The metric (watch_seconds) lives on content_views. An INNER JOIN keeps only content that was actually watched, which is what 'average watch time' asks for. LEFT JOIN would drag in unwatched content with NULL watch_seconds and silently bias the average down for types with a long tail of unwatched items.

#### Step 2: Group by content_type at the dimension grain

One row per content_type in the output. Grouping by the column from content_items (aliased) lines up cleanly with the dimension and avoids the trap of grouping by a synonym on the fact side.

#### Step 3: Aggregate watch_seconds with AVG

AVG(cv.watch_seconds) computes the mean at the grouped grain. AVG ignores NULLs automatically, so you do not need a COALESCE or a WHERE filter to protect it.

#### Step 4: Sort alphabetically by content_type

The spec says alphabetically by content_type, not by the metric. ORDER BY ci.content_type (ASC is the default) locks in deterministic output and matches the grader exactly.

---

### The solution

**Inner join, group by dimension, average the fact**

```sql
SELECT
    ci.content_type,
    AVG(cv.watch_seconds) AS avg_watch_time
FROM content_items ci
INNER JOIN content_views cv ON ci.content_id = cv.content_id
GROUP BY ci.content_type
ORDER BY ci.content_type
```

> **Cost Analysis**
>
> The fact table (content_views) drives cost. Hash-join on content_id is O(V + C) where V is views and C is content items. GROUP BY on a low-cardinality column (a handful of content types) is cheap and typically reduces billions of view rows to a few output rows. The optimizer will push the aggregate below the join if statistics allow, collapsing row counts early.

> **Interviewers Watch For**
>
> Whether you state the join type out loud and justify it (INNER because we only average what was watched), whether you pull the metric from content_views (not duration_seconds on content_items, which is a totally different concept), and whether you sort by the dimension the spec asked for rather than the metric. Strong candidates mention that AVG skips NULLs automatically and flag that the definition of 'average' is per-view, not per-content.

> **Common Pitfall**
>
> Averaging duration_seconds from content_items. That is the length of the content, not the time watched, and it will pass a superficial code review because the column name sounds plausible. Another classic: LEFT JOIN followed by AVG, which produces NULLs for unwatched rows. AVG skips those NULLs, so the number is correct but the row count interpretation is wrong, and INNER JOIN is the honest expression of intent.

---

## Common follow-up questions

- How would you also return the number of views per content_type so a consumer can see sample size? _(add COUNT(cv.view_id) or COUNT(*) to the SELECT; explain why COUNT(*) and COUNT(cv.view_id) differ only if cv rows could be NULL (here they cannot, given INNER JOIN).)_
- What if the product team wanted to exclude views shorter than 5 seconds as bot traffic? _(add WHERE cv.watch_seconds >= 5 before the GROUP BY; discuss whether the filter belongs in the ON clause or WHERE and why it does not matter for INNER JOIN.)_
- How would you compute a weighted average where each content_type contributes proportionally to total watch_seconds, not per-view? _(SUM(watch_seconds) / SUM(duration_seconds), or a window function over total watch time; show the distinction between simple and weighted averages.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_watch_time_by_format)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.