# The Ones Who Hold Attention

> Plenty of creators get the click. Find the ones who actually keep people watching.

Canonical URL: <https://datadriven.io/problems/the-ones-who-hold-attention>

Domain: SQL · Difficulty: medium · Seniority: mid

## Problem

The content team has noticed that raw view counts are flattering the wrong people: a creator can rack up opens and still lose everyone in the first ten seconds, while a quieter creator keeps people glued to the end. Leadership wants to reward the second kind. For every creator who has had at least one of their pieces actually watched, work out how completely their audience tends to make it through a piece, on average, treating each view's progress as the fraction of the piece's length that was watched (watch_seconds divided by that piece's duration_seconds), and only counting pieces whose duration_seconds is greater than zero. Return three columns per creator: creator_id, how many views fed into that figure (total_views), and the average completion fraction rounded to 4 decimal places (avg_completion_ratio). Present the creators in order of who holds attention best, with the most engaging at the top; break ties by creator_id ascending.

## Worked solution and explanation

### Why this problem exists in real interviews

Engagement work almost always starts by separating a vanity metric (a view, an open, a click) from a quality metric (did the person actually stay). Here the quality signal lives across two tables: content_items holds the length of each piece in duration_seconds and who made it in creator_id, while content_views holds how long each viewer actually stayed in watch_seconds. The candidate has to recognize that 'completeness' is a ratio that must be computed per view before it can be averaged per creator, and that the two tables only line up through content_id. Interviewers ask this because it forces a correct join grain plus a per-row derived value plus an aggregate, which is exactly where people conflate 'average of ratios' with 'ratio of averages'.

---

### Break down the requirements

#### Step 1: Tie each view to its piece

A view in content_views only knows its content_id; the piece's length and author live in content_items. Match the two on content_id so each view row gains its piece's duration_seconds and creator_id. An inner join is correct here because a view with no matching piece (and a piece with no views) contributes nothing to an attention metric.

#### Step 2: Turn each view into a completion fraction

Per view, completion is watch_seconds divided by duration_seconds. Force real division (multiply by 1.0) so integer columns do not truncate to zero, and guard against duration_seconds being zero in the WHERE clause so you never divide by zero.

#### Step 3: Roll up per creator and rank

Collapse to one row per creator_id, averaging the per-view completion fractions and counting the views that fed the average, then round the average to 4 decimal places. Present highest completion first; a secondary order on creator_id makes ties deterministic for grading.

---

### The solution

**Average completion per creator**

```sql
SELECT ci.creator_id, COUNT(cv.view_id) AS total_views, ROUND(AVG(cv.watch_seconds * 1.0 / ci.duration_seconds), 4) AS avg_completion_ratio FROM content_items ci JOIN content_views cv ON cv.content_id = ci.content_id WHERE ci.duration_seconds > 0 GROUP BY ci.creator_id ORDER BY avg_completion_ratio DESC, ci.creator_id;
```

> **Cost Analysis**
>
> At platform scale content_views is the giant table (think 4B view rows / ~600 GB, partitioned by viewed_at day) while content_items is small (tens of millions). The optimizer will broadcast or hash the small content_items side keyed on content_id and stream the large views side, so the join stays a single hash-probe pass and the per-creator aggregate is a hash aggregate with cardinality bounded by the number of creators, not views. The only real cost is the full scan of views; pruning by viewed_at when leadership only cares about a recent window turns this from a full-table scan into a partition scan.

> **Interviewers Watch For**
>
> The strong signal is computing the ratio at the view grain and then averaging, rather than SUM(watch_seconds)/SUM(duration_seconds), and being able to articulate that these answer different business questions (typical viewer experience vs. aggregate throughput). Naming the zero-duration guard and the 1.0 cast unprompted also reads as someone who has been burned by silent integer truncation in production.

> **Common Pitfall**
>
> The most common mistake is writing AVG(watch_seconds) / AVG(duration_seconds), which is a ratio of averages and quietly over-weights long pieces, giving a different (and wrong) answer than the average completion the prose asks for. The second is integer division: watch_seconds / duration_seconds without the 1.0 cast floors most ratios to 0 and every creator ties at zero.

---

## Common follow-up questions

- How would you exclude creators whose ranking is driven by a handful of views, say only keeping creators with enough views to be trustworthy? _(Tests moving a count condition into HAVING and reasoning about statistical reliability of small samples.)_
- If watch_seconds can exceed duration_seconds because of replays, how would you cap completion at 100% before averaging? _(Tests handling dirty data with a CASE or MIN expression at the per-row grain rather than after aggregation.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-ones-who-hold-attention)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.