# The Ones Who Finish

> Plenty of members press play. Far fewer reach the final rep. Find the formats that hold them.

Canonical URL: <https://datadriven.io/problems/the-ones-who-finish>

Domain: SQL · Difficulty: medium · Seniority: mid

## Problem

A fitness streaming app measures engagement by how far members get through each clip before dropping off, comparing time watched against the clip's full length. For views logged this year, find the average completion and the number of views per content format, ignoring items with no recorded length, with the most absorbing format first.

## Worked solution and explanation

### Why this problem exists in real interviews

This looks like an engagement report, but it is really a grain-and-ratio question wearing a fitness costume. The metric lives in content_views (watch_seconds) while the denominator lives in content_items (duration_seconds), so you cannot compute completion until you join, and once joined the only honest unit is one ratio per view. The trap is what 'average completion' means: do you average the per-view ratios, or do you divide total watch by total length? Those are different numbers, and picking the wrong one silently lets a few long clips dominate the format's score.

---

### Break down the requirements

#### Step 1: Attach each view to its clip

watch_seconds has no meaning without the clip length next to it, so join content_views to content_items on content_id. After the join every row is one view that knows both how long the member watched and how long the clip runs.

#### Step 2: Scope to this year and guard the denominator

strftime('%Y', cv.viewed_at) = '2026' keeps the current era. duration_seconds > 0 is the 'ignoring items with no recorded length' clause made literal: an article or a zero-length item would make watch_seconds / duration_seconds divide by zero and surface as a stray null format, and dropping those rows is the correct reading since a clip with no length has no completion to measure.

#### Step 3: Average the per-view ratio, by format, and count the views

GROUP BY content_type, then AVG(watch_seconds * 1.0 / duration_seconds) for the metric and COUNT(*) for how many views backed it. The * 1.0 forces real division in SQLite; without it integer math floors most ratios to 0. Order by that average descending so the format that holds attention sits on top.

---

### The solution

**Average completion by format**

```sql
SELECT ci.content_type AS content_format,
       ROUND(AVG(cv.watch_seconds * 1.0 / ci.duration_seconds), 3) AS avg_completion_rate,
       COUNT(*) AS view_count
FROM content_views cv
JOIN content_items ci ON cv.content_id = ci.content_id
WHERE strftime('%Y', cv.viewed_at) = '2026'
  AND ci.duration_seconds > 0
GROUP BY ci.content_type
ORDER BY avg_completion_rate DESC
```

> **Average of ratios vs ratio of averages**
>
> AVG(watch/duration) weights every view equally: a 10 second short watched fully counts as much as a 60 minute livestream watched fully. SUM(watch)/SUM(duration) instead weights by seconds, so a handful of long sessions can swamp the format. Both are defensible, but they answer different questions. State which one you are computing before you type, because the interviewer is listening for exactly that sentence.

> **Common pitfall**
>
> Writing AVG(watch_seconds / duration_seconds) without the * 1.0 (or a CAST). In SQLite both columns are integers, so the division truncates to 0 for nearly every partial view and your averages collapse toward zero while looking plausible. The other classic miss is leaving duration_seconds > 0 out, which divides by zero on the first zero-length item and leaks a null-completion format into the result.

> **Interviewers watch for**
>
> Whether you notice completion can exceed 1.0 when a member rewatches a clip (watch_seconds > duration_seconds), and whether you decide to cap it with MIN(ratio, 1.0) or leave it raw. Naming that edge, even if you choose to leave it uncapped, signals you understand the metric rather than just translating words into a query.

> **Cost analysis**
>
> At streaming scale content_views is the billion-row fact (say 2B rows, partitioned by viewed_at) and content_items is a small dimension (low millions). The year predicate prunes partitions before the join, and the join is a hash build on the tiny dimension side, so the plan stays a single pruned scan plus a broadcast hash. The aggregation is a streaming group-by on a five-value key. No window, no self-join.

---

## Common follow-up questions

- Rewrite this as total watch time over total length per format. When would a product team prefer that number over the average of ratios? _(Tests whether they can express the ratio-of-averages form and reason about weighting by duration vs by view.)_
- Only count a view as a 'finish' if completion reaches at least 90 percent, then report the finish rate per format. _(Pushes toward a CASE-based 0/1 collapse: AVG(CASE WHEN watch/duration >= 0.9 THEN 1 ELSE 0 END).)_
- How would you keep this incremental over a rolling 28 day window on a 2B row fact? _(Surfaces partition pruning on viewed_at and pushing the date predicate onto the base scan.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-ones-who-finish)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.