# High-Output Creators

> High engagement creators.

Canonical URL: <https://datadriven.io/problems/high_output_creators>

Domain: SQL · Difficulty: easy · Seniority: L4

## Problem

The editorial team is evaluating high-volume creators who have produced at least an hour of total content (3,600 seconds). For each qualifying creator, show their average duration per item, sorted from longest average to shortest.

## Worked solution and explanation

### Why this problem exists in real interviews

Two aggregates over the same group, used for different purposes: `SUM` gates who qualifies (an hour of total runtime) and `AVG` shapes the output and the sort. Mixing the two up, filtering on `AVG` or computing `SUM` in the `SELECT`, is the failure mode. The HAVING clause is the natural home for the gate.

---

### Break down the requirements

#### Step 1: Group by creator

Every metric lives at the creator grain. `GROUP BY creator_id` is the only sensible key because `content_items` is one row per piece of content, not per creator.

#### Step 2: Gate on total runtime, not per-item

`HAVING SUM(duration_seconds) >= 3600` filters the groups. Putting the threshold in `WHERE` would filter individual items, which is not what an hour of total content means.

#### Step 3: Project the average and sort

`AVG(duration_seconds)` is independent of the gate. `ORDER BY avg_duration DESC` puts the long-form creators at the top, which is what an editorial team would scan first.

---

### The solution

**HIGH OUTPUT CREATORS**

```sql
SELECT creator_id,
       AVG(duration_seconds) AS avg_duration
FROM content_items
GROUP BY creator_id
HAVING SUM(duration_seconds) >= 3600
ORDER BY avg_duration DESC
```

> **Cost Analysis**
>
> Full scan of 5M rows, hashed by `creator_id`. There is no predicate to push down. A pre-aggregated `creator_daily` rollup would shrink the scan dramatically if this query runs often, but for an ad hoc editorial review the raw scan is fine.

> **Interviewers Watch For**
>
> Whether you reach for `WHERE duration_seconds >= 3600` and then average. That filters long items, not prolific creators. Articulating that `HAVING` operates on group-level aggregates while `WHERE` operates on rows is the differentiator here.

> **Common Pitfall**
>
> `SUM` and `AVG` both ignore `NULL` rows. A creator with one 4000-second video and one row where `duration_seconds` is `NULL` will show `avg_duration = 4000`, not 2000. Decide with the editorial team whether nulls are bugs, drafts, or deliberate.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Restrict this to creators whose first publish was in the last 12 months. _(Tests whether you add a `MIN(publish_date)` condition to `HAVING` or filter inside a subquery.)_
- Return the median duration per creator instead of the average. _(Probes knowledge of `PERCENTILE_CONT` or `PERCENTILE_DISC` and how engines differ on syntax.)_
- Now break the average out by `content_type` within each qualifying creator. _(Checks comfort with multi-key grouping and whether the hour threshold stays at the creator level or moves to the type level.)_

> **Why editorial cares**
>
> An hour of cumulative runtime is the rough cutoff where a creator graduates from one-off contributor to recurring talent. The average duration then tells the team whether to slot them into short clips, mid-form explainers, or long features.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/high_output_creators)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.