# Who Holds Attention

Canonical URL: <https://datadriven.io/problems/who-holds-attention>

Domain: PySpark · Difficulty: hard · Seniority: mid

## Problem

We run a short-video platform where the catalog of videos is small but views pile up fast, and we want to know which creators actually hold attention rather than just publish volume. Pair each view with its video's creator and, for each creator, report the total seconds watched and the number of views, keeping only creators whose videos drew more than 1000 total seconds of watch time. List the most-watched creators first.

## Worked solution and explanation

This is a textbook fact-into-dimension join wearing a creator-analytics costume. The view log is the fact (tens of millions of rows); the catalog is a tiny dimension that just carries content_id to creator_id. The skill being probed is whether you recognize that you should NOT shuffle the giant side. A default join keys both inputs by content_id and shuffles them across the cluster, dragging 40M view rows over the network for nothing. Broadcast the catalog instead and the fact never moves. Miss that and your one cheap join becomes the most expensive stage in the job.

---

### The two shuffles, and which one you can kill

#### Step 1: Broadcast the catalog, not the view log

content_items is small enough to fit in executor memory, so wrap it in F.broadcast(). Spark ships one copy to every executor and joins locally against whatever view partitions already live there. No shuffle on the big side. Without the hint, Spark may still pick a sort-merge join and repartition 40M rows by content_id first.

#### Step 2: Accept the groupBy shuffle, then watch it for skew

Grouping by creator_id is a wide transformation: it has to exchange rows so every creator lands on one partition. That shuffle is unavoidable here, but it is where a viral creator with millions of views melts one task while the rest idle. Knowing the difference between the join shuffle (avoidable) and the aggregation shuffle (intrinsic) is the senior tell.

#### Step 3: Filter after aggregating

The more-than-1000-seconds rule is a post-aggregation condition: it reads the summed total_watch_seconds, so it cannot run until the groupBy completes. Apply it after agg, then order by total_watch_seconds descending.

---

### The job, stage by stage

### The solution

**Per-creator watch summary**

```python
from pyspark.sql import functions as F

views = spark.table('content_views')
catalog = spark.table('content_items')

result = (
    views
    .join(F.broadcast(catalog), 'content_id')
    .groupBy('creator_id')
    .agg(
        F.sum('watch_seconds').alias('total_watch_seconds'),
        F.count('view_id').alias('view_count'),
    )
    .filter(F.col('total_watch_seconds') > 1000)
    .orderBy(F.col('total_watch_seconds').desc())
)
```

*F.broadcast() pins the small catalog; the only shuffle left is the unavoidable groupBy.*

> **Cost Analysis**
>
> At ~40M view rows against a catalog small enough to fit in executor memory, a sort-merge join would sort and shuffle both sides: two exchanges, both dominated by the 40M-row fact. The broadcast join removes the fact-side exchange entirely, so the catalog (tens of MB once projected to content_id and creator_id) is sent once per executor and the views stay on their original partitions. You go from two big shuffles to one, and the remaining one only moves the aggregated key space.

> **Interviewers Watch For**
>
> Say out loud why the catalog is the broadcast side (it is the small one), and name the autoBroadcastJoinThreshold (default 10MB) that decides whether Spark does this for you. Then volunteer that the groupBy still shuffles and could skew on a viral creator. That sentence is the difference between 'I know the join API' and 'I understand the physical plan.'

> **Common Pitfall**
>
> Broadcasting the wrong side. Calling F.broadcast(views) on the 40M-row fact tries to collect it to the driver and ship it everywhere, blowing the driver heap or silently falling back to a regular join. Broadcast is for the SMALL table only.

---

## Common follow-up questions

- One viral creator owns 30% of all views. The groupBy stage hangs on a single task. What do you do? _(Tests skew handling: salting the key, AQE skew-join/skewed-partition handling, or pre-aggregating.)_
- What happens if you join this one large view log against many small dimension tables at once? _(Tests whether they know each broadcast is cheap alone but many concurrent broadcasts pressure the driver and re-ship per stage; chaining vs. a single denormalized broadcast.)_
- The catalog grows past the broadcast threshold. How does the plan change and how would you keep it fast? _(Tests the fallback to sort-merge join and mitigations like bucketing both tables on content_id to avoid the shuffle.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/who-holds-attention)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.