# Where the Hours Go

Canonical URL: <https://datadriven.io/problems/where-the-hours-go>

Domain: PySpark · Difficulty: medium · Seniority: mid

## Problem

We run a streaming platform and want to know where our viewers actually spend their time, counting the seconds of every session they open. Break that total down by the kind of device they watch on, most time first.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a fact-to-dimension join followed by a per-device rollup, dressed up as an engagement question. The session seconds live in the giant sessions fact table; the device type lives in the small devices dimension. Anyone can write the join and the groupBy. The thing an interviewer is actually watching for is whether you notice you are about to trigger TWO shuffles when the join only needs one, because you reached for a default hash join instead of broadcasting the tiny side. Get that wrong on a 40M-row sessions table and you shuffle the whole thing across the network before you have even started aggregating.

---

### Break down the requirements

#### Step 1: Attach each session to its device

session_duration_sec sits in user_sessions, device_type sits in devices. Join on device_id. devices is the small dimension, so this join should be a broadcast, not a shuffle of the fact table.

#### Step 2: Total the seconds per device type

groupBy('device_type') then sum('session_duration_sec'). This is the one wide transformation you genuinely cannot avoid: the sum has to see every session for a device type, so those rows must land on the same partition.

#### Step 3: Order by the total, biggest first

orderBy the aggregated column descending. The sort runs on the aggregated rows (one per device type), which is a tiny dataset, so it is cheap.

---

### The solution

**Total session time per device type**

```python
from pyspark.sql import functions as F

result = (
    user_sessions
    .join(F.broadcast(devices), on='device_id', how='inner')
    .groupBy('device_type')
    .agg(F.sum('session_duration_sec').alias('total_session_seconds'))
    .orderBy(F.col('total_session_seconds').desc())
)
```

> **Cost Analysis**
>
> Broadcasting devices ships the dimension to every executor once, so the 40M-row sessions table never moves for the join: the join becomes a narrow, map-side operation. Only the groupBy forces a shuffle, and it shuffles partial sums, not raw rows, because Spark does a map-side pre-aggregation first. One shuffle instead of two, and the bytes crossing the wire are a fraction of the fact table.

> **Interviewers Watch For**
>
> State out loud which side you broadcast and why (the device dimension is small enough to fit in each executor's memory). Mention that the join is narrow and the groupBy is the only wide stage. If asked about coalesce vs repartition here: after aggregating to a handful of device rows you want coalesce to collapse empty output partitions without another shuffle, never repartition.

> **Common Pitfall**
>
> Letting Spark default to a sort-merge join shuffles both sides on device_id before the aggregation even begins, doubling the wide stages. The second classic miss is skew: if one device type (say mobile) owns most sessions, millions of rows land on a single reduce task and stall the whole job on one straggler executor while the rest sit idle.

---

## Common follow-up questions

- The dimension you broadcast grows past what fits in executor memory. How does your plan change? _(Tests whether they know broadcast has a size ceiling and can fall back to a shuffle/sort-merge join with sensible partitioning.)_
- One device type accounts for 30 percent of all sessions. How do you keep that partition from stalling the job? _(Tests skew awareness: salting the key, adaptive query execution skew handling, or a two-stage aggregation.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/where-the-hours-go)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.