# Listens From Everywhere, Counted Once

> Phones, tablets, laptops. And some of them report late.

Canonical URL: <https://datadriven.io/problems/listens_from_everywhere_counted_once>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Our users stream music on phones, desktop apps, smart speakers, and web browsers. We want to show artists and labels near-real-time stats on how their tracks are performing, and we need a historical record for royalty calculations. Design a data pipeline that collects listening events from all our device types.

## Worked solution and explanation

### Why this problem exists in real interviews

Two consumers with opposite correctness budgets , the artist dashboard wants minutes-fresh and approximate is fine; royalty calculation wants exact-once on event-time, retried, and deletable on demand. The trap is one stream that satisfies neither, or two pipelines that drift, or a royalty path that uses arrival time and over-pays for events that buffered offline for hours.

The default reach is one stream that aggregates listens into a counts table both the dashboard and royalties read. The dashboard updates within minutes, royalties run off the same table at month-end. Mobile clients buffer offline and dump events on reconnect; the royalty count for the prior month inflates because the buffered events landed in this month's aggregation. A user requests deletion and the team realizes the aggregations don't carry the user-id, so deletion can only happen at the raw layer and the aggregates stay polluted.

> **Trick to Solving**
>
> Two paths off one durable archive, event-time everywhere, dedup on a stable id for royalties, deletion that propagates through aggregates.
> 
> 1. All listens land in a durable raw archive partitioned by event-time on the device. Both the dashboard and royalties read derivatives of the archive; neither is authoritative on its own.
> 2. The dashboard runs a streaming aggregator on the same source for near-real-time counts; approximate is fine.
> 3. Royalty calculation runs nightly off the archive with dedup on a stable listen id, attributing each listen to its event-time. Buffered offline listens land in the right month.
> 4. Deletion is an event on the same path as the listens; it propagates through the archive and triggers a recompute of the affected aggregations.

---

### Walk the requirements

#### Step 1: Two paths off one archive, sized for two correctness budgets

Listens land in a durable raw archive partitioned by event-time. The dashboard runs a streaming aggregator off the same source and updates within minutes; royalty calculation runs nightly against the archive. The two paths share a source of truth (the archive) and diverge in correctness budget: approximate is fine for the dashboard, exactness is the budget for royalties. Forcing royalty calculation to read from the dashboard's aggregate is the version where royalty payouts inherit the dashboard's approximations; forcing the dashboard to wait for royalty exactness is the version that misses the freshness target.

#### Step 2: Royalty payouts exact, dedup on a stable listen id

Each qualifying listen carries a stable listen id from the client. Royalty calculation reads the archive, dedups on the id, applies the qualifying-listen rule, and counts each id once. Idempotent computation keyed on (artist, track, period) means a re-run produces the same payout. A retried event, a duplicate from a buggy client, or a late-arriving offline buffer all collapse to one count because the id is the contract. Counting before dedup, or trusting the streaming aggregate, is the version where royalty payouts shift between recalculations.

#### Step 3: Attribute each event to its on-device time, not arrival time

Mobile and smart-speaker clients buffer offline and replay hours later. Each event carries the device's event-time; the archive partitions on event-time, and royalties attribute each listen to its event-time month. A listen that happened in March and landed in April still counts toward March's payout. Attributing on arrival time inflates the month the buffer drained into and short-changes the month the listen actually happened; the artist's payout for March drops and they ask why.

#### Step 4: Deletion propagates from the archive through the aggregates

A deletion request enters as an event on the same path as the listens. The archive removes (or tombstones) the user's events for the requested window; the aggregations that included those events recompute for the affected partitions; the streaming dashboard's running aggregate is corrected on the next refresh window. Without propagation, the archive is clean but the royalty payouts and dashboard counts still reflect the deleted user's listens, which is the audit failure the requirement names.

---

### The shape that fits

> **What this design gives up**
>
> Two paths off one archive doubles the per-event compute (streaming aggregate plus archive write). Event-time partitioning means late buffer drains trigger small recomputes of older partitions. Deletion that propagates through aggregations adds a control plane that has to track recomputes. Operational cost is the price; the win is artist dashboards that feel live, royalty payouts that don't shift between runs, listens that count for the month they happened, and deletion that the audit can verify.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - All listens land in a durable raw archive in cold storage, partitioned by event-time.
> - A streaming path serves the artist dashboard within minutes; a batch path serves royalty calculation off the archive nightly.
> - Royalty calculation dedups on a stable listen id and attributes each event to its on-device timestamp.
> - A deletion control plane removes user events from the archive and recomputes affected aggregates.

> **The mistake that ships**
>
> What gets shipped runs one streaming aggregator into a counts table and tells royalty to read from it nightly. Mobile buffer drains push last month's listens into this month's count; royalty payouts inflate for the wrong month and an artist asks why their March drop matched a popular track and their April spike didn't. A user requests deletion; the team finds the listens in the raw topic but the aggregates carry no user id, so deletion goes through at the source while the warehouse aggregates keep including the events. The eventual rebuild is event-time partitioning, dedup on listen id, and a deletion control plane that recomputes affected aggregates.

---

## Common follow-up questions

- A buggy client sends every listen twice for a week. What in this design protects royalty payouts, and what does the artist dashboard show during that week? _(Tests whether the candidate sees that royalty dedup on listen id absorbs the duplicates (the payout is right), but the dashboard's streaming aggregator may show inflated counts until the dedup contract is enforced upstream. The two paths can drift on dashboard accuracy without affecting royalty correctness.)_
- Royalty rules change retroactively (a track's qualifying-listen threshold is adjusted). How does this design recompute past payouts, and what doesn't change? _(Tests whether the candidate sees the archive as immutable history, the royalty batch as the place where rules live, and rule changes as a re-run of the batch over the affected periods writing partition-overwrites to the royalty warehouse. The archive doesn't change; the dashboard doesn't care; the payouts update.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/listens_from_everywhere_counted_once)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.