# The Same Stream Twice

Canonical URL: <https://datadriven.io/problems/the-same-stream-twice>

Domain: Pipeline Design · Difficulty: hard · Seniority: senior

## Problem

A global streaming-video platform collects about 2 billion playback heartbeat events a day from 150 million subscribers, and two teams read the same feed: reliability needs rebuffering spikes per title and region surfaced within seconds so on-call can be paged, while finance pays studios royalties on exact minutes-watched and cannot tolerate a single double-counted or dropped event. Design the pipeline so both teams consume the same durable ingest independently, with the live alerting path running approximate and fast while the daily royalty report counts each raw event exactly once instead of reusing the live aggregation. Keep a malformed heartbeat from a bad device build from stalling the live path.

## Worked solution and explanation

### Why this problem exists in real interviews

This is two consumers with opposite correctness budgets reading one event stream, dressed up as a streaming dashboard. Reliability wants approximate-and-instant: a rebuffering spike on a title in seconds, off-by-a-few is fine. Finance wants exact-and-patient: royalty payments to studios on minutes-watched, where one double-counted retry is a contractual problem. The trap is the single aggregation that serves both: it is too approximate to bill on and, once you bolt exactness onto it, too slow to page on-call.

The whiteboard answer is one stream that writes a 'minutes_watched' table that the QoE dashboard and the royalty report both read. It looks elegant for a week. Then a producer retries during a deploy, the live count and the royalty count both inflate, and finance overpays a studio. Or a malformed heartbeat from a bad firmware build crashes the consumer, the consumer restarts and replays the same offset, and reliability goes blind for hours. Three requirements (fast alerting, exact billing, and fault isolation) failed quietly because they were never separated.

> **Trick to Solving**
>
> One durable ingest, then two paths sized for two correctness budgets.
> 
> 1. The live QoE path is streaming and approximate: detect per-title, per-region rebuffering in seconds and page on-call.
> 2. The royalty path is a daily batch that reads the same queue directly and counts exactly once, deduped on a stable event_id, never reusing the live aggregation.
> 3. Both consume the same queue from their own offsets, and bad events go to a dead-letter quarantine so the live path never stalls.

---

### Walk the requirements

#### Step 1: Land everything in one durable queue

Heartbeats hit a partitioned message queue first. Two consumers (the stream processor and the daily batch) read from their own committed offsets on that same queue, so neither can starve the other and either can replay independently. The subtle failure is chaining the royalty batch off the stream processor's output instead of off the raw queue: now royalties inherit the live path's approximations and lose their own replay. Two separate collection paths drift just as badly, and the moment they disagree nobody can say which number is right.
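
As a rough illustration of "their own committed offsets", the sketch below models one partition as an append-only list with a committed offset per consumer group. The class `PartitionLog` and the group names `qoe-stream` and `royalty-batch` are hypothetical stand-ins, not the API of any particular queue.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Append-only log standing in for one partition of the durable queue."""
    events: list = field(default_factory=list)
    # Committed offset per consumer group; each group advances independently.
    committed: dict = field(default_factory=dict)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just appended

    def poll(self, group: str, max_events: int = 100) -> list:
        start = self.committed.get(group, 0)
        return self.events[start:start + max_events]

    def commit(self, group: str, count: int) -> None:
        self.committed[group] = self.committed.get(group, 0) + count

    def rewind(self, group: str, offset: int = 0) -> None:
        # A replay moves only this group's position; other groups are untouched.
        self.committed[group] = offset


log = PartitionLog()
for i in range(5):
    log.append({"event_id": f"e{i}"})

# The live path reads everything and commits.
live = log.poll("qoe-stream")
log.commit("qoe-stream", len(live))

# The royalty batch reads at its own pace: two events so far.
royalty = log.poll("royalty-batch", max_events=2)
log.commit("royalty-batch", len(royalty))

# Replaying the live path does not move the royalty position.
log.rewind("qoe-stream", 0)
```

After the rewind, `qoe-stream` sees all five events again while `royalty-batch` still resumes from offset 2, which is the independence the step depends on.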

#### Step 2: Stream the QoE path; approximate is the budget

A stream processor windows heartbeats by title and region and flags rebuffering above a baseline, emitting to an alert destination within 30 seconds. Counting a viewer in transition slightly wrong is acceptable here; the live feel is what matters. The 'compute it exactly before we alert' version is the one where on-call learns about the incident from Twitter.
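
A minimal sketch of the windowing idea, assuming 30-second tumbling windows and a fixed baseline ratio. The field names (`ts`, `title_id`, `region`, `rebuffering`), the thresholds, and the function `find_spikes` are illustrative assumptions, and a real stream processor would evaluate windows incrementally rather than over a list.

```python
from collections import defaultdict

WINDOW_SECONDS = 30             # assumed window size, matching the 30-second target
BASELINE_REBUFFER_RATIO = 0.05  # hypothetical baseline; tune per title and region
MIN_HEARTBEATS = 20             # hypothetical floor so tiny windows do not page anyone


def window_start(ts: float) -> int:
    """Start of the tumbling window that contains timestamp ts."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS


def find_spikes(heartbeats):
    # (window, title, region) -> [rebuffering heartbeats, total heartbeats]
    counts = defaultdict(lambda: [0, 0])
    for hb in heartbeats:
        key = (window_start(hb["ts"]), hb["title_id"], hb["region"])
        counts[key][0] += 1 if hb["rebuffering"] else 0
        counts[key][1] += 1

    alerts = []
    for (win, title, region), (rebuf, total) in sorted(counts.items()):
        ratio = rebuf / total
        if total >= MIN_HEARTBEATS and ratio > BASELINE_REBUFFER_RATIO:
            alerts.append(
                {"window": win, "title_id": title, "region": region, "ratio": ratio}
            )
    return alerts


# Title t1 in eu: 10 of 40 heartbeats rebuffering. Title t2 in us: 1 of 40.
heartbeats = [
    {"ts": i * 0.5, "title_id": "t1", "region": "eu", "rebuffering": i % 4 == 0}
    for i in range(40)
] + [
    {"ts": i * 0.5, "title_id": "t2", "region": "us", "rebuffering": i == 0}
    for i in range(40)
]

alerts = find_spikes(heartbeats)
```

Nothing here dedups or waits for stragglers; that is the point of the approximate budget, and it is also why these counts are not fit to bill on.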

#### Step 3: Make royalty exactly-once at the sink, not in the stream

The daily batch reads the raw heartbeats straight from the queue, where each event carries a stable event_id from the player. It dedups on that id and writes idempotently to the warehouse, so a producer retry is absorbed and a window-edge event is not dropped. Exactly-once is a property you enforce at the sink with a key, not a hope you pin on the streaming layer. Billing from the approximate live table is the mistake that becomes an overpayment.
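
One way to picture "exactly-once at the sink with a key" is a table whose primary key is the event_id, loaded with an insert that ignores conflicts. The sketch uses an in-memory SQLite database purely as a stand-in for the warehouse; the table name `royalty_events` and the `watched_seconds` column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute(
    """
    CREATE TABLE royalty_events (
        event_id        TEXT PRIMARY KEY,  -- stable id assigned by the player
        title_id        TEXT NOT NULL,
        watched_seconds INTEGER NOT NULL
    )
    """
)


def load_batch(conn, events):
    """Idempotent load: an event_id that already exists is skipped."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR IGNORE INTO royalty_events "
            "VALUES (:event_id, :title_id, :watched_seconds)",
            events,
        )


batch = [
    {"event_id": "e1", "title_id": "t1", "watched_seconds": 10},
    {"event_id": "e2", "title_id": "t1", "watched_seconds": 10},
    {"event_id": "e1", "title_id": "t1", "watched_seconds": 10},  # producer retry
]

load_batch(conn, batch)
load_batch(conn, batch)  # rerunning the whole batch changes nothing

total = conn.execute(
    "SELECT SUM(watched_seconds) FROM royalty_events WHERE title_id = 't1'"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM royalty_events").fetchone()[0]
```

The retry inside the batch and the full rerun are both absorbed by the key, so the job can be replayed from the raw queue without inflating minutes-watched.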

#### Step 4: Quarantine poison pills, keep draining

Validation lives on the consumer: a heartbeat that fails schema validation (the missing title_id from the bad firmware build) routes to a dead-letter queue (DLQ), and the source offset is still acked so the main path keeps moving. A separate process triages the DLQ, and DLQ depth is alerted on so a firmware regression is visible. If a bad event can halt the consumer, one device build can blind reliability for hours.
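
The routing can be sketched as a consumer loop that counts every event as handled whether it was processed or quarantined. The required-field list and the names `validate` and `consume` are hypothetical, and the sketch covers only schema failures, not exceptions raised during processing.

```python
REQUIRED_FIELDS = ("event_id", "title_id", "region", "ts")  # hypothetical schema


def validate(hb):
    """Return None when the heartbeat is usable, else a reason string."""
    if not isinstance(hb, dict):
        return "not an object"
    missing = [f for f in REQUIRED_FIELDS if hb.get(f) is None]
    return f"missing fields: {', '.join(missing)}" if missing else None


def consume(batch, process, dead_letters):
    """Process good events, quarantine bad ones, and never stop on a bad one.

    Returns how many events were handled so the caller can commit that many
    offsets, whether each event was processed or dead-lettered.
    """
    handled = 0
    for hb in batch:
        reason = validate(hb)
        if reason is None:
            process(hb)
        else:
            dead_letters.append({"event": hb, "reason": reason})
        handled += 1
    return handled


processed, dead_letters = [], []
batch = [
    {"event_id": "e1", "title_id": "t1", "region": "eu", "ts": 1.0},
    {"event_id": "e2", "region": "eu", "ts": 2.0},  # bad build: no title_id
    {"event_id": "e3", "title_id": "t1", "region": "eu", "ts": 3.0},
]

handled = consume(batch, processed.append, dead_letters)
dlq_depth = len(dead_letters)  # the number worth alerting on
```

All three offsets can be committed even though one event was bad, and `dlq_depth` is the signal that surfaces the firmware regression.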

---

### The shape that fits

> **Scale + Cost**
>
> At 2B events/day with 5x tentpole spikes, the live path is the expensive one: it has to keep up in real time, so partition the queue on a high-cardinality key and autoscale the stream processor on consumer lag. The royalty batch is cheap and decoupled, running once a day off its own committed offsets, so an evening spike never competes with finance's close. Streaming everything would multiply the compute bill for no business gain on the royalty numbers, which only have to be right once a day.
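
For a sense of scale, the stated volumes convert to per-second rates as follows. Applying the 5x multiplier to the daily average is a simplifying assumption; a spike measured against an evening baseline would land higher.

```python
EVENTS_PER_DAY = 2_000_000_000
SUBSCRIBERS = 150_000_000
SPIKE_MULTIPLIER = 5  # applied to the daily average, which is a simplification
SECONDS_PER_DAY = 86_400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY      # about 23,148 events/second
peak_eps = avg_eps * SPIKE_MULTIPLIER           # about 115,741 events/second
per_subscriber = EVENTS_PER_DAY / SUBSCRIBERS   # about 13.3 heartbeats/subscriber/day

print(f"average: {avg_eps:,.0f}/s, spike: {peak_eps:,.0f}/s, "
      f"per subscriber: {per_subscriber:.1f}/day")
```

The live path has to be provisioned for the spike rate, while the daily batch only has to finish 2 billion events before the close.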

> **Interviewers Watch For**
>
> The tell of seniority is naming the per-consumer correctness budget out loud: approximate-and-live for QoE, exactly-once for royalties, and refusing to bill from the live aggregation. The other tell is wiring the royalty batch to the raw queue, not to the stream processor's output. Strong candidates also raise the dead-letter path unprompted and ask about late-arriving heartbeats at the daily close.

> **Common Pitfall**
>
> Chaining the royalty batch off the stream processor instead of the raw queue, so royalties silently inherit the live path's approximations and lose independent replay. The cousin mistake is streaming everything because real-time feels safer, then bolting a dedup onto the live table to make royalties exact: it slows the live path below the 30-second SLA and still under-counts at window edges. The third classic miss is no DLQ: the first malformed firmware heartbeat halts the consumer, it replays, and reliability is blind during the incident it most needed to see.

---

## Common follow-up questions

- A heartbeat arrives two hours after its day has already closed in the royalty warehouse. What does the batch do with it? _(Tests bounded-lateness watermarks plus an event_id-keyed correction run that lands the late event in the next day without double-counting.)_
- A tentpole release pushes the live path to 5x volume and consumer lag starts climbing. What keeps on-call from going blind? _(Tests partition strategy, autoscaling on lag, and the decoupling that stops the royalty batch from stealing resources from the live path.)_
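
For the first question, one possible shape of an event_id-keyed correction is sketched below: a late event is billed under the currently open day, and a replay of the same event is ignored. The ledger layout, the function `post_event`, and the dates are hypothetical, and the sketch leaves out the watermark that would bound how late an event may be.

```python
from datetime import date


def post_event(ledger, event, open_day):
    """Record an event under a billing day; return that day, or None if duplicate."""
    if event["event_id"] in ledger:
        return None  # already billed, on time or as an earlier correction
    # If the event's own day has closed, bill it to the open day as a correction.
    billing_day = max(event["event_day"], open_day)
    ledger[event["event_id"]] = {
        "billing_day": billing_day,
        "event_day": event["event_day"],
        "watched_seconds": event["watched_seconds"],
    }
    return billing_day


ledger = {}
on_time = {"event_id": "e1", "event_day": date(2024, 1, 1), "watched_seconds": 10}
late = {"event_id": "e2", "event_day": date(2024, 1, 1), "watched_seconds": 10}

first = post_event(ledger, on_time, open_day=date(2024, 1, 1))
second = post_event(ledger, late, open_day=date(2024, 1, 2))    # day 1 already closed
replayed = post_event(ledger, late, open_day=date(2024, 1, 2))  # same event again
```

Keeping both `event_day` and `billing_day` on the record lets finance show the studio why minutes watched on one day were paid in the next.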

---

Source: DataDriven (https://datadriven.io).