# Three Providers, One Workout

> The same ride, reported three times.

Canonical URL: <https://datadriven.io/problems/three_providers_one_workout>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

Your platform receives fitness activity events from three external providers, each using a different data format and delivery mechanism. The same workout session can appear from more than one provider simultaneously. Design a pipeline that normalizes incoming events, resolves conflicts when the same session arrives from multiple sources, and routes clean data to downstream consumers.

## Worked solution and explanation

### Why this problem exists in real interviews

The prompt packs three constraints together: the same workout can arrive from multiple providers, the leaderboard has to reflect a workout within a minute, and PHI such as heart rate and GPS has to stay restricted, with audit logging on raw reads. The trap is a single stream that updates the leaderboard from every event without dedup, or a design that puts PHI on the same path everyone reads.

The default reach is one streaming consumer that updates the leaderboard from every provider event. When a user records the same workout on a hardware device and a wearable, both events arrive and the leaderboard shows two entries. Heart rate and GPS sit on the same table the leaderboard reads, so one direct query exposes PHI. When an audit asks who read raw event data, there is no record to answer with.

> **Trick to Solving**
>
> Canonical workout shape, dedup by user-and-workout, PHI restricted with audit logging.
> 
> 1. All three providers normalize to a canonical workout shape on the bus; downstream consumers read one schema.
> 2. The streaming consumer dedups on (user, workout window) so two providers reporting the same workout collapse to one leaderboard entry.
> 3. PHI fields (heart rate, GPS) live in a restricted table with column-level access and audit logging on every raw read.

---

### Walk the requirements

#### Step 1: Leaderboard reflects a workout within a minute

Workout events flow from three providers' webhooks onto an event bus and into a streaming consumer that updates the leaderboard within a minute. Without a streaming tier the user-facing reward feels delayed; without a bus the three webhooks have no fan-in point.
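
The fan-in only works if every webhook handler emits the same shape onto the bus. A minimal sketch of that mapping, assuming two hypothetical provider payload formats (the field names `start_epoch`, `started_at`, `km`, and so on are invented for illustration; the third provider would get its own adapter in the same pattern):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CanonicalWorkout:
    """The one schema downstream consumers read (field set is an assumption)."""
    user_id: str
    provider: str
    provider_workout_id: str
    workout_type: str
    start: datetime      # always timezone-aware UTC
    duration_s: int
    distance_m: float


def from_provider_a(p: dict) -> CanonicalWorkout:
    # Hypothetical provider A: epoch seconds, metres, free-case type.
    return CanonicalWorkout(
        user_id=p["user"],
        provider="a",
        provider_workout_id=p["id"],
        workout_type=p["type"].lower(),
        start=datetime.fromtimestamp(p["start_epoch"], tz=timezone.utc),
        duration_s=int(p["duration_s"]),
        distance_m=float(p["distance_m"]),
    )


def from_provider_b(p: dict) -> CanonicalWorkout:
    # Hypothetical provider B: ISO-8601 timestamps, minutes, kilometres.
    return CanonicalWorkout(
        user_id=p["athlete_id"],
        provider="b",
        provider_workout_id=p["activity_id"],
        workout_type=p["sport"].lower(),
        start=datetime.fromisoformat(p["started_at"]).astimezone(timezone.utc),
        duration_s=int(p["minutes"] * 60),
        distance_m=float(p["km"]) * 1000.0,
    )


# One adapter per provider; the webhook handler picks by source.
NORMALIZERS = {"a": from_provider_a, "b": from_provider_b}


def normalize(provider: str, payload: dict) -> CanonicalWorkout:
    return NORMALIZERS[provider](payload)
```

Converting units and timestamps at ingest means the streaming consumer can compare start times and distances across providers without knowing where an event came from.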

#### Step 2: Dedup by user-and-workout so duplicate-source events collapse

When a user records the same workout on a hardware device and a wearable, two provider events arrive. The streaming consumer dedups on (user_id, workout time window, type) so the leaderboard records one entry. The duplicate's data merges with the canonical entry (or a precedence rule chooses the higher-fidelity source). A 'count every event' design is the version where the leaderboard double-counts; the dedup is the contract.
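
The dedup rule can be sketched in a few lines. This is an illustration under stated assumptions, not a production state store: the 5-minute tolerance, the `FIDELITY` ranking, and the `Deduper` class are all hypothetical, and the sketch uses the precedence-rule variant rather than a field-level merge.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed tolerance: starts closer than this count as the same session.
WINDOW = timedelta(minutes=5)
# Hypothetical precedence: a higher number means a higher-fidelity source.
FIDELITY = {"hardware": 2, "wearable": 1}


@dataclass
class Workout:
    user_id: str
    workout_type: str
    start: datetime
    provider: str
    distance_m: float


class Deduper:
    """In-memory stand-in for the streaming consumer's per-user state."""

    def __init__(self):
        self._by_key = {}  # (user_id, workout_type) -> list of kept workouts

    def accept(self, event: Workout) -> bool:
        """Return True for a new leaderboard entry, False for a duplicate."""
        kept_list = self._by_key.setdefault((event.user_id, event.workout_type), [])
        for i, kept in enumerate(kept_list):
            if abs(kept.start - event.start) <= WINDOW:
                # Same session: keep whichever source ranks higher.
                if FIDELITY.get(event.provider, 0) > FIDELITY.get(kept.provider, 0):
                    kept_list[i] = event
                return False
        kept_list.append(event)
        return True

    def entries(self, user_id: str, workout_type: str) -> list:
        return list(self._by_key.get((user_id, workout_type), []))
```

Comparing start times with a tolerance, rather than bucketing them into fixed windows, avoids splitting one session across a bucket boundary. A real consumer would also expire old entries so the state stays bounded to active users.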

#### Step 3: PHI restricted with audit on raw reads

Heart rate and GPS are PHI. They live in a restricted table with column-level access tied to permissioned consumers (e.g., the user's own coach, once the user has consented). Raw reads write to an audit log so the audit can answer who saw what. The leaderboard reads aggregated, non-PHI columns. Putting PHI on the leaderboard's table is the version where one direct query exposes it; a restricted column with audit logging is the contract.
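
A rough sketch of the gate-plus-audit behavior, with an in-memory dictionary standing in for the restricted table. `PhiStore`, `grants`, and the audit record fields are hypothetical names; logging denied attempts as well as successful reads is an added assumption, not something the problem statement requires.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class AccessDenied(Exception):
    pass


@dataclass
class PhiStore:
    rows: dict    # workout_id -> {"user_id": ..., "heart_rate": ..., "gps_track": ...}
    grants: set   # (reader_id, user_id) pairs the user has consented to
    audit_log: list = field(default_factory=list)

    def read_raw(self, reader_id: str, workout_id: str, columns: list) -> dict:
        row = self.rows[workout_id]
        owner = row["user_id"]
        allowed = reader_id == owner or (reader_id, owner) in self.grants
        # Write the audit entry before returning anything, so every raw
        # read attempt is recorded whether or not it succeeds.
        self.audit_log.append({
            "reader_id": reader_id,
            "user_id": owner,
            "workout_id": workout_id,
            "columns": sorted(columns),
            "allowed": allowed,
            "at": datetime.now(timezone.utc),
        })
        if not allowed:
            raise AccessDenied(f"{reader_id} may not read PHI for {owner}")
        return {c: row[c] for c in columns}
```

The leaderboard never calls `read_raw`; it reads a separate aggregate that carries no PHI columns, so there is no query on its path that could return heart rate or GPS.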

---

### The shape that fits

> **What this design gives up**
>
> Canonical mapping requires every provider to map at ingest; user-and-workout dedup adds state per active user; PHI restriction requires permissioned access and audit logging on raw reads. Implementation cost is the price; the win is leaderboard within a minute, no double-count from multi-source workouts, and PHI that doesn't leak through the leaderboard's path.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An event bus carries canonical workout events from all three providers' webhooks.
> - A streaming path delivers leaderboard updates within a minute, deduped per user-and-workout.
> - PHI fields are restricted to permissioned consumers; raw reads write to an audit log.

> **The mistake that ships**
>
> What gets shipped runs one streaming consumer that updates the leaderboard from every event. The same workout from two providers shows up as two entries. PHI sits on the same table the leaderboard reads and a direct query exposes it. The eventual rebuild adds canonicalization, user-and-workout dedup, and the restricted PHI path with audit.

---

## Common follow-up questions

- A provider sends the same workout twice from a single device. What in this design protects the leaderboard? _(Tests whether the candidate sees that dedup also catches exact-duplicate events from one provider: for these, the dedup key is (user, workout_id), where workout_id is the provider's own id, so the second arrival is an idempotent no-op. Cross-provider dedup uses the user-and-window key instead.)_
- A coach asks for raw GPS for a user who consented. What does this design do, and what does the audit show? _(Tests whether the candidate sees the access policy as the gate: the coach's role permits the read, the consent is verified, the read writes an audit-log entry with the coach, the user, and the time. The audit can answer 'who saw what' for any later question.)_
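
For the first follow-up (a provider resending the same workout), the idempotent behavior might look like the sketch below. It assumes the key also carries the provider name so that ids from different providers cannot collide; `IdempotentSink` and its fields are hypothetical.

```python
class IdempotentSink:
    """Applies each (user, provider, provider workout id) at most once."""

    def __init__(self):
        self._seen = set()
        self.applied = []  # stand-in for leaderboard writes

    def apply(self, user_id: str, provider: str, provider_workout_id: str, points: int) -> bool:
        key = (user_id, provider, provider_workout_id)
        if key in self._seen:
            return False  # redelivery: no second leaderboard write
        self._seen.add(key)
        self.applied.append((user_id, points))
        return True
```

This check sits in front of the user-and-window dedup: exact redeliveries are dropped by id, and only first arrivals go on to the cross-provider comparison.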

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/three_providers_one_workout)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.