# The Patients We Cannot Move

> Patient data stays local. Insights have to be global.

Canonical URL: <https://datadriven.io/problems/the_patients_we_cannot_move>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We run federated machine learning across hospital networks for clinical trial research. Each hospital has patient data we're not allowed to move: privacy law and patient consent don't permit central aggregation. We need to train models and compute population statistics across data that is physically spread over 40 hospitals in 8 countries, each with different EHR systems and data formats. Design a data pipeline that makes this possible.

## Worked solution and explanation

### Why this problem exists in real interviews

Federated learning across hospitals turns a normal data pipeline inside out. The data doesn't move; the computation does. The trap is hidden in the requirements: even a perfectly federated training round can leak patient identity through small-cohort aggregates, and a published result years later has to reproduce from data nobody can see. The design has to respect those constraints from day one, or be rebuilt around them later.

The natural draw is to push training jobs to each hospital, collect gradients to a central server, and aggregate. The model trains. Then a study publishes a small-cohort analysis, and a privacy researcher demonstrates patient re-identification from the released aggregates. Three years later a regulator asks to reproduce a published result, and the team has the model but not the per-hospital data snapshots, the code version, or the gradient hashes that produced it. The federated property held; the reproducibility and re-identification properties didn't.

> **Trick to Solving**
>
> Compute travels to data, schema converges before training, small cohorts withheld, every round logged for years.
> 
> 1. An orchestrator coordinates federated rounds: it pushes the training task to each hospital, tracks completion, and aggregates returned gradients, but never receives raw records.
> 2. Each hospital normalizes its local EHR codes to a canonical clinical schema before training. The mapping lives at the hospital, not centrally.
> 3. An aggregate-release gate withholds or generalizes statistics from cohorts under a privacy threshold; nothing crosses the boundary that fails the check.
> 4. A per-study, per-hospital archive retains the data snapshot identifier, the code version, and the gradient hashes for the regulatory window. The study's reproducibility is a query, not a hunt.

---

### Walk the requirements

#### Step 1: Compute runs inside each hospital; only results leave

An orchestrator pushes the training round to each hospital's compute environment. Each hospital trains locally on its own data, returns gradients (or aggregates), and never sends raw records out. The central aggregator combines the gradients and starts the next round; if a hospital is down, the round either waits, proceeds without it, or is rescheduled per the orchestration rules. Without the orchestrator, nothing sequences the rounds or tracks which hospitals have reported; without the federated pattern, raw data crosses the boundary, which the law forbids.
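One round of that loop can be sketched in a few lines. This is a minimal illustration, not a production design: the `LocalUpdate` and `HospitalClient` shapes, the `min_participants` quorum, and the sample-weighted averaging (a FedAvg-style choice) are all assumptions made for the example, since the problem statement only says that gradients are returned and aggregated.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LocalUpdate:
    hospital_id: str
    gradient: list[float]  # update vector computed on-site; no raw records
    n_samples: int         # local sample count, used only as an averaging weight


# Hypothetical client interface: takes the current weights, trains locally,
# and returns an update, or None when the site is unreachable this round.
HospitalClient = Callable[[list[float]], Optional[LocalUpdate]]


def run_round(weights, clients, min_participants, learning_rate=0.1):
    """Run one federated round; return (new_weights, participating_ids)."""
    updates = []
    for client in clients.values():
        update = client(weights)
        if update is not None:
            updates.append(update)

    # Assumed rule: below quorum, leave the model unchanged so the
    # orchestrator can wait or reschedule the round.
    if len(updates) < min_participants:
        return list(weights), []

    # Sample-weighted mean of the returned gradients.
    total = sum(u.n_samples for u in updates)
    averaged = [
        sum(u.gradient[i] * u.n_samples for u in updates) / total
        for i in range(len(weights))
    ]
    new_weights = [w - learning_rate * g for w, g in zip(weights, averaged)]
    return new_weights, sorted(u.hospital_id for u in updates)
```

The function returns the list of hospitals that actually contributed, so the orchestrator has a per-round participation record to keep alongside the model.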

#### Step 2: Each hospital maps local EHR codes to a canonical schema

Cross-site research needs the same concept to mean the same thing in every hospital. Each hospital owns a mapping from its local EHR codes to one canonical clinical schema (SNOMED CT, LOINC, or a study-specific schema). The mapping is applied locally before training, so the gradients reflect canonical semantics. Harmonizing centrally would mean moving data; harmonizing at each hospital, inside the federated round, keeps the data in place while making the results comparable.
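A hospital-side normalization step might look like the sketch below. The mapping table, the local codes, and the `CANON:` identifiers are placeholders invented for illustration; they are not real SNOMED CT or LOINC codes. Excluding unmapped codes and reporting them, rather than guessing a mapping, is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalObservation:
    local_code: str  # whatever the site's EHR emits
    value: float


# Hypothetical site-owned mapping. It lives at the hospital and is applied
# there; both the local and the canonical codes are placeholders.
LOCAL_TO_CANONICAL = {
    "GLU_SER": "CANON:GLUCOSE",
    "LAB-0042": "CANON:GLUCOSE",
    "HGB": "CANON:HEMOGLOBIN",
}


def normalize(observations, mapping):
    """Return (mapped, unmapped_codes) without ever guessing a mapping."""
    mapped = []
    unmapped = []
    for obs in observations:
        canonical = mapping.get(obs.local_code)
        if canonical is None:
            # Keep the code for the site's mapping owner to review;
            # the observation is left out of this training round.
            unmapped.append(obs.local_code)
        else:
            mapped.append((canonical, obs.value))
    return mapped, unmapped
```

Only the `mapped` rows would feed local training; the `unmapped` list stays on-site as a work queue for whoever maintains the mapping.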

#### Step 3: Hold back small-cohort aggregates so they can't re-identify

Even aggregate counts can re-identify patients in small cohorts. An aggregate-release gate sits between each hospital's local computation and what leaves the boundary: aggregates over cohorts smaller than a privacy threshold are withheld, generalized to a coarser bucket, or noised (e.g. with a differential privacy mechanism). Nothing that fails the check crosses the boundary. Without the gate, even a 'just counts' result can leak.
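The withhold-or-generalize part of the gate can be illustrated with bucketed counts. In this sketch the threshold value, the bucket labels, and the `parent_of` roll-up function are assumptions; noise mechanisms are left out, and the sketch does not defend against differencing across multiple releases.

```python
def gate_counts(counts, threshold, parent_of):
    """Apply a small-cohort gate to bucketed counts.

    counts:    {fine_bucket: n}
    parent_of: maps a fine bucket to its coarser parent bucket
    Returns (released, withheld_parents).
    """
    # Group fine buckets under their coarser parent.
    groups = {}
    for bucket, n in counts.items():
        groups.setdefault(parent_of(bucket), {})[bucket] = n

    released = {}
    withheld = []
    for parent, children in groups.items():
        total = sum(children.values())
        if all(n >= threshold for n in children.values()):
            # Every fine cell is large enough: release as-is.
            released.update(children)
        elif total >= threshold:
            # Some cell is too small: release only the coarser bucket.
            released[parent] = total
        else:
            # Even the coarser bucket is too small: nothing leaves.
            withheld.append(parent)
    return released, withheld
```

When any fine cell in a group is too small, the whole group is rolled up, so a released coarse count is never published next to the fine cells that would let a reader subtract their way back to the small one.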

#### Step 4: Per-study reproducibility archive that survives the retention window

Each round writes per-study, per-hospital records to a durable archive: the data snapshot identifier (a content hash of the local dataset at training time), the code version, the gradient hash, and the round id. Records are retained for the regulatory window. When a regulator asks to reproduce a published result years later, the archive points at the exact data version and code that produced it. Without the archive, reproducibility becomes a forensic exercise that can't be completed; with it, the answer is a query.
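A record of that shape, and the query that answers the regulator, could be sketched as follows. The field names, the use of SHA-256 over JSON, and the in-memory list standing in for a durable archive are assumptions for illustration. The snapshot hash would be computed inside the hospital; only the digest reaches the archive.

```python
import hashlib
import json
from dataclasses import dataclass


def sha256_of(obj) -> str:
    """Hex digest of a JSON-serializable object, with keys sorted."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def snapshot_id(rows) -> str:
    """Order-independent content hash of a local dataset.

    Each row is hashed, the digests are sorted, and the sorted list is
    hashed again, so row order on disk does not change the identifier.
    """
    return sha256_of(sorted(sha256_of(row) for row in rows))


@dataclass(frozen=True)
class RoundRecord:
    study_id: str
    hospital_id: str
    round_id: int
    snapshot_id: str    # digest only; the rows never leave the hospital
    code_version: str   # e.g. a commit identifier (placeholder here)
    gradient_hash: str


def records_for(archive, study_id):
    """All records for one study, ordered by round and hospital."""
    matches = (r for r in archive if r.study_id == study_id)
    return sorted(matches, key=lambda r: (r.round_id, r.hospital_id))
```

To reproduce a round, each hospital recomputes `snapshot_id` on its retained snapshot and compares it with the archived digest before rerunning the archived code version.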

---

### The shape that fits

> **What this design gives up**
>
> Federated training is more complex than central training. Aggregate-release gates introduce per-cohort thresholds (and noise) that make some research questions answerable and others not. Per-study archives across many hospitals over a long retention window cost real storage. Model-training simplicity and full statistical access are what get sacrificed; in return, the design delivers a system the law allows, results that don't re-identify, and studies that reproduce when a regulator asks.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer coordinates federated training rounds across hospitals; raw data never leaves a hospital boundary.
> - Per-study archive holds snapshot ids, code versions, and gradient hashes for the regulatory retention window.

> **The mistake that ships**
>
> The build that ships pushes training jobs to each hospital, collects gradients centrally, releases aggregates without a privacy gate, and treats reproducibility as 'we have git history.' A published study's small-cohort statistic gets re-identified by an outside researcher, the institutions take a privacy finding, and a regulator asks to reproduce a different published result; the team has the model but no record of which data snapshot or code version produced it. The eventual rebuild is the release gate and the per-study archive. One arrives after a privacy researcher demonstrates re-identification publicly; the other arrives after the regulator's reproduction request.

---

## Common follow-up questions

- A hospital wants to drop out of an in-progress study. What in this design lets them, and what changes for the model already trained? _(Tests whether the candidate sees that the orchestrator can exclude the hospital from the next round, the model retains the gradients it already received, and the published result either keeps the dropped hospital's prior contribution or is reanalyzed without them. The federated pattern doesn't require all hospitals to stay; the archive records who participated.)_
- A regulator subpoenas the raw patient data behind a published result. Where does the design send them, and where does it not? _(Tests whether the candidate sees that the raw data lives only in each hospital's environment, governed by local law and consent. The archive shows what was used (snapshot id, code version, gradient hash) but the raw data stays in the hospital. A subpoena is served to the hospital, not to the central platform.)_
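For the drop-out question, the orchestrator-side bookkeeping can be very small. The sketch below is one hypothetical way to track it: the `StudyRoster` class and its "last round contributed" semantics are assumptions, chosen so that earlier rounds still list the hospital while later rounds exclude it.

```python
from dataclasses import dataclass, field


@dataclass
class StudyRoster:
    enrolled: set[str]
    # hospital_id -> last round in which the hospital contributes
    withdrawn_after: dict[str, int] = field(default_factory=dict)

    def withdraw(self, hospital_id: str, last_round: int) -> None:
        """Record a withdrawal; rounds up to last_round are unaffected."""
        if hospital_id not in self.enrolled:
            raise KeyError(hospital_id)
        self.withdrawn_after[hospital_id] = last_round

    def participants(self, round_id: int) -> list[str]:
        """Hospitals the orchestrator should include in a given round."""
        return sorted(
            h for h in self.enrolled
            if round_id <= self.withdrawn_after.get(h, round_id)
        )
```

Because withdrawals are recorded rather than applied by deleting the hospital, the roster can still answer who contributed to any past round.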

---

Source: DataDriven (https://datadriven.io).