# Thousands of Practices, One Dataset

> Patient records in, operational insights out.

Canonical URL: <https://datadriven.io/problems/thousands_of_practices_one_dataset>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We operate a cloud-based EHR platform used by thousands of medical practices, and each practice's patient and clinical data is stored in a shared database with tenant isolation enforced at the application layer. Our analytics team needs to run cross-tenant population health analytics and individual practice performance reporting without exposing one practice's data to another. Design the pipeline architecture that supports both use cases with proper tenant isolation.

## Worked solution and explanation

### Why this problem exists in real interviews

A multi-tenant EHR pipeline is multi-tenant security plus per-tenant SLA plus a population-health workload that legally cannot see real patients. The trap is that any one of these is solvable, and putting them in one warehouse without tenant-aware machinery means the largest practices steal capacity from the smallest, and the analytics team gets visibility into PHI through a misconfigured query.

The default move is one nightly job that pulls everyone's data into one big table, with `WHERE practice_id =` on every dashboard query. Big practices' nightly load runs long and small practices' data is delayed past the morning dashboard. The epidemiology team gets access to the same table 'just temporarily' and ends up with PHI they shouldn't have. A query without the practice filter pulls another practice's patients, which is a HIPAA breach by lunchtime.

> **Trick to Solving**
>
> Per-tenant capture, per-tenant SLA, row-level security plus pseudonymization, no shared queue for big and small.
> 
> 1. Per-practice CDC paths (or partitioned consumers) so a big practice's backlog doesn't park behind itself and block small practices.
> 2. The warehouse holds practice rows under row-level security; population-health queries hit a separate, pseudonymized view that never reveals real identifiers.
> 3. Pseudonymization happens before the population-health path sees the data; the mapping lives in a restricted store nobody else can read.
> 4. T+1 SLA is per-practice, not just for the largest. Per-tenant ingest paths means each practice's freshness is independent.

---

### Walk the requirements

#### Step 1: Tenant boundary held by row-level security at the warehouse

The warehouse enforces row-level security on every practice-scoped table: a practice's user sees only their own rows, regardless of the query. The application layer's filter doesn't matter once data leaves the OLTP; the warehouse owns the boundary now. A 'we'll filter in the BI tool' approach is one forgotten clause from a HIPAA breach. Without a warehouse tier the policy has nowhere to be enforced; without sub-minute capture the warehouse is too stale for the morning dashboards.

#### Step 2: Pseudonymized population-health view with no real identifiers

Epidemiology runs cross-practice research. They get a separate view fed from the same underlying data, but with patient identifiers pseudonymized: deterministic hashes (or generated tokens) instead of MRN, name, DOB. The mapping table from real to pseudo lives in a restricted store the population-health team cannot read. Their queries see anonymous cohorts; if a regulator asks 'who is patient X really,' the answer is found through a controlled, audited path against the mapping store.

#### Step 3: Per-practice ingest so a big tenant can't starve a small one

A handful of large practices generate most of the volume. Per-practice CDC paths (or per-practice partitioning on a shared stream, with the ingest workers scaled per partition) ensure a backlog from a big practice parks behind itself, not in front of the small practice that needs T+1 freshness too. The orchestrator schedules per-practice tasks; alerts are per-practice so on-call can tell which tenant is slow. A single shared queue is the version where small practices wait while a big practice catches up.

#### Step 4: Each practice's prior-day data lands before morning

T+1 freshness has to hold for every practice, not just the largest. The orchestrator's per-practice DAG runs each tenant's daily aggregation overnight, with sensors that fire before the morning dashboard window if any practice is at risk. Alerts page on-call by name of the practice; on-call has hours to intervene rather than minutes. A 'we'll get to it after the big practices finish' design is the version where small-practice managers open empty dashboards.

---

### The shape that fits

> **What this design gives up**
>
> Per-practice CDC and per-practice DAG nodes is more orchestration than a single nightly job. Pseudonymization adds a separate restricted mapping store and a transformation step on every population-health refresh. Row-level security adds query-rewrite cost on every read. Operational simplicity is the cost; the win is a tenant boundary that survives HIPAA, a population-health view that doesn't expose PHI, and a freshness SLA the smallest practice can rely on.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Per-tenant change-data-capture isolates a slow tenant from blocking the others.
> - Row-level security on the warehouse enforces practice boundaries; population-health queries hit a separately pseudonymized view.

> **The mistake that ships**
>
> The first version out the door pulls everyone's data into one warehouse table with `WHERE practice_id =` enforced in BI, and gives epidemiology read access 'just for now.' One missing filter returns another practice's patients in a CSV export, which is a HIPAA breach with paperwork. A big practice's nightly load runs long and small practices' dashboards stay empty until midmorning. Population-health analyses run on real PHI because nobody pseudonymized first. The team rebuilds with per-practice ingest, row-level security, and a pseudonymized view. The platform takes a HIPAA finding, the small practices spend a quarter without trustworthy dashboards, and the eventual rebuild touches ingest, warehouse, and access policy together.

---

## Common follow-up questions

- A regulator asks: prove that population-health researchers cannot reconstruct a patient's identity from the view they have. What in this design backs that up? _(Tests whether the candidate sees that the pseudonymizer's mapping store is the only path back to identity, and access to that store is audited and restricted to a separate role. The view itself contains no quasi-identifiers strong enough to re-identify; if it did, you'd need additional anonymity checks before exposure.)_
- A small practice's data is stuck for hours behind a backfill from a large practice that just onboarded. What in this design lets that not happen, and what would you change if it did? _(Tests whether the candidate sees per-practice ingest pools (separate workers, separate queues) as the prevention mechanism. If the failure happens anyway, the fix is moving the backfill to a backfill-only worker pool so the steady-state per-practice ingest stays unaffected.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/thousands_of_practices_one_dataset)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.