# Real Data, Fake Patients

> Dev needs production data. HIPAA says absolutely not.

Canonical URL: <https://datadriven.io/problems/real_data_fake_patients>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

Our data engineering team works with protected health information in production and we need a way to provide engineers with realistic data in development and testing environments without exposing real patient records. Design a pipeline that ingests from production, de-identifies PHI, and delivers a statistically representative synthetic dataset to the dev environment.

## Worked solution and explanation

### Why this problem exists in real interviews

De-identification looks like a one-shot transformation problem and isn't. The trap is treating it as 'replace identifiers with random values.' Random values fail engineers immediately because joins break: the same patient id has different tokens in two tables and tests pass on broken code. And the schema is a moving target, so a new sensitive column can quietly land in dev next week if classification isn't a gate.

The simple answer is a one-time job that copies prod, replaces every identifier with a random value, and writes to dev. Joins between two tables on patient id come back empty because the random replacement was independent per table. The dev environment passes its tests on data that doesn't represent prod's join behavior. A week later, a new column lands in prod, the copy job picks it up, and engineers in dev are reading raw values that turn out to be sensitive.

> **Trick to Solving**
>
> Tokenize identifiers consistently across tables, classify every column before it ships to dev, and audit every run.
> 
> 1. Tokenization uses a stable mapping from real identifier to synthetic identifier. A patient's token in one table is the same token in every other table. The mapping lives in a vault no engineer can reach.
> 2. Schema changes go through classification before they ship to dev: a new column is either approved (with a tokenization or scrub rule) or blocked. Unclassified fields don't flow.
> 3. An orchestrator coordinates the per-run dependency order (parents before children for FK consistency), runs PHI scanning, and writes a per-run audit log to a queryable archive.

---

### Walk the requirements

#### Step 1: Synthetic identifiers, mapping in a vault no engineer can read

Every PHI value is replaced with a synthetic value. Stable identifiers (patient id, provider id, encounter id) use a deterministic mapping so the same real id always produces the same synthetic id. The mapping table lives in a restricted vault in production accounts that engineers cannot read; the dev dataset only ever holds tokens. A 'random per row' replacement is the version where joins break in dev and tests pass on broken code.

#### Step 2: Consistent tokenization so joins behave like prod

If patient_id 12345 is tokenized to 'pat_abc' in the patients table, the same 12345 is tokenized to 'pat_abc' in the encounters table, the labs table, every related table. The mapping is the contract. Engineers run a join on patient_id in dev and get the same shape they'd get in prod, just with synthetic values. A 'each table tokenizes independently' design is what makes the tests lie.

#### Step 3: Classify new columns before they ship to dev

Production schema changes weekly. Before any new column reaches dev, it goes through classification: PHI rules are checked (with both a column-name allowlist and a content-pattern scan), and the column is either marked safe, marked sensitive (with a tokenization or scrub rule attached), or blocked. Unclassified columns don't flow to dev; the de-id pipeline halts and pages, and the schema review unblocks it. A 'pass through unknown columns' default is the version where a new sensitive field lands in dev quietly.

#### Step 4: Auditable per-run record with no PHI in the audit log

Every de-id run writes an audit record: timestamp, source schema version, classification rules applied, row counts, scan results, the run id of the produced dataset. Retention is the HIPAA window. The audit log itself contains no PHI; it references row counts and rule ids, not patient values. When a regulator asks for proof that a dev dataset was de-identified on a given date, the answer is a query on the archive. Without an orchestration layer there's nothing capturing the audit per run; without a cold-storage tier the audit has nowhere to live.

---

### The shape that fits

> **What this design gives up**
>
> A consistent-tokenization vault is more infrastructure than 'replace with random.' Classification-before-dev means schema changes have a review step and a delay; an unclassified column blocks the next dev refresh until it's reviewed. An audit log per run grows linearly with refresh cadence. Speed of dev refresh is the cost; what arrives is joins that work, no PHI leaking on schema drift, and an audit story that survives a HIPAA review.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestrator runs each de-identification refresh in dependency order, with a classification gate that blocks any unclassified column from reaching dev.
> - The de-identified dev dataset and a per-run audit archive both live in cold storage; the original-to-token vault is in production where engineers can't reach it.

> **The mistake that ships**
>
> What gets built first does a one-time copy with random replacement and trusts engineers not to run integration tests against the dev dataset. Joins on patient_id come back empty in dev; tests pass on broken code that fails in prod. A new sensitive column lands in prod the next week, the copy job picks it up, and engineers spend hours with raw values that should have been blocked. The team rebuilds with consistent tokenization, classification gates, and audit logging. The HIPAA finding is the part that doesn't go away when the rebuild ships.

---

## Common follow-up questions

- An engineer asks for a specific dev row tied to a specific real patient. What in this design lets them, and what doesn't? _(Tests whether the candidate sees the vault as the only path back from synthetic to real, and access to the vault is restricted to compliance/clinical roles. Engineers can't reach the vault even if they ask politely; the boundary is a property of the platform, not goodwill.)_
- A regulator asks: prove that the dev dataset from a given date contained no real PHI. What in this design produces that proof? _(Tests whether the candidate sees the audit archive plus the PHI scanner's results as the proof: the audit record references the run, the scanner's results, and the dataset id. The dev dataset itself is queryable to confirm only synthetic values; the audit log says how that was guaranteed.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/real_data_fake_patients)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.