# Every Dataset Needs a Paper Trail

> The FDA has opinions about your data pipeline.

Canonical URL: <https://datadriven.io/problems/every_dataset_needs_a_paper_trail>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We're a pharmaceutical company ingesting data from clinical trial systems, commercial sales databases, and patient support program feeds. The data governance team has mandated that every dataset entering the warehouse must have a documented data quality check, a lineage trace, and an access control policy before it goes live. Design the ingestion pipeline and governance framework.

## Worked solution and explanation

### Why this problem exists in real interviews

Pharma warehousing under regulatory scrutiny: FDA wants column-level lineage, HIPAA wants pseudonymization, four consumer tiers want different slices, and pharmacovigilance can't wait for tomorrow's batch. The trap is treating governance as paperwork bolted on after the pipeline runs; what's needed is governance as architecture, not as a wiki page.

The default reach is to wire each source to the warehouse, write masking and quality logic in transforms, and ask each team to filter at query time. FDA asks for the lineage of a number on a regulatory submission and the team starts grepping code. PHI lands in the warehouse before masking happens because the masking lives in a downstream transform. Public-summary users query the table directly and see fields they shouldn't. Adverse events flow through the same nightly path as sales and pharmacovigilance finds out about a serious event the next morning.

> **Trick to Solving**
>
> Lineage and pseudonymization at ingestion, per-tier access enforced at the warehouse, adverse events on a faster path.
> 
> 1. Pseudonymization runs at the connector or in the staging step, before any warehouse table holds the row; the mapping vault is in a separate restricted environment.
> 2. Lineage is a column on each row (source record id, ingestion timestamp, transform version), so a regulator's question is a SQL query.
> 3. Per-tier access (public, commercial, finance, patient research) is enforced by the warehouse with row-and-column policies tied to the user's role.
> 4. Adverse events ride a streaming path that lands them in the warehouse within hours; commercial sales runs on a slower cadence.

---

### Walk the requirements

#### Step 1: Lineage on every row, queryable for FDA

FDA's question is 'what produced this number?' Each warehouse row carries lineage columns: source record id, ingestion timestamp, transform version, quality-check pass timestamp. A regulator's lookup is a SQL query that resolves the value back to the source record. Lineage in a wiki page is the version that fails an inspection because nobody can demonstrate the trace. Without a warehouse tier the lineage columns have nowhere to live; without a quality-check gate there's no recorded pass to attach to the row.

#### Step 2: Pseudonymize at ingestion; the vault is somewhere engineers can't reach

Clinical-trial PHI can't sit in the warehouse. Pseudonymization runs at the connector or in the staging step, before any warehouse table holds the data: identifiers become deterministic tokens, free-text fields are scrubbed. The mapping from real to token lives in a vault in a separate restricted environment. Engineers operating the warehouse can't reach the vault; HIPAA-authorized roles do, through an audited path. Masking in a warehouse view is the version where the underlying tables hold raw PHI an assessor will find.

#### Step 3: Per-tier access enforced at the warehouse

Public summaries, sales reps, national finance, and patient-level researchers each see different slices. The warehouse owns the policy: row-and-column rules tied to each user's role return only what the role is authorized to read. A 'we'll filter in the BI tool' design is one forgotten clause from a tier seeing data from a higher tier. The platform enforces the boundary regardless of how the query is written.

#### Step 4: Adverse events on a faster path

Pharmacovigilance reads adverse-event records and has to see them within the operational window. A streaming path tails the trial system's adverse-event feed and lands rows in the warehouse within hours; commercial sales and other sources run on a slower batch cadence. Forcing adverse events onto the slower batch is the version where pharmacovigilance finds out about a serious event the next morning, which is exactly the failure the requirement is calling out.

---

### The shape that fits

> **What this design gives up**
>
> Lineage columns roughly double the metadata footprint per row; pseudonymization at ingestion adds a vault and a connector hop; per-tier policies are config that has to be reviewed when teams or roles change; the streaming adverse-event path is a separate ingestion the team has to operate. Implementation cost is the price; the win is FDA inspections that pass a query rather than a search, HIPAA audits that find tokens instead of PHI, tier boundaries the warehouse enforces, and pharmacovigilance that doesn't read adverse events the next morning.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Each row in the warehouse carries lineage columns that resolve back to the source record.
> - PHI is pseudonymized at ingestion; the original-to-token vault lives in a separate restricted environment data engineers cannot reach.
> - Per-tier row-and-column access is enforced at the warehouse regardless of how the query is written.
> - A streaming path delivers adverse events within hours; commercial sales runs on a slower batch cadence.

> **The mistake that ships**
>
> What gets shipped wires sources straight to the warehouse with masking in transforms, no per-row lineage, and 'BI tool filtering' for tier separation. FDA asks for the lineage of a regulatory number and the team grep-archaeologys the code. A HIPAA assessor finds raw PHI in the warehouse's underlying tables because the masking transform ran after a window where the raw rows existed. A sales rep runs a query without a tier filter and pulls patient-level data into a CSV. Pharmacovigilance finds out about a serious adverse event the next morning. The eventual rebuild adds pseudonymization at ingestion, lineage columns, warehouse-enforced access, and a streaming adverse-event path; each was a property the original cut treated as paperwork.

---

## Common follow-up questions

- FDA asks for the lineage of a number that was computed from three source records across two trials. What's the query, and where does it run? _(Tests whether the candidate sees lineage as an aggregate of source-record ids on the warehouse row: the query unions the lineage columns across the inputs that produced the number, returning a list of source ids. The audited number is a single SQL query, not a forensic search; the trail lives in the data, not in a wiki.)_
- A new consumer tier (an external regulator portal) needs read access to a specific subset of warehouse rows. What in this design extends, and what doesn't? _(Tests whether the candidate sees the access-policy layer as the extension point: a new role with row-and-column rules scoped to the regulator's subset, no changes to the warehouse layout, the pseudonymization, or the streaming path. The platform handles the new tier through configuration.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/every_dataset_needs_a_paper_trail)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.