# Every Format Imaginable

> PDFs, HL7, JSON. All of it lands in the same lake.

Canonical URL: <https://datadriven.io/problems/every_format_imaginable>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

We're a health data company that aggregates records from hospitals, clinics, and labs. The data comes in every format imaginable: structured claims data, semi-structured HL7 messages, PDF lab reports, and free-text clinical notes. We need all of it in one place where analysts can query it. Design the data lake pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

Four sources in four formats with two consumer types and a privacy boundary that has to hold against everyone but the clinical team. The trap is treating the format problem as the hard problem: 'parse them all, write to a table' is the easy half. The hard half is making the canonical record cheap to query for analysts, fast enough for clinical decision support, and de-identified for research without ever exposing PHI.

The default draw is one ETL pipeline per format that lands rows in one big table. Analysts can query, but clinical decision support is reading the same table on the same hourly cadence and missing minutes-fresh hospital events. Researchers query the same table and the access policy is 'we'll trust them not to look at PHI columns,' which is the access posture that fails the next privacy review.

> **Trick to Solving**
>
> Per-format parsing into a canonical record, two paths sized for two consumers, PHI on a restricted column with researchers reading a separate de-identified view.
> 
> 1. Each source format gets its own parser; all of them produce the same canonical patient-event schema. Analysts query the canonical view, not four shapes.
> 2. Raw files (HL7, PDF, free-text notes) stay in the data lake unchanged for re-parse and audit. Reprocessing a parser change runs against the raw, not against a derived table.
> 3. Hospital events ride a streaming path to a clinical-decision-support store; the other three formats land in batch on a slower cadence into the warehouse.
> 4. PHI is on a restricted column, with researchers reading a separate de-identified view; the platform enforces access regardless of how the query is written.

---

### Walk the requirements

#### Step 1: Canonical patient-event record across four formats

Each source has its own parser: claims into structured rows, HL7 into structured rows, PDF lab reports into OCR'd structured rows, clinical notes into structured + free-text fields. All four parsers emit the same canonical patient-event schema. Analysts query the canonical view; the four-format problem is upstream. Without a warehouse/lakehouse tier the canonical view has nowhere to live; without keeping raw files in cold storage you can't re-parse when a parser improves.

#### Step 2: Restricted PHI column with researchers on a de-identified view

Raw PHI is on a restricted column accessible only to the clinical team. A separate de-identified view feeds research consumers, with hashed identifiers and any fields that can re-identify either masked or coarsened. The platform enforces who can read which view; researchers can't reach the restricted column regardless of how they write the query. A 'trust people to filter PHI in their query' approach is what fails a privacy review.

#### Step 3: Hospital events on a streaming path; the rest batch

Clinical decision support reads hospital events within minutes; other sources tolerate a day. Hospital events flow through a streaming consumer that updates a CDS store within a minute; claims, PDFs, and notes flow through batch loaders into the warehouse on a slower cadence. Two paths off four sources, sized to the consumer. Forcing all four onto the streaming path is over-engineering; forcing CDS to read from the warehouse misses the freshness it needs.

---

### The shape that fits

> **What this design gives up**
>
> Four parsers plus a streaming path plus restricted-column access plus a de-identified view is more pieces than 'one ETL.' The canonical schema has to be designed and maintained as the formats evolve. Pipeline simplicity is what gets sacrificed; in return, analysts can ask one question across four sources, CDS lands within minutes, and the privacy boundary holds.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Raw files stay in cold storage so reprocessing a parser improvement doesn't require a re-export from sources.
> - A streaming path serves clinical decision support within minutes; the unified canonical record is queried in a warehouse with column-level access policies.

> **The mistake that ships**
>
> The first version out the door uses one ETL per format into one warehouse table, runs everything on the same hourly cadence, and 'masks' PHI by trusting researchers not to query the columns. CDS misses minutes-fresh hospital events because it's reading the hourly batch table. A privacy review finds a researcher's query that pulled raw identifiers and the team takes a finding. A parser bug means re-parsing a year of files, and the team realizes nothing kept the raw originals. The eventual rebuild is per-format parsers, two paths, restricted columns, and a raw-lake archive. The privacy review and the unparseable backlog hit at the same time, and CDS is offline for hours during the worst of it.

---

## Common follow-up questions

- A new parser improves PDF extraction quality. What in this design lets you reprocess past files, and what doesn't? _(Tests whether the candidate sees the raw lake as the reprocessing source: the new parser runs over the raw files and overwrites the PDF-derived rows in the canonical warehouse. The streaming CDS path doesn't apply (it's only for hospital events).)_
- Researchers ask for a cohort across claims and clinical notes that requires joining on a quasi-identifier (e.g. provider id + date of service). Where does this design draw the line? _(Tests whether the candidate sees that quasi-identifiers can re-identify in small cohorts, even on a 'de-identified' view: the access policy or a stricter export gate has to enforce minimum cohort sizes or apply additional generalisation. The view alone isn't enough.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/every_format_imaginable)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.