# Nightly Exports Are Too Slow

> Healthcare claims change constantly. The warehouse cannot fall behind.

Canonical URL: <https://datadriven.io/problems/nightly_exports_are_too_slow>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

Our healthcare analytics platform needs near-real-time access to claims and member data that lives in several operational databases. We have been using nightly full exports, but this is too slow for utilization management teams and creating performance problems on the source systems. Design a CDC-based replication pipeline using PySpark that keeps the warehouse current without impacting production.

## Worked solution and explanation

### Why this problem exists in real interviews

The named problem is the nightly export missing UM's window, but the harder constraint is the database teams refusing any new load and HIPAA refusing any raw PHI in the warehouse. The trap is replacing nightly with hourly polling: faster, but still query load on production, and PHI still ends up in the warehouse if masking is an afterthought.

The natural shape is hourly SELECTs against the source databases that pull only changed rows. UM is happier; database teams notice the new load and ask the data team to back off. Masking gets done in the warehouse via views, which means raw PHI lives in the underlying tables, and the first HIPAA assessor finds it. A consumer stalls and the source database's CDC slot grows quietly until disk pressure causes its own incident.

> **Trick to Solving**
>
> Log-based CDC, mask before the warehouse, monitor replication lag and bound the slot.
> 
> 1. Log-based CDC reads the WAL/binlog the database is already writing; zero query load on the OLTP. The database team has refused the alternative.
> 2. PHI is masked or pseudonymized at the connector or in the staging step, before the warehouse holds it. The warehouse only ever holds the de-identified form.
> 3. The replication slot has a bounded retention; if a consumer stalls past the bound, the connector alerts before it threatens the source. Unbounded slot growth on a stalled consumer is the failure mode the database team is worried about.

---

### Walk the requirements

#### Step 1: Adjudicated claims to the warehouse within hours, on a streaming path

CDC events for adjudicated claims flow into a stream that lands them in the warehouse within hours, not at the end of the night. UM reads the warehouse for same-day authorization calls. A six-hour nightly export is the named problem; without a streaming/CDC path the requirement is unaddressed. Without a warehouse tier the analytics layer has nowhere to live.

#### Step 2: Mask PHI before the warehouse, not in views over it

Raw PHI can't sit in the analytics warehouse. The masking step runs at the connector or in the staging transform, before the warehouse load: identifiers become hashed tokens or are dropped, free-text fields are scrubbed. The warehouse holds only the de-identified form. A 'mask in BI views' approach leaves raw PHI in the underlying tables an assessor will query directly. Masking belongs at the boundary, before the warehouse.

#### Step 3: Log-based CDC with bounded replication slots

Database teams have refused query load and write load. Log-based CDC reads the WAL/binlog and adds zero load on the OLTP; that's the configuration the database team will sign for. The replication slot has a bounded retention with monitoring: when a consumer stalls and the slot grows past a threshold, the connector alerts before the slot threatens the source database. Without the bound, a stalled consumer is the failure the database team will reject the design over.

---

### The shape that fits

> **What this design gives up**
>
> Log-based CDC with masking-before-warehouse is more pieces than a nightly export: connectors to operate, replication slots to monitor, a masking step that has to be tested. The bounded slot means a stalled consumer can lose events past the bound, so consumer reliability matters more. Operational complexity is the cost; the win is UM's window being met, no PHI in the warehouse, and a database team that will sign off.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A change-data-capture path off the production log adds zero query load.
> - Adjudicated claims land in a warehouse where PHI has been masked at the boundary, never in the warehouse rows.

> **The mistake that ships**
>
> The design the team ships uses hourly SELECTs against production for incremental pulls and masks PHI in BI views. Database teams notice the new load and ask the data team to back off. A HIPAA assessor finds raw PHI in the warehouse's underlying tables. A consumer stalls during a deploy and the source database's CDC slot grows past disk capacity, taking the OLTP down with it. The team rebuilds with log-based CDC, masking at the boundary, and slot bounds with alerting. The OLTP outage and the HIPAA finding both predate the rebuild; either one alone would have forced it.

---

## Common follow-up questions

- An auditor asks for the lookup of a specific patient's claim history. Where do they go, and where do they not? _(Tests whether the candidate keeps PHI scope tight: raw patient data lives only in the source databases under existing controls. The warehouse holds only de-identified records. The auditor's lookup goes through a controlled, audited path at the source, not through analytics.)_
- The masking step changes (a new identifier needs scrubbing). What in this design needs to be redone, and what doesn't? _(Tests whether the candidate sees the masker as the extension point: a new rule is a code change with a backfill of recently-loaded data. The CDC capture, the stream, and the warehouse layout don't change; only the masking step and the affected partitions do.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/nightly_exports_are_too_slow)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.