# The Analysts Cannot Touch Production

> Production is the source. Analytics needs its own copy.

Canonical URL: <https://datadriven.io/problems/the_analysts_cannot_touch_production>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

Our product runs on a transactional database that the analytics team cannot query directly. We need to replicate it to a separate analytics system where they can run reports. The source data is messy and split across several tables that need to be normalized and combined before it is useful. Design a pipeline to replicate and prepare this data.

## Worked solution and explanation

### Why this problem exists in real interviews

Analytics needs the production data without touching production. Translation: zero query load, zero new writes, no email or phone sitting in the warehouse, and dashboards have to be ready by 4am or analysts open empty pages. The trap is treating this as 'set up nightly extracts.' Nightly extracts are queries, queries are load, and the product team has already said no.

The natural draw is a nightly job that SELECTs from production and writes to the warehouse. The DBA notices the read load on the OLTP at 2am and asks the data team to back off. The team moves the extract to 1am, which works for a week and then the production maintenance window runs over and the extract slides into 5am, after the analysts have already opened the dashboards. Meanwhile, raw email and phone are sitting in the warehouse because masking was a stage nobody got to.

> **Trick to Solving**
>
> Read the change log, not the tables; mask before the warehouse, not after; let the orchestrator alert before 4am, not at 4am.
> 
> 1. Log-based CDC reads the WAL/binlog the database is already writing. Zero query load on the OLTP, which is the constraint the product team set.
> 2. Sensitive fields are masked or hashed at the connector or in the staging step, before the warehouse holds them. The warehouse never sees raw email or phone.
> 3. An orchestrator owns the 4am deadline: each transform task has its own SLA, and alerts fire before 4am if any stage is at risk.

---

### Walk the requirements

#### Step 1: Capture changes off the log, not by querying tables

The product team has refused query load on production. Log-based CDC (Debezium, Datastream, GoldenGate) reads the WAL or binlog the database is already writing for replication, runs as a separate process, and adds zero read load. The change events flow into a stream, then into the warehouse. The version that schedules nightly SELECTs is the version that gets the data team blocked from production.

#### Step 2: Land transformed data before 4am, with alerting before

Analysts open dashboards at 4am. The orchestrator schedules the CDC consumer, the normalization transforms (which combine the messy split tables into one consumable model), and the warehouse load, with the whole chain sized to land before 4am with margin. Sensors on each stage alert before 4am if anything is at risk, so on-call has hours to fix, not minutes. Without an orchestration layer there's nothing watching the deadline.

#### Step 3: Mask sensitive fields before the warehouse, not after

Raw email and phone can't sit in the analytics warehouse. The masking step runs at the connector or in the staging transform: emails become hashed identifiers (or domain-only), phones become null or last-four. The warehouse only ever holds the masked form. A 'we'll mask in the BI tool' approach leaves raw data in tables an analyst can query with a different tool, which is the audit failure the requirement names. Masking belongs at the boundary, before the warehouse.

---

### The shape that fits

> **What this design gives up**
>
> Log-based CDC means running connector infrastructure (Kafka Connect with Debezium, or a managed equivalent) and managing a replication slot the DBA will worry about. Masking before the warehouse means the analytics warehouse can never reverse a hash if a regulator asks; raw values stay only in production. The orchestrator and the per-stage sensors are infrastructure that has to be operated. More moving parts is the cost; in return, zero query load on production, dashboards that are ready at 4am, and a warehouse that survives a privacy review.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A change-data-capture path reads the production log without adding query load.
> - An orchestrator sequences the overnight DAG with per-stage SLA alerting before the 4am dashboard window.
> - Sensitive fields are masked at the boundary so the warehouse never holds raw PII.

> **The mistake that ships**
>
> The first version out the door uses nightly SELECTs against production, masks 'in the BI tool,' and runs without per-stage SLA monitoring. The DBA notices the read load and the data team gets blocked from production. The team moves the extract earlier; production maintenance bumps it later; analysts open empty dashboards at 4am for a week. Privacy review finds raw email in the warehouse and the team takes a finding. The eventual rebuild is log-based CDC, masking at the boundary, and orchestrated SLAs. The conversation that started at 'set up nightly extracts' had to learn each of those properties the hard way.

---

## Common follow-up questions

- A new column is added to a production table. What changes in the connector, and what changes downstream? _(Tests whether the candidate sees that log-based CDC connectors have to handle schema evolution gracefully (additive columns flow through; renames or type changes need attention), and that the warehouse layer absorbs the new column either by adding it to the staging table or by rolling it into the gold model with a default.)_
- An analyst asks for a raw email lookup to debug a customer issue. Where do they go, and why isn't the warehouse the answer? _(Tests whether the candidate has actually drawn the privacy boundary. The warehouse holds only masked email; raw email lives in production. The lookup goes through a controlled, audited path against production (or a separate, restricted-access tooling), not through analytics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_analysts_cannot_touch_production)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.