# Out of the Data Center

> The on-prem servers are not getting any younger.

Canonical URL: <https://datadriven.io/problems/out_of_the_data_center>

Domain: Pipeline Design · Difficulty: medium · Seniority: L6

## Problem

We have a legacy data platform running on-premises that is expensive to maintain and can't scale. We need to move our data pipelines to the cloud without disrupting the analytics team or breaking downstream reports. Design the migration architecture.

## Worked solution and explanation

### Why this problem exists in real interviews

A cloud migration is mostly a discipline problem disguised as a tech problem. Three prior attempts failed; leadership wants proof, executives expect dashboards every morning, and a rollback for one pipeline can't take the rest with it. The trap is treating it as 'lift and shift' and discovering on cutover Monday that something doesn't match, with no way to revert that one pipeline without rolling back every other pipeline that migrated alongside it.

The default reach is to build the cloud version, run it once on a Sunday, and switch the dashboards over Monday morning. The first pipeline doesn't quite match the on-prem output and analytics flags it midweek. There's no individual rollback path so the team rolls back everything that migrated that weekend; the next attempt waits a quarter. Connection strings changed for analytics users and workbooks broke too. Leadership reads the postmortem and the question becomes 'what's different this time.'

> **Trick to Solving**
>
> Per-pipeline dual-run with a parity gate and per-pipeline rollback, a stable connection layer for analytics, and an orchestrator that owns morning SLAs across both halves of the migration.
>
> 1. Each pipeline runs in both cloud and on-prem during a defined dual-run window; a daily diff job feeds a parity gate the orchestrator uses to decide cutover.
> 2. Per-pipeline cutover means each pipeline flips independently; a misbehaving one rolls back to its on-prem version while the rest continue.
> 3. Analytics users connect to a logical/aliased connection that resolves to whichever warehouse is currently authoritative for that pipeline; a cutover or rollback is a connection-routing change, not a per-user rebuild.
> 4. The orchestrator owns the morning-SLA contract across both halves of the migration: regardless of which side ran the pipeline today, the data is in the analytics layer by 7:30am.

---

### Walk the requirements

#### Step 1: Morning dashboards keep working through the migration window

The orchestrator schedules each pipeline on whichever side is authoritative on a given day (some still on-prem only, some dual-running on both sides, some already cut over to cloud) and gates the morning SLA across both. Sensors fire ahead of 7:30am if either side is at risk. Without an orchestration layer there's nothing watching the deadline across both halves; without a cloud warehouse the migration has no target.
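
The deadline check can be sketched as a small sensor function. This is a minimal illustration, not any particular orchestrator's API: `PipelineRun`, `at_risk`, and the 30-minute warning buffer are assumptions, and only the 7:30am deadline comes from the design above.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta

SLA_DEADLINE = time(7, 30)              # morning SLA from the design
WARNING_BUFFER = timedelta(minutes=30)  # assumed alert lead time, not from the source


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical record of one scheduled run on one side of the migration."""
    pipeline: str
    side: str                   # "on_prem" or "cloud"
    estimated_finish: datetime  # projection supplied by the scheduler


def at_risk(runs: list[PipelineRun], day: datetime) -> list[PipelineRun]:
    """Return runs, on either side, projected to finish too close to the deadline."""
    deadline = datetime.combine(day.date(), SLA_DEADLINE)
    threshold = deadline - WARNING_BUFFER
    return [run for run in runs if run.estimated_finish > threshold]
```

Because the check ignores which side a run is on, one sensor can watch the deadline across both halves of the migration.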

#### Step 2: Dual-run with a parity gate before any cutover

Each pipeline runs in both cloud and on-prem during the agreed window. A daily diff job compares outputs and the orchestrator publishes the result; cutover for that pipeline is gated on the diff staying within tolerance for the agreed duration. Leadership signs off on the gate's record, not on a manual eyeball comparison. Three prior attempts failed because nobody had a numerical reason to flip the switch; the parity gate is what gives this attempt one.
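
One way the parity gate could be expressed, as a sketch under stated assumptions: the `DailyDiff` fields, the mismatch-ratio definition of "within tolerance", and the `required_days` parameter are all hypothetical choices. The source only says the diff must stay within tolerance for an agreed duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyDiff:
    """Hypothetical summary of one day's on-prem vs. cloud comparison."""
    day: str              # ISO date of the comparison
    on_prem_rows: int
    cloud_rows: int
    mismatched_rows: int  # rows whose values differ between the two sides


def within_tolerance(diff: DailyDiff, tolerance: float) -> bool:
    """Assumed rule: row counts must match and the mismatch ratio must be small."""
    if diff.on_prem_rows != diff.cloud_rows:
        return False
    if diff.on_prem_rows == 0:
        return diff.mismatched_rows == 0
    return diff.mismatched_rows / diff.on_prem_rows <= tolerance


def gate_is_open(history: list[DailyDiff], tolerance: float, required_days: int) -> bool:
    """Open the gate only if the most recent `required_days` diffs all passed."""
    if required_days <= 0:
        raise ValueError("required_days must be positive")
    if len(history) < required_days:
        return False  # not enough dual-run evidence yet
    return all(within_tolerance(d, tolerance) for d in history[-required_days:])
```

The stored `history` doubles as the record leadership signs off on, and a single failing day inside the window keeps the gate closed until the window is clean again.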

#### Step 3: Per-pipeline cutover and rollback

Cutover and rollback are per-pipeline operations the orchestrator handles independently. When pipeline A's parity gate is met, A cuts over to cloud while B and C are still dual-running. If A misbehaves after cutover, A reverts to on-prem without touching B or C's migration state. A 'big bang' Monday cutover puts every pipeline at risk if any one of them fails; the lack of per-pipeline isolation is part of what sank the three prior attempts, and it is what this attempt has to fix.
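
The per-pipeline control plane can be pictured as a small state machine keyed by pipeline name. The sketch below is illustrative only: `State`, `MigrationControl`, and the allowed transitions are assumptions, including the choice that a rollback returns a pipeline to on-prem rather than straight back into dual-run.

```python
from enum import Enum


class State(Enum):
    ON_PREM = "on_prem"    # on-prem is authoritative, no cloud run yet
    DUAL_RUN = "dual_run"  # both sides run, on-prem still authoritative
    CLOUD = "cloud"        # cloud is authoritative


class MigrationControl:
    """Hypothetical control plane: each pipeline's state changes independently."""

    def __init__(self, pipelines):
        self._state = {name: State.ON_PREM for name in pipelines}

    def state(self, pipeline):
        return self._state[pipeline]

    def start_dual_run(self, pipeline):
        self._require(pipeline, State.ON_PREM)
        self._state[pipeline] = State.DUAL_RUN

    def cut_over(self, pipeline, gate_open):
        self._require(pipeline, State.DUAL_RUN)
        if not gate_open:
            raise RuntimeError(f"parity gate is closed for {pipeline}")
        self._state[pipeline] = State.CLOUD

    def roll_back(self, pipeline):
        # Assumes the on-prem version is kept runnable after cutover.
        self._require(pipeline, State.CLOUD)
        self._state[pipeline] = State.ON_PREM

    def _require(self, pipeline, expected):
        actual = self._state[pipeline]
        if actual is not expected:
            raise ValueError(f"{pipeline} is {actual.value}, expected {expected.value}")
```

Each method touches only the named pipeline's entry, so rolling A back leaves B and C in whatever state they were already in.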

#### Step 4: Analytics users connect to a stable layer regardless of side

Analytics tools connect to a logical layer (a connection alias, a routed warehouse, or a view layer over both) that resolves to whichever side is authoritative for each table. A pipeline cutover or rollback is a routing change at the connection layer; user workbooks keep running. A 'tell every user to update their connection on cutover day' approach is the version where workbooks break and analytics is doing connection forensics instead of analytics.
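
A minimal sketch of the routing idea, assuming a simple table-to-side lookup. `ConnectionRouter` and the endpoint hostnames in the usage note are hypothetical placeholders; a real deployment might implement the same idea with DNS aliases, a proxy, or a view layer instead.

```python
class ConnectionRouter:
    """Hypothetical logical layer: maps each table to its authoritative side."""

    def __init__(self, endpoints, default_side="on_prem"):
        if default_side not in endpoints:
            raise ValueError(f"unknown default side: {default_side}")
        self._endpoints = dict(endpoints)  # side name -> warehouse endpoint
        self._default_side = default_side
        self._routes = {}                  # table -> side, only for moved tables

    def route(self, table, side):
        """Record a cutover or rollback as a routing change for one table."""
        if side not in self._endpoints:
            raise ValueError(f"unknown side: {side}")
        self._routes[table] = side

    def resolve(self, table):
        """Return the endpoint analytics tools should read this table from."""
        side = self._routes.get(table, self._default_side)
        return self._endpoints[side]
```

For example, with placeholder endpoints `{"on_prem": "onprem-dw.example.internal", "cloud": "cloud-dw.example.internal"}`, calling `route("orders", "cloud")` changes what `resolve("orders")` returns, while workbooks keep asking the router for `orders` and never see a connection string change.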

---

### The shape that fits

> **What this design gives up**
>
> Dual-run roughly doubles compute for the migration window; per-pipeline cutover and rollback adds a control plane the orchestrator has to operate; the analytics-side stable connection layer is infrastructure that has to live on both sides of the migration. Implementation cost is the price; the win is morning dashboards that don't see a gap, leadership signing off on a numerical gate rather than a hope, individual rollback that doesn't drag the rest down, and analytics users who don't notice the migration happened.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer schedules pipelines on both cloud and on-prem during dual-run with sensors and alerts before the morning SLA on either side.
> - Each pipeline dual-runs and the cutover is gated on a daily parity diff staying within tolerance.
> - Cutover and rollback happen per-pipeline; one pipeline reverting to on-prem doesn't disturb the others.
> - Analytics users connect through a stable layer that points at whichever warehouse is authoritative for each table; user-side workbooks don't change.

> **The mistake that ships**
>
> The version that ships builds the cloud pipelines, runs them once on a Sunday, and cuts over Monday morning. The first pipeline doesn't quite match on-prem; there's no individual rollback path so everything that migrated that weekend rolls back together. Analytics workbooks broke because connection strings changed for users. Leadership reads another postmortem and the migration moves to next quarter. The eventual rebuild adds dual-run, per-pipeline parity, per-pipeline rollback, and a stable connection layer; each was reachable in the original conversation if the team had taken 'leadership won't approve without proof' as architecture instead of process.

---

## Common follow-up questions

- After cutover, a pipeline starts producing slightly different output from what it produced during dual-run. What does this design do, and where do you look first? _(Tests whether the candidate sees that post-cutover the parity gate is gone (only one side runs) and the divergence has to be diagnosed against historical dual-run records. The fix is either rolling that pipeline back to on-prem (per-pipeline rollback) or fixing the cloud version while it's still authoritative; the run log of past comparisons is where the diagnosis starts.)_
- Three pipelines depend on a fourth that's still on-prem during their dual-run. How does the design handle the dependency? _(Tests whether the candidate sees that the orchestrator's DAG spans both sides during dual-run; downstream pipelines on the cloud read from the on-prem warehouse via the connection layer until the upstream pipeline is cut over. The dependency doesn't force the upstream to migrate first if the connection layer can resolve a cross-side read.)_
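
For the dependency question, a cross-side DAG can be sketched with the standard library's `graphlib`. The function name `plan_runs`, the pipeline names in the usage note, and the dictionary shapes are assumptions for illustration; the point is only that each downstream read resolves to whichever side is authoritative for its upstream.

```python
from graphlib import TopologicalSorter


def plan_runs(deps, authoritative, run_side):
    """Order pipelines by dependency and resolve where each upstream is read from.

    deps:          pipeline -> set of upstream pipelines it depends on
    authoritative: pipeline -> side whose output is currently authoritative
    run_side:      pipeline -> side this plan executes the pipeline on
    """
    plan = []
    for pipeline in TopologicalSorter(deps).static_order():
        # Each upstream is read from its authoritative side, even if that
        # differs from the side the downstream pipeline runs on.
        reads = {up: authoritative[up] for up in sorted(deps.get(pipeline, ()))}
        cross_side = [up for up, side in reads.items() if side != run_side[pipeline]]
        plan.append({
            "pipeline": pipeline,
            "runs_on": run_side[pipeline],
            "reads": reads,
            "cross_side": cross_side,
        })
    return plan
```

With three cloud-side reports depending on an on-prem `base` pipeline, every report's plan entry lists `base` under `cross_side`; once `base` cuts over and its authoritative side flips to cloud, the same call returns no cross-side reads and the downstream pipelines need no changes.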

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.