# The Migration That Cannot Break Morning

> It all works today. Moving it without losing a single report is the hard part.

Canonical URL: <https://datadriven.io/problems/the_migration_that_cannot_break_morning>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

Our data platform has grown on-premises over many years. The business has decided to migrate everything to the cloud, and there are over 60 production pipelines with complex inter-pipeline dependencies. Design the migration architecture.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem tests migrating over 60 production pipelines to the cloud without missing a 6am business-report SLA, while handling inter-DAG dependencies and per-pipeline rollback. The trap is cutting pipelines over out of dependency order and breaking consumers, or scheduling high-risk changes mid-week.

The default reach is to migrate pipeline by pipeline as the team gets to them. The first time pipelines are cut over out of dependency order, a downstream's input format changes underneath it and the morning report breaks. A bad cutover happens mid-week and rolling back takes most of a day, so a different pipeline's morning is missed. Some pipelines get no parallel run because the team trusts the migration after spot checks.

> **Trick to Solving**
>
> Phased migration in dependency order, weekend-only high-risk changes, per-pipeline rollback, dual-run with parity gating.
> 
> 1. The orchestrator runs pipelines on cloud or on-prem during the migration; cutover happens in dependency order so a downstream is never migrated before its upstream.
> 2. High-risk changes (cutover) happen only on weekends; weekday changes are restricted to lower-risk operations.
> 3. Per-pipeline rollback flips the canonical pointer for that pipeline back to on-prem; the rest don't move.
> 4. Each migrating pipeline dual-runs with parity gating before cutover.
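
The parity gate in item 4 can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `RunSummary`, `cutover_allowed`, and the `required_clean_runs` threshold are hypothetical names and values, and a real comparison would likely look at more than a row count and a checksum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Summary of one run's output on one side (hypothetical shape)."""
    row_count: int
    checksum: str  # for example, a hash over the sorted output rows


def runs_match(on_prem: RunSummary, cloud: RunSummary) -> bool:
    """A dual-run is clean when both sides produced identical summaries."""
    return on_prem == cloud


def cutover_allowed(history: list[tuple[RunSummary, RunSummary]],
                    required_clean_runs: int = 14) -> bool:
    """Allow cutover only after the most recent N dual-runs all matched.

    required_clean_runs is an assumed threshold, not a value from the text.
    """
    if len(history) < required_clean_runs:
        return False  # not enough dual-run evidence yet
    recent = history[-required_clean_runs:]
    return all(runs_match(on_prem, cloud) for on_prem, cloud in recent)
```

Requiring consecutive clean runs, instead of a single spot check, is what separates a gate from the "trust it after spot checks" habit described above.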

---

### Walk the requirements

#### Step 1: Morning reports run through the migration window

The orchestrator schedules each pipeline on whichever side is authoritative that day and tracks both sides in one 6am SLA view. Sensors fire ahead of the deadline if a pipeline on either side is at risk, so on-call has hours to recover. High-risk changes (cutover) are restricted to weekends so a bad change surfaces with the weekend as a buffer instead of hours before a weekday's reports are due. Without the orchestration layer there's nothing watching the deadline across both sides.
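
A sensor of this kind might look like the sketch below. It assumes the orchestrator can report, for each pipeline, which side is authoritative and an estimate of remaining runtime; `PipelineStatus`, `at_risk`, and the one-hour buffer are illustrative choices, not part of the original design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PipelineStatus:
    """State of one pipeline on its authoritative side (hypothetical shape)."""
    name: str
    side: str                      # "on_prem" or "cloud"
    finished: bool
    expected_remaining: timedelta  # estimated runtime still to go


def at_risk(statuses, now, deadline, buffer=timedelta(hours=1)):
    """Return names of unfinished pipelines projected to land too late.

    A pipeline is at risk when its projected finish time falls after
    (deadline - buffer). The check is the same for both sides, which is
    what gives on-call a single SLA view.
    """
    cutoff = deadline - buffer
    return [s.name for s in statuses
            if not s.finished and now + s.expected_remaining > cutoff]
```

With `now` at 03:00 and a 06:00 deadline, a pipeline with 2.5 hours remaining is flagged while one with 1 hour remaining is not, regardless of which side it runs on.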

#### Step 2: Migrate in dependency order; downstreams never miss their inputs

Many DAGs read outputs from other DAGs. The migration plan walks the dependency graph and migrates upstreams before downstreams; a downstream's input shape doesn't change until the upstream's cutover is complete. A 'whichever pipeline is easiest first' approach is the version where a downstream's input changes mid-week and the morning report breaks; dependency-ordered cutover is the contract that prevents it.
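
Walking the dependency graph is a topological sort. The sketch below uses Python's standard-library `graphlib` to group pipelines into cutover waves; the function name `cutover_waves` and the example pipeline names are hypothetical.

```python
from graphlib import TopologicalSorter


def cutover_waves(upstreams_of: dict[str, set[str]]) -> list[list[str]]:
    """Group pipelines into waves so every upstream cuts over first.

    upstreams_of maps each pipeline to the set of pipelines it reads from.
    Pipelines in the same wave do not depend on each other.
    """
    sorter = TopologicalSorter(upstreams_of)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all upstreams already migrated
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency graph for illustration.
example = {
    "daily_revenue_report": {"orders_clean", "fx_rates"},
    "orders_clean": {"orders_raw"},
    "fx_rates": set(),
    "orders_raw": set(),
}
```

For the example graph, `cutover_waves(example)` yields `[["fx_rates", "orders_raw"], ["orders_clean"], ["daily_revenue_report"]]`, so the report moves last.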

#### Step 3: Per-pipeline rollback so a misbehaving pipeline reverts alone

When a migrated pipeline misbehaves after cutover, the per-pipeline rollback flips the canonical pointer for that pipeline back to on-prem. The other migrated pipelines stay on cloud. The rollback takes hours, not days, because it's a routing change, not a re-migration. A 'big bang rollback' is the version where one bad pipeline drags the rest back; per-pipeline rollback is what isolates the failure.
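
One way to picture the canonical pointer is a routing table keyed by pipeline. The sketch below is an assumption about how that state could be held; `RoutingTable`, `Side`, and the method names are hypothetical, and a production version would persist the table and audit every flip.

```python
from enum import Enum


class Side(Enum):
    ON_PREM = "on_prem"
    CLOUD = "cloud"


class RoutingTable:
    """Maps each pipeline to the side whose output is canonical."""

    def __init__(self, pipelines):
        # Everything starts on-prem; cutover flips one pointer at a time.
        self._canonical = {name: Side.ON_PREM for name in pipelines}

    def canonical_side(self, pipeline):
        return self._canonical[pipeline]

    def cut_over(self, pipeline):
        self._set(pipeline, Side.CLOUD)

    def roll_back(self, pipeline):
        # Touches only this pipeline's pointer; every other entry is untouched.
        self._set(pipeline, Side.ON_PREM)

    def _set(self, pipeline, side):
        if pipeline not in self._canonical:
            raise KeyError(f"unknown pipeline: {pipeline}")
        self._canonical[pipeline] = side
```

Because rollback is a single-key update, reverting one pipeline cannot move any other pipeline back to on-prem.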

---

### The shape that fits

> **What this design gives up**
>
> Phased migration takes longer than parallel migration; weekend-only high-risk changes constrain the change window; per-pipeline rollback requires the orchestrator to manage routing for each pipeline. Implementation cost is the price; the win is morning reports that run through the migration, downstreams that don't break on upstream cutover, and per-pipeline rollback in hours.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer schedules pipelines on cloud or on-prem during the migration with one SLA view.
> - Cutover happens in dependency order; a downstream is never migrated before its upstream.
> - Per-pipeline cutover and rollback so a misbehaving pipeline reverts to on-prem without disturbing the others.
> - High-risk changes happen on weekends only.
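
The weekend-only rule is simple enough to enforce in code before any change is applied. A minimal sketch, assuming the set of high-risk operations is just cutover and that the operation names are hypothetical labels:

```python
from datetime import datetime

# Assumed classification; a real policy would define its own list.
HIGH_RISK = {"cutover"}


def change_allowed(operation: str, when: datetime) -> bool:
    """High-risk operations run only on Saturday or Sunday; others any day."""
    is_weekend = when.weekday() >= 5  # Monday is 0, Saturday is 5, Sunday is 6
    return is_weekend or operation not in HIGH_RISK
```

A Tuesday-night cutover request is rejected, while enabling a dual-run on the same night passes.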

> **The mistake that ships**
>
> What gets shipped migrates pipelines as the team gets to them, schedules cutovers mid-week, and rolls back the entire migration when one pipeline misbehaves. Pipelines get cut over out of dependency order and a downstream's morning report breaks. A Tuesday cutover gone wrong takes Wednesday's reports out across multiple downstreams. The eventual rebuild adds dependency-ordered cutover, weekend-only high-risk changes, and per-pipeline routing for fast individual rollback.

---

## Common follow-up questions

- Two pipelines depend on a third that's still on-prem during their cloud cutover. What does this design do? _(Tests whether the candidate sees the orchestrator's DAG spanning both sides; the cloud pipelines read the on-prem upstream through the routing layer until the upstream is also migrated. The routing layer resolves cross-side reads; the dependency order means the upstream normally cuts over first, so a cross-side read is the exception.)_
- A migrated pipeline's parity diff has been clean for two weeks but spikes once on a quarter-end run. How does this design respond? _(Tests whether the candidate sees the parity gate as a recurring check: the diff spike halts cutover, or triggers rollback if the pipeline is already cut over; the team investigates whether the spike is a legacy bug or a new bug; and the gate stays closed until the discrepancy is explained or accepted.)_
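
The response to the quarter-end spike in the second follow-up depends on where the pipeline is in its migration. A hedged sketch of that decision, with `Stage`, `Action`, `respond_to_parity`, and the zero default tolerance all being illustrative assumptions:

```python
from enum import Enum


class Stage(Enum):
    DUAL_RUNNING = "dual_running"  # both sides run, on-prem is canonical
    CUT_OVER = "cut_over"          # cloud is canonical, parity is still checked


class Action(Enum):
    CONTINUE = "continue"
    HALT_CUTOVER = "halt_cutover"
    ROLL_BACK = "roll_back"


def respond_to_parity(stage: Stage, diff_ratio: float,
                      tolerance: float = 0.0) -> Action:
    """Decide what a parity result means for one pipeline on one run.

    diff_ratio is the fraction of output that differs between the two sides.
    """
    if diff_ratio <= tolerance:
        return Action.CONTINUE
    if stage is Stage.CUT_OVER:
        return Action.ROLL_BACK   # cloud is canonical, so revert the pointer
    return Action.HALT_CUTOVER    # on-prem is still canonical, so just block
```

Running the check on every run, not only before cutover, is what lets a single quarter-end discrepancy surface after two clean weeks.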

---

Source: DataDriven (https://datadriven.io).