# Five Years of Cron Jobs

> Half the jobs run on cron. Half run on events. All of it has to move.

Canonical URL: <https://datadriven.io/problems/five_years_of_cron_jobs>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We have a data platform that has grown organically over five years: some pipelines run on-premises as scheduled cron jobs, others are event-driven workflows triggered by upstream system callbacks. We need to migrate everything to a single cloud-native orchestration platform without disrupting the analytics team. Design the migration architecture and the logging and observability schema that supports it.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a migration question where the migration is the architecture: the new orchestrator has to handle both scheduled and event-driven pipelines, dual-run alongside the legacy ones, gate cutover on output parity, and produce a queryable run log that auditors can read for years. The trap is treating it as a one-time port, then discovering the gaps when the analytics team notices missing or mismatched data, or when auditors ask about a run whose only record is a flat log file.

The default reach is to port pipelines one at a time, point analytics at the new ones when they look correct, and turn off cron after a quick check. The first migrated pipeline produces output that almost matches the legacy run, and the analytics team flags discrepancies a week in. Event-driven pipelines lose their trigger semantics because the new platform was set up for scheduled runs; an upstream callback that used to fire a cleanup job immediately now waits until the next cron tick. Auditors ask for the lineage of a run from three months ago and the answer is grepping log files.

> **Trick to Solving**
>
> Dual-run with parity gate, preserve trigger semantics by class, structured run log to durable storage, cutover only after the diff is acceptable.
>
> 1. Each migrating pipeline runs alongside the legacy one for a defined window; a daily diff job compares outputs and the orchestrator gates cutover on tolerance.
> 2. The new orchestrator handles both scheduled and event-driven triggers; event-driven pipelines keep their callback semantics rather than collapsing to a shared cadence.
> 3. Every run writes a structured record (run id, pipeline, start, end, status, inputs, outputs, error) to a durable run log retained for the regulatory window.
> 4. Cutover is config: when a pipeline's diff has been acceptable for the agreed window, the orchestrator switches the canonical pointer from legacy to cloud.

---

### Walk the requirements

#### Step 1: Dual-run during a defined window so analytics sees no gap

Each migrating pipeline runs in both the cloud orchestrator and the legacy system for a defined window (one to several weeks per pipeline depending on cadence). Both write to their respective output paths; a daily diff compares them. Analytics keeps reading the legacy outputs until cutover. A 'port and switch over the weekend' approach is the version where analytics flags discrepancies a week later because nobody compared run-by-run; dual-run with a diff is what makes the change invisible to analytics.
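A minimal sketch of the daily diff, assuming each output can be reduced to a mapping from row key to a numeric value (real outputs are more likely tables compared by primary key). The names `ParityResult`, `parity_diff`, and `rel_tol` are illustrative and do not come from any particular tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ParityResult:
    missing_in_cloud: int  # keys present only in the legacy output
    extra_in_cloud: int    # keys present only in the cloud output
    mismatched: int        # shared keys whose values differ beyond rel_tol
    compared: int          # shared keys that were compared

    @property
    def mismatch_rate(self) -> float:
        """Fraction of all observed keys that are missing, extra, or mismatched."""
        total = self.compared + self.missing_in_cloud + self.extra_in_cloud
        bad = self.mismatched + self.missing_in_cloud + self.extra_in_cloud
        return bad / total if total else 0.0


def parity_diff(
    legacy: dict[str, float],
    cloud: dict[str, float],
    rel_tol: float = 1e-9,
) -> ParityResult:
    """Compare one day's legacy and cloud outputs key by key."""
    shared = legacy.keys() & cloud.keys()
    mismatched = sum(
        1 for key in shared
        if not math.isclose(legacy[key], cloud[key], rel_tol=rel_tol)
    )
    return ParityResult(
        missing_in_cloud=len(legacy.keys() - cloud.keys()),
        extra_in_cloud=len(cloud.keys() - legacy.keys()),
        mismatched=mismatched,
        compared=len(shared),
    )
```

Counting missing and extra keys alongside value mismatches is a deliberate choice in this sketch: a cloud run that silently drops rows should fail parity even if every row it did produce matches.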

#### Step 2: Structured run log to durable storage, queryable for years

Every run (cloud and legacy during dual-run, cloud only after cutover) writes a structured record to a durable run log: pipeline id, run id, trigger type, start, end, status, input pointers, output pointers, error context. The log lands in cold storage with a queryable layer over it. When an auditor asks 'what ran last March 15 and what did it produce,' the answer is a SQL query, not a grep through text logs. Without the log, audit becomes a forensic exercise nobody finishes.

#### Step 3: Preserve event-driven semantics for the pipelines that need them

A large share of the pipelines fire on upstream callbacks rather than on a schedule. The cloud orchestrator has to support event triggers (webhook, message, signal) so those pipelines fire when the upstream is ready, not when a cron tick happens to fit. Collapsing all triggers to scheduled is the version where event-driven pipelines now lag by up to one schedule interval, and downstream consumers expecting near-real-time behavior find out the migration changed their freshness budget.
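A toy sketch of one registry serving both trigger classes, to show that event-driven pipelines do not have to pass through the scheduler at all. `Orchestrator`, `ScheduleTrigger`, `EventTrigger`, and the minute-based tick are hypothetical simplifications, not the API of any real orchestration product.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass(frozen=True)
class ScheduleTrigger:
    every_minutes: int  # fire when the tick minute is a multiple of this


@dataclass(frozen=True)
class EventTrigger:
    topic: str  # fire as soon as an event arrives on this topic


Trigger = Union[ScheduleTrigger, EventTrigger]
RunFn = Callable[[dict], None]


@dataclass
class Orchestrator:
    pipelines: dict[str, tuple[Trigger, RunFn]] = field(default_factory=dict)

    def register(self, pipeline_id: str, trigger: Trigger, run: RunFn) -> None:
        self.pipelines[pipeline_id] = (trigger, run)

    def on_tick(self, minute: int) -> list[str]:
        """Run every scheduled pipeline that is due at this minute."""
        fired = []
        for pipeline_id, (trigger, run) in self.pipelines.items():
            if isinstance(trigger, ScheduleTrigger) and minute % trigger.every_minutes == 0:
                run({"trigger_type": "schedule", "minute": minute})
                fired.append(pipeline_id)
        return fired

    def on_event(self, topic: str, payload: dict) -> list[str]:
        """Run every event-driven pipeline subscribed to this topic, immediately."""
        fired = []
        for pipeline_id, (trigger, run) in self.pipelines.items():
            if isinstance(trigger, EventTrigger) and trigger.topic == topic:
                run({"trigger_type": "event", "payload": payload})
                fired.append(pipeline_id)
        return fired
```

The point of the split is that `on_event` never waits for `on_tick`: a cleanup job registered with an `EventTrigger` runs when the callback arrives, which is the semantics the legacy system already had.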

#### Step 4: Cutover is config, gated on the parity diff

When a pipeline's daily diff has stayed within tolerance for the agreed window, the orchestrator flips the canonical pointer from legacy to cloud. Analytics consumers read from a logical path that resolves to either the legacy or the cloud output, so they start reading the cloud output without changing their queries. A 'manual cutover Monday morning' approach is fragile; the gate gives every cutover a documented reason to flip, with the diff history queryable from the run log.
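The gate can be sketched as a pure function over the diff history, assuming each daily diff has been summarized as a single mismatch rate. `GatePolicy`, `in_tolerance_streak`, `resolve_canonical`, and the example thresholds are illustrative names and values, not agreed numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    max_mismatch_rate: float  # tolerance for a single daily diff
    required_days: int        # consecutive in-tolerance days needed to cut over


def in_tolerance_streak(daily_rates: list[float], max_rate: float) -> int:
    """Count the most recent consecutive days whose diff stayed within tolerance."""
    streak = 0
    for rate in reversed(daily_rates):  # walk back from the latest day
        if rate > max_rate:
            break
        streak += 1
    return streak


def resolve_canonical(
    daily_rates: list[float],
    policy: GatePolicy,
    legacy_path: str,
    cloud_path: str,
) -> str:
    """Return the physical path the logical (canonical) path should point to."""
    streak = in_tolerance_streak(daily_rates, policy.max_mismatch_rate)
    return cloud_path if streak >= policy.required_days else legacy_path
```

Because the pointer is recomputed from history rather than flipped once, a later out-of-tolerance day resets the streak and the same function resolves back to the legacy path, which is one way to get rollback without a separate mechanism.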

---

### The shape that fits

> **What this design gives up**
>
> Dual-run roughly doubles compute for the migration window. Event-trigger support adds a control plane the platform has to operate. The structured run log grows for years. Implementation cost is the price; the win is analytics that doesn't see a freshness gap, audit answers that come from a query, event semantics that don't get lost in translation, and cutover decisions backed by a numerical gate.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer handles both scheduled and event-driven triggers in one platform.
> - Each migrating pipeline dual-runs with a daily parity diff against the legacy output and the cutover is gated on the diff being within tolerance.
> - Every run writes a structured record to a durable, queryable log retained for the regulatory window.
> - Cutover is a config flip after the parity gate is met, not a one-time manual port.

> **The mistake that ships**
>
> The version that ships ports pipelines one by one and switches over after a quick spot-check. Analytics flags discrepancies a week after the first cutover because the team hasn't been comparing run-by-run. Event-driven pipelines lose their trigger semantics because the new platform was set up for cron; downstream consumers find out their freshness budget changed. An auditor asks for a run from three months ago and the answer is grepping through text logs that nobody can search across. The eventual rebuild adds dual-run, parity gating, event-trigger support, and a structured run log; each was reachable up front if the team had treated the migration as architecture rather than a one-time port.

---

## Common follow-up questions

- A migrated pipeline's diff has been within tolerance for two weeks but spikes once on a quarter-end run. How does this design handle the spike? _(Tests whether the candidate sees the parity gate as a recurring check, not a one-time snapshot: the diff spike halts cutover (or rolls back if already cut over), the team investigates whether the spike is a bug in the legacy pipeline or in the new one, and the gate stays closed until the discrepancy is explained and either fixed or accepted with a documented reason.)_
- An auditor asks for the input data of a run from two years ago. Where does the design point them, and what's not stored that the auditor might have expected? _(Tests whether the candidate sees the run log as a record of what ran (pointers to inputs and outputs at the time) rather than a copy of the data. The actual input data lives in cold storage governed by its own retention; the log lets the auditor find it but doesn't itself store it.)_

---

Source: DataDriven (https://datadriven.io).