# The Provider That Sometimes Sleeps

> The models run at dawn. The data has to be there first.

Canonical URL: <https://datadriven.io/problems/the_provider_that_sometimes_sleeps>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

Our quantitative research team depends on daily price and volume data pulled from an external provider. The data feeds backtesting and risk models, and the current manual pull process has caused missed trading sessions. Design a reliable, automated ingestion pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

The scenario is a daily pull from a paid provider with a 7am deadline for the quant models, multi-hour outages a few times a year, and a licensing requirement that raw files be retained unchanged. The trap is twofold: a naive retry loop that burns the daily request budget during an outage, and treating retention as 'we'll keep it somewhere.'

The default reach is a manual pull script that runs at midnight with a retry loop on failure. The first multi-hour outage burns the daily request budget before the provider recovers, and the next morning's models miss data. Raw files get overwritten when the team reruns the pull because retention was never an explicit boundary.

> **Trick to Solving**
>
> Orchestrated pull with bounded retries and exponential backoff, raw files stored immutably for audit, alerting before 7am if anything is at risk.
>
> 1. The orchestrator schedules the pull and applies bounded exponential-backoff retries; once the daily budget is exhausted, it stops retrying and pages on-call.
> 2. Each pulled raw file lands in immutable cold storage with the licensing-required retention; nothing overwrites it after ingestion.
> 3. Sensors fire ahead of the 7am deadline if the file isn't ready so on-call has hours, not minutes.

---

### Walk the requirements

#### Step 1: Pull on schedule with bounded retries that respect the budget

The orchestrator runs the daily pull on a schedule. Failures retry with exponential backoff up to a budget; once the budget is exhausted, the orchestrator pages on-call rather than burning more requests on a provider that's down. With a naive retry loop, a multi-hour outage drains the daily budget before the provider recovers; bounding the retries is what keeps the next day's pull intact.
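
The retry behavior described above can be sketched in a few lines. This is a minimal illustration, not a specific orchestrator's API: `RetryPolicy`, `pull_with_bounded_retries`, the `fetch` and `page` callables, and every numeric default are hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


class BudgetExhausted(Exception):
    """The pull gave up because its attempt or request budget ran out."""


@dataclass(frozen=True)
class RetryPolicy:
    # All values are illustrative assumptions, not provider limits.
    max_attempts: int = 6        # cap on attempts in one scheduled run
    base_delay_s: float = 60.0   # wait after the first failure
    max_delay_s: float = 1800.0  # ceiling for any single wait
    request_budget: int = 10     # requests this run may spend today


def backoff_delays(policy: RetryPolicy) -> List[float]:
    """Waits between attempts: base * 2**n, capped at max_delay_s."""
    return [
        min(policy.base_delay_s * 2 ** n, policy.max_delay_s)
        for n in range(max(policy.max_attempts - 1, 0))
    ]


def pull_with_bounded_retries(
    fetch: Callable[[], bytes],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
    page: Callable[[str], None] = print,
) -> bytes:
    """Try the pull a bounded number of times, then page and stop."""
    # Never spend more requests than the budget allows.
    allowed = max(min(policy.max_attempts, policy.request_budget), 0)
    delays = backoff_delays(policy)
    last_error = None
    for attempt in range(allowed):
        try:
            return fetch()
        except Exception as exc:  # a real job would catch narrower errors
            last_error = exc
            if attempt < allowed - 1:
                sleep(delays[attempt])  # back off before the next attempt
    # Budget exhausted: hand the problem to a human instead of retrying.
    page(f"daily pull gave up after {allowed} attempts: {last_error!r}")
    raise BudgetExhausted(f"stopped after {allowed} attempts") from last_error
```

Passing `sleep` and `page` as parameters keeps the sketch testable without real waiting or a real paging system; in an orchestrator these would typically map onto its own retry and alerting settings.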

#### Step 2: Raw files land immutably for audit

Each pulled file is written to cold storage under the licensing-required retention; the storage policy denies modification and deletion. A rerun of the pull writes a new versioned file; the original stays. Without immutability, an audit can't prove that a file is as received; the immutable archive is the contract.
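
The write-once, versioned layout can be sketched against a local directory. This is an assumption-heavy illustration: the `<root>/<date>/vNNNN.raw` layout and the `archive_raw_file` and `as_received` helpers are hypothetical, and the actual deny-modify and retention policy would be enforced by the storage layer, not by this code.

```python
import hashlib
from datetime import date
from pathlib import Path


def archive_raw_file(root: Path, trade_date: date, payload: bytes) -> Path:
    """Write the payload as a new version; never touch earlier versions."""
    day_dir = root / trade_date.isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(day_dir.glob("v*.raw"))) + 1
    while True:
        target = day_dir / f"v{version:04d}.raw"
        try:
            # Mode "xb" is exclusive creation: it fails if the file exists,
            # so a rerun cannot overwrite a file that was already archived.
            with open(target, "xb") as fh:
                fh.write(payload)
            break
        except FileExistsError:
            version += 1  # another writer took this version; use the next
    # Store a checksum next to the file so an audit can verify the bytes.
    checksum = target.with_suffix(".sha256")
    with open(checksum, "x") as fh:
        fh.write(hashlib.sha256(payload).hexdigest())
    return target


def as_received(root: Path, trade_date: date) -> Path:
    """Return the first archived version for a date: the file as received."""
    versions = sorted((root / trade_date.isoformat()).glob("v*.raw"))
    if not versions:
        raise FileNotFoundError(f"no raw file archived for {trade_date}")
    return versions[0]
```

A rerun for the same date produces `v0002.raw` beside `v0001.raw`, so "what did we get" stays answerable even after later pulls.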

#### Step 3: Alert before 7am, not at 7am

Sensors fire before 7am if the daily file hasn't arrived or hasn't loaded. On-call has hours to chase the provider, fire a manual pull, or escalate. With a 'we'll find out at 7am when models break' approach, the quant team is the first to notice; sensors ahead of the deadline give on-call a window to act.
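
A pre-deadline check might look like the sketch below. The three-hour lead time, the `check_readiness` name, and the two boolean inputs are assumptions for illustration; a real sensor would read arrival and load status from the archive and the warehouse, and `now` is assumed to be in the same local time zone as the 7am deadline.

```python
from datetime import datetime, time, timedelta
from typing import Optional, Tuple


def check_readiness(
    now: datetime,
    archived: bool,
    loaded: bool,
    deadline: time = time(7, 0),
    warn_lead: timedelta = timedelta(hours=3),  # hypothetical lead time
) -> Optional[Tuple[str, str]]:
    """Return (severity, reason) if today's data is at risk, else None."""
    if archived and loaded:
        return None  # data is ready; nothing to report
    deadline_dt = datetime.combine(now.date(), deadline)
    if now < deadline_dt - warn_lead:
        return None  # too early to worry; the scheduled pull may still land
    if not archived:
        reason = "raw file has not arrived"
    else:
        reason = "raw file arrived but has not loaded"
    # Before the deadline this is a warning; at or after it, a breach.
    severity = "at_risk" if now < deadline_dt else "breach"
    return severity, reason
```

Run on a short interval from the orchestrator, a check like this returns `at_risk` from 4am onward under the assumed lead time, which is what gives on-call hours rather than minutes.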

---

### The shape that fits

> **What this design gives up**
>
> Bounded retries mean the pipeline gives up during a long outage rather than burning the budget; immutable retention costs storage proportional to the licensing window; the orchestrator's sensors and alerts are infrastructure to operate. Implementation cost is the price; the win is models that run on time with the prior day's data, an audit answer that points at unchanged files, and on-call seeing trouble before quants do.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer schedules the daily pull with bounded retries and backoff so a multi-hour outage doesn't burn the request budget.
> - Raw files land in immutable cold storage; nothing modifies or deletes them after ingestion.
> - Sensors fire before the 7am deadline if the file is at risk.

> **The mistake that ships**
>
> What gets shipped runs a manual pull with a naive retry loop. A multi-hour provider outage burns the daily budget on retries; the next morning's models miss data. Raw files get overwritten on rerun. Quants find out about the missing data at 7am. The eventual rebuild adds bounded retries, the immutable archive, and pre-deadline alerting.

---

## Common follow-up questions

- The provider has a documented multi-hour maintenance window. How does this design handle a known outage without firing alerts every minute? _(Tests whether the candidate sees that scheduled maintenance windows are a config the orchestrator reads; retries pause during the window and resume afterward, with a single alert if the post-window pull still fails. The orchestrator distinguishes scheduled outages from unexpected ones.)_
- The audit team asks for the file received on a date six months ago. Where does this design point them? _(Tests whether the candidate sees the raw archive's per-day immutability: a query by date returns the file as received. The warehouse's loaded data may have been transformed; the archive answers 'what did we get' separately from 'what did we load.')_
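
For the maintenance-window follow-up above, one way to express "pause during the window, alert once afterward" is a small decision function. Everything here is hypothetical: the `MaintenanceWindow` type, the `next_action` name, and the action strings are illustrative, and the window list stands in for whatever config the orchestrator actually reads.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime  # inclusive
    end: datetime    # exclusive


def in_window(now: datetime, windows: Sequence[MaintenanceWindow]) -> bool:
    """True if now falls inside any documented maintenance window."""
    return any(w.start <= now < w.end for w in windows)


def next_action(
    now: datetime,
    windows: Sequence[MaintenanceWindow],
    last_pull_failed: bool,
    already_alerted: bool,
) -> str:
    """Decide what the orchestrator does after the latest pull attempt."""
    if not last_pull_failed:
        return "ok"
    if in_window(now, windows):
        return "pause"  # known outage: no retries, no alerts
    if already_alerted:
        return "retry"  # on-call already knows; avoid repeat alerts
    return "alert_and_retry"  # failing outside a window: alert once
```

The `already_alerted` flag is what keeps a persistent failure from paging every minute; retries outside the window would still be bounded by the retry budget.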

---

Source: DataDriven (https://datadriven.io).