# The What-If Machine

> A million slots. A thousand campaigns. Every combination matters.

Canonical URL: <https://datadriven.io/problems/the_what_if_machine>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

We run an ads platform and want to build a simulation system that matches ad inventory (slots) against ad campaigns to answer what-if questions about fill rates, reach, and frequency. Users should be able to configure a simulation, submit it, and explore the results later. A single configuration might run up to 1,000 simulations. Design the data pipeline behind this system.

## Worked solution and explanation

### Why this problem exists in real interviews

The hard part isn't the matching logic; it's the contract between a long-running compute job and a user who walks away. A configuration fans out into as many as a thousand independent simulations, each one takes minutes, and whoever submitted it comes back later expecting to compare them. The interesting design pressure is that compute, durability, and the user's session all live on different clocks.

First instinct is one big batch job: read the configuration, loop over the variants, write everything to a results table at the end. Looks clean. Then a single variant blows up, the whole job aborts, every successful variant before it disappears with the crash, and the wall-clock time scales with variant count instead of with whatever compute you can throw at it. The shape doesn't fit because the failure radius is wrong, not because the algorithm is.

> **Trick to Solving**
>
> If results have to outlive the run, store them per unit-of-work, not per job.
> 
> 1. Each variant is its own unit of work. A failure should affect only its own row in the result store, not the rest of the configuration.
> 2. Submission and compute live on different timelines. The user gets a job handle and reads the result store later; nothing is held open in between.
> 3. Comparison is a read pattern, not a write pattern. Lay the result store out so the cross-variant view is one query, not one job.

---

### Walk the requirements

#### Step 1: Run variants in parallel, not as one job

The wall-clock budget is set by the user, not by variant count. Treat each variant as its own task, schedule them onto independent workers, and let throughput scale with the worker pool. A thousand serial 2-minute runs take about 33 hours; the same thousand variants on a worker pool that scales out are bounded by how much compute you're willing to pay for, not by the variant count.
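
The arithmetic above can be checked with a short sketch. Everything here is illustrative: `run_variant` is a hypothetical stand-in for one simulation, the metric it returns is made up, and a local thread pool stands in for what would more likely be a fleet of separate worker processes or machines. `pooled_hours` assumes equal-length variants and no scheduling overhead.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def serial_hours(n_variants: int, minutes_each: float) -> float:
    """Wall-clock hours when variants run one after another."""
    return n_variants * minutes_each / 60


def pooled_hours(n_variants: int, minutes_each: float, workers: int) -> float:
    """Idealised wall-clock hours on a pool: equal durations, no overhead."""
    waves = math.ceil(n_variants / workers)
    return waves * minutes_each / 60


def run_variant(variant_id: int) -> dict:
    """Hypothetical stand-in for one simulation; returns a made-up metric."""
    return {"variant_id": variant_id, "fill_rate": (variant_id % 10) / 10}


# Each variant is submitted as its own task; the pool size, not the
# variant count, decides how many run at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_variant, range(1000)))

print(round(serial_hours(1000, 2), 1))       # 33.3
print(round(pooled_hours(1000, 2, 100), 2))  # 0.33
```

With 100 workers the idealised run drops from roughly 33 hours to about 20 minutes, which is the sense in which the bound becomes the compute you pay for.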

#### Step 2: Isolate per-variant results so one failure doesn't poison the rest

If every variant's output is held until a single write at the end of the job, any failure throws away every successful variant before it. Instead, write each variant as it finishes to its own object in object storage such as S3 or GCS, or to its own row or partition of a results table. A failed variant lands as a record with status 'failed' rather than as a crash that takes the rest with it, and a retry re-runs only the bad one.
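
A minimal sketch of that isolation, under stated assumptions: a temporary local directory stands in for an object store, one JSON file per variant stands in for one object, and `simulate`, the `budget` field, and the key layout `job_id/variant_id.json` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def simulate(variant: dict) -> dict:
    """Hypothetical simulation; raises on a bad input."""
    if variant["budget"] <= 0:
        raise ValueError("budget must be positive")
    return {"fill_rate": min(1.0, variant["budget"] / 1000)}


def run_and_store(job_id: str, variant: dict, store: Path) -> str:
    """Run one variant and write its own record, success or failure."""
    key = store / job_id / f"{variant['variant_id']}.json"
    key.parent.mkdir(parents=True, exist_ok=True)
    try:
        record = {"status": "succeeded", "result": simulate(variant)}
    except Exception as exc:
        # The failure becomes data in this variant's record, not a crash.
        record = {"status": "failed", "error": str(exc)}
    key.write_text(json.dumps(record))
    return record["status"]


def failed_variants(job_id: str, store: Path) -> list:
    """List the variant ids whose stored record says 'failed'."""
    paths = (store / job_id).glob("*.json")
    return sorted(
        p.stem for p in paths
        if json.loads(p.read_text())["status"] == "failed"
    )


store = Path(tempfile.mkdtemp())  # stand-in for a bucket
variants = [
    {"variant_id": "v1", "budget": 500},
    {"variant_id": "v2", "budget": 0},     # bad input
    {"variant_id": "v3", "budget": 2000},
]
statuses = [run_and_store("job-1", v, store) for v in variants]
print(statuses)                         # ['succeeded', 'failed', 'succeeded']
print(failed_variants("job-1", store))  # ['v2']
```

The bad variant in the middle does not stop `v3` from running, and `failed_variants` gives a retry exactly the set it needs to re-run.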

#### Step 3: Decouple submission from compute through a queue

The user submits a configuration, gets a job id, and reads the result store later. That's the contract. A queue between submission and the worker pool, anything from Kafka or Kinesis to a managed job queue, absorbs bursty submissions, makes the system restartable when a worker dies, and lets the user's path stay independent of the worker path.
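
The contract can be sketched in-process, assuming an in-memory `queue.Queue` in place of a durable queue and a plain dict in place of the result store. The names `submit`, `worker`, and `progress` are hypothetical, and the `reach` value is made up; a real queue would also need acknowledgements and redelivery to survive a dying worker, which this sketch does not model.

```python
import queue
import threading
import uuid

task_queue = queue.Queue()   # stand-in for a durable queue
result_store = {}            # (job_id, variant_id) -> record
store_lock = threading.Lock()


def submit(n_variants: int) -> str:
    """Enqueue one message per variant and return a job id immediately."""
    job_id = uuid.uuid4().hex
    for variant_id in range(n_variants):
        task_queue.put((job_id, variant_id))
    return job_id


def worker() -> None:
    """Pull variant tasks until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        job_id, variant_id = task
        record = {"status": "succeeded", "reach": 1000 + variant_id}
        with store_lock:
            result_store[(job_id, variant_id)] = record
        task_queue.task_done()


def progress(job_id: str, n_variants: int) -> dict:
    """What the user sees when they come back with only the job id."""
    with store_lock:
        done = sum(1 for (j, _) in result_store if j == job_id)
    return {"done": done, "pending": n_variants - done}


job_id = submit(5)              # returns before any compute has run
before = progress(job_id, 5)    # no workers started yet

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
task_queue.join()               # wait until every variant task is processed
for _ in threads:
    task_queue.put(None)        # one shutdown sentinel per worker
for t in threads:
    t.join()

after = progress(job_id, 5)
print(before, after)
```

The submit path never touches a worker and the read path never touches the queue, so either side can be scaled or restarted without the other noticing.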

---

### The shape that fits

> **What this design gives up**
>
> Per-variant results cost you a small storage premium and a slightly more complex result query. The aggregate view is now a fan-in over many records instead of a single materialised table. That's the cheaper of two prices: a serial job is faster to build but slower to run and brittle to a single bad input.
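
As a rough illustration of that fan-in, the sketch below keeps one row per variant in an in-memory SQLite table and answers the comparison with a single query. The table name `variant_results`, its columns, and every value in `rows` are invented sample data, not figures from a real run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variant_results (
        job_id     TEXT    NOT NULL,
        variant_id INTEGER NOT NULL,
        status     TEXT    NOT NULL,
        fill_rate  REAL,
        reach      INTEGER,
        PRIMARY KEY (job_id, variant_id)
    )
    """
)

# One row per variant, written as each variant finishes (sample data).
rows = [
    ("job-1", 0, "succeeded", 0.72, 41000),
    ("job-1", 1, "succeeded", 0.81, 39000),
    ("job-1", 2, "failed", None, None),
    ("job-2", 0, "succeeded", 0.55, 20000),
]
conn.executemany("INSERT INTO variant_results VALUES (?, ?, ?, ?, ?)", rows)

# Cross-variant comparison for one configuration is a single read.
ranked = conn.execute(
    """
    SELECT variant_id, fill_rate, reach
    FROM variant_results
    WHERE job_id = ? AND status = 'succeeded'
    ORDER BY fill_rate DESC
    """,
    ("job-1",),
).fetchall()

# Partial completion is visible from the same table.
summary = dict(
    conn.execute(
        "SELECT status, COUNT(*) FROM variant_results "
        "WHERE job_id = ? GROUP BY status",
        ("job-1",),
    ).fetchall()
)

print(ranked)   # [(1, 0.81, 39000), (0, 0.72, 41000)]
print(summary)
```

Keying rows on `(job_id, variant_id)` is one way to keep a retried variant from producing a duplicate row; whether a retry overwrites or versions the earlier record is a design choice the sketch leaves open.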

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A configuration of up to 1,000 variants is submitted and the results are read later.
> - Variant results have to outlive the compute that produced them so users can fetch and compare them later, and a failed variant can't take the others down with it.

> **The mistake that ships**
>
> The version that ships looks like 'we'll loop over variants in one job to keep the code simple, and write the aggregate at the end.' The first failure during a thousand-variant run wipes out hours of compute, the user reloads the dashboard, and the team adds a `--skip-failed-variants` flag that nobody trusts. The actual fix isn't a flag; it's that the result store should never have been a single end-of-job write.

---

## Common follow-up questions

- If one variant retries forever and never succeeds, how does the user see what's blocking their configuration? _(Probes how the system surfaces partial completion: failed-variant routing, retry budget, and the read shape that lets the user act on a stuck variant without rerunning the rest.)_
- What changes when one variant takes thirty seconds and another takes thirty minutes? _(Probes worker scheduling and how the slow variant doesn't block the queue or starve fast variants of capacity.)_
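
For the first follow-up, one possible shape is a retry budget that ends in a visible record instead of an endless loop. This is a sketch under assumptions: `MAX_ATTEMPTS`, `run_with_budget`, and the two fake simulations are hypothetical, and a real system would likely add backoff between attempts.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # hypothetical retry budget per variant


def run_with_budget(
    variant_id: str,
    simulate: Callable[[str], dict],
    max_attempts: int = MAX_ATTEMPTS,
) -> dict:
    """Retry a variant up to max_attempts, then record it as failed."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            result = simulate(variant_id)
            return {
                "variant_id": variant_id,
                "status": "succeeded",
                "attempts": attempt,
                "result": result,
            }
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
    # Budget exhausted: the user sees which variant is stuck and why.
    return {
        "variant_id": variant_id,
        "status": "failed",
        "attempts": max_attempts,
        "error": last_error,
    }


def always_fails(variant_id: str) -> dict:
    raise RuntimeError("inventory snapshot missing")


calls = {"n": 0}


def fails_once(variant_id: str) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("worker lost")
    return {"fill_rate": 0.5}


stuck = run_with_budget("v7", always_fails)
recovered = run_with_budget("v8", fails_once)
print(stuck["status"], stuck["attempts"], stuck["error"])
print(recovered["status"], recovered["attempts"])
```

Because the terminal record carries the attempt count and the last error, the user can act on the stuck variant without rerunning the rest of the configuration.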

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.