# The Whiteboard Exercise

> Marker in hand. Draw the whole thing.

Canonical URL: <https://datadriven.io/problems/the_whiteboard_exercise>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

We need to build a pipeline that reads from multiple source systems, transforms the data, and loads it into a warehouse to power a daily dashboard. Design the ETL pipeline end-to-end.

## Worked solution and explanation

### Why this problem exists in real interviews

An L5 generic ETL with three properties to fit: an 8am dashboard deadline, query patterns analysts use that a raw transaction table can't serve fast, and a partner CSV that occasionally changes layout without notice. The trap is loading raw transactions and trusting a downstream view, or letting the load fail when the layout shifts.

The default reach is one nightly job that loads raw transactions into a flat table. Analysts open at 8am and slice by customer, product, channel, and date; queries scan the full table and return slowly. The partner ships a file with extra columns; the loader fails and the 8am deadline slips.

> **Trick to Solving**
>
> Pre-aggregated dimensional model for the analyst slices, partner schema-drift handling that doesn't fail the load, orchestrator owns 8am.
> 
> 1. Transformations roll raw transactions into a star (or wide aggregate) keyed on the dimensions analysts slice by; interactive queries scan a slice not the full table.
> 2. Partner schema-drift handling: extra columns are accepted (additive), renames or missing columns alert and quarantine the file rather than failing the load.
> 3. An orchestrator gates the 8am SLA with sensors firing before the deadline.

---

### Walk the requirements

#### Step 1: Land yesterday's data by 8am with alerting before

An orchestrator schedules the nightly DAG with sensors firing before 8am if any stage is at risk. Analysts open the dashboard with yesterday's data ready. Without orchestration nothing watches the deadline; without a warehouse the analyst-facing model has nowhere to live.

#### Step 2: Pre-aggregated model so analyst slices scan a slice

Analysts slice transactions by customer, product, channel, and date. The transformation rolls raw transactions into a model with rollups by those dimensions; interactive queries hit the rollups and return fast. A 'query the raw transaction table' approach is the version that scans the full table on every report; the model is what makes the slices interactive.

#### Step 3: Partner schema drift: additive accepted, others quarantine

Partner files occasionally arrive with extra columns or changed headers. The loader's contract: extra columns are accepted as nullable (additive); renames or missing columns quarantine the file with an alert rather than failing the load. The team triages the change, updates the mapping, and replays. A 'fail the load on any schema difference' approach is the version where the 8am deadline slips on every partner change; the additive-handling-plus-quarantine is the contract.

---

### The shape that fits

> **What this design gives up**
>
> The pre-aggregated model is more storage and a transformation step the team has to maintain; schema-drift handling needs the loader to know the registered schema; the orchestrator's per-stage SLA is config to maintain. Implementation cost is the price; the win is interactive analyst slices, partner additive changes that don't break the load, and the 8am deadline owned by the orchestrator.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer schedules the nightly load with sensors firing before the 8am dashboard SLA.
> - A warehouse holds the analyst-facing model with pre-aggregated rollups for the common slices.
> - Partner schema drift handles additive changes gracefully and quarantines unrecognized changes with an alert.

> **The mistake that ships**
>
> What gets shipped runs one nightly load into a flat transaction table and lets every analyst report scan everything. Queries are slow; the partner ships a file with extra columns and the loader fails; 8am slips. The eventual rebuild adds the dimensional model, schema-drift handling, and SLA-aware orchestration.

---

## Common follow-up questions

- A new analyst slice (customer + region + week) becomes routine. What in this design lets it perform without scanning the full table? _(Tests whether the candidate sees the dimensional model as the surface for new slices: a new rollup key added to the model, the existing rollups untouched. The transformation is the place where new slices land; the warehouse stays partitioned by date with cluster keys for the common dimensions.)_
- The partner ships a file that's mostly the same but adds a new column with data the team wants. How does this design absorb it? _(Tests whether the candidate sees the additive contract: the new column flows through the loader as nullable, lands in the staged lake, and the team adds it to the dimensional model when ready. The 8am load doesn't fail; the new column appears in dashboards on the next deploy.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_whiteboard_exercise)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.