# Two Sources of Truth

Canonical URL: <https://datadriven.io/problems/two-sources-of-truth>

Domain: Pipeline Design · Difficulty: medium · Seniority: mid

## Problem

A retail company is moving its on-prem orders database to a cloud warehouse, and analytics teams cannot lose access to current data during the months-long cutover. Design a pipeline that does the one-time bulk load of the existing 2 TB, keeps the warehouse within minutes of the live source as new orders land, and proves the two systems agree before anyone flips reporting over.

## Worked solution and explanation

### Why this problem exists in real interviews

Migration reads like 'copy the data once,' and that is the trap. The source is a moving target: 5 million rows change a day while a 2 TB dump runs for hours, so the snapshot is stale the moment it finishes. The real skill being probed is whether you split the load into a bulk backfill plus a change-capture stream, stitch them at a known log position, and refuse to cut over until the two systems are proven equal. Get the stitch wrong and you silently lose or double every order written during the load window, and nobody notices until finance reconciles the quarter.

The whiteboard answer is a nightly full reload of the orders table into the warehouse until the day you flip reporting. It works in the demo and falls apart at 2 TB: the reload takes longer than the night, dashboards trail by a full day, and on cutover day there is no evidence the warehouse actually matches the source. The fix is two paths sized for two jobs, reconciled before anyone trusts them.

> **Trick to Solving**
>
> Bulk-load once from a frozen snapshot, then let change capture carry everything after that exact point.
> 
> 1. Take the initial snapshot at a recorded write-ahead-log position. That position is the seam.
> 2. Start change capture from that same position so no row is missed and none is replayed into a gap.
> 3. Apply changes as idempotent upserts keyed on the order id, so a redelivered event is harmless.
> 4. Reconcile against the source as-of a frozen position, because comparing against a live table never converges.

---

### Walk the requirements

#### Step 1: Bulk-load the history once, off a recorded log position

A batch job reads the 2 TB into a staging area in the lake, notes the write-ahead-log position the snapshot was consistent as-of, then a load step moves the staged history into the warehouse. Streaming 2 TB row by row through change capture would take days and hammer the source; the bulk path exists precisely because the backfill is a different job from the steady state. The recorded position is what makes the next step safe.

#### Step 2: Stream every change after the seam, straight into the warehouse

Change capture reads the source's write-ahead log starting from the snapshot position and applies inserts and updates directly to the warehouse as upserts on the order id. Because it starts exactly where the snapshot ended, nothing between 'dump finished' and 'stream started' falls through a hole, and a duplicated event just overwrites a row with identical data. This path is what keeps the warehouse within minutes of live without a nightly reload.

#### Step 3: Prove parity before cutover

A reconciliation job compares source and warehouse row counts and per-day order totals as-of a frozen log position. Comparing against the live source is a moving target that never reads zero, so you freeze a position, let the stream catch up to it, then diff. Cutover is gated on that diff being empty or fully explained. Skip this and you are betting the migration on faith.

---

### The shape that fits

**The seam, the stream, and the gate**

```python
# 1) Bulk load: snapshot the 2 TB at a RECORDED log position.
#    That position is the seam between backfill and stream.
seam_lsn = source.current_wal_lsn()            # remember where the snapshot is consistent as-of
snapshot = source.export_snapshot(at=seam_lsn)  # consistent read, does not block writers
staging_lake.write_parquet(snapshot)            # land history in the lake
warehouse.bulk_load(staging_lake.path)          # one-time load of the backfill

# 2) Incremental: start CDC from the SAME position so no row
#    falls in a gap and none is replayed before the seam.
for change in source.stream_wal(start=seam_lsn):
    # idempotent upsert keyed on the order id: a redelivered
    # event overwrites a row with identical data, never duplicates.
    warehouse.upsert(
        table='orders',
        key='order_id',
        row=change.after,
        delete=(change.op == 'delete'),
    )

# 3) Parity: freeze a position, let the stream reach it, then diff.
#    Comparing against the live source never converges.
def cutover_is_safe(check_lsn):
    wait_until(lambda: warehouse.applied_lsn() >= check_lsn)
    src = source.aggregate_as_of(check_lsn)   # row counts + per-day order totals
    wh = warehouse.aggregate()
    diff = reconcile(src, wh)                 # per-table, per-day
    return diff.is_empty()                    # gate cutover on an empty (or explained) diff

```

*The whole design hinges on one recorded log position shared by the snapshot and the stream, idempotent upserts, and a parity check frozen at a position the stream has reached.*

**Nightly full reload**

Re-extracts the whole table every night. Simple to write, but at 2 TB the job outruns the night, dashboards trail a full day, and cutover ships with no proof of correctness.

**Bulk plus change capture**

Backfills once, then streams deltas. More moving parts and a log-position seam to manage, but the warehouse stays minutes-fresh and reconciliation gives a hard go/no-go signal.

> **Interviewers Watch For**
>
> The tell of a senior answer is the seam: naming that the snapshot and the change stream must share one log position, and that changes apply idempotently. Candidates who hand-wave the handoff are the ones who lose rows in production.

> **Common Pitfall**
>
> Comparing the warehouse to the live source and waiting for the diff to hit zero. It never will, because the source keeps moving while you measure. Freeze a log position, let the stream reach it, then diff as-of that point.

> **In production**
>
> Keep the legacy source authoritative and the change stream running well past cutover. Reporting points at the warehouse only after parity holds and a soak period passes, and pointing back to the legacy system is a one-line routing change, not a restore.

---

## Common follow-up questions

- The source schema changes mid-migration: a column is added to orders. What does your change-capture path do, and where does it break? _(Tests whether the candidate has a schema-evolution story for CDC and the downstream warehouse table rather than a static mapping.)_
- Parity passes overall but one day's order total is off by a few cents in the warehouse. How do you decide whether to cut over? _(Tests triage of a non-empty diff: classify it as rounding, a transform bug, or a real loss, and gate cutover on explanation rather than a raw zero.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/two-sources-of-truth)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.