# Prove the Number Is Right

> Bad data in fintech is not just messy. It is expensive.

Canonical URL: <https://datadriven.io/problems/prove_the_number_is_right>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We're a personal finance platform. Customers connect their bank accounts and we show them a unified view of their spending. The data comes from dozens of partner integrations and our compliance team needs to be able to prove the numbers are accurate. Design the data pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

Aggregating from dozens of partners with different freshness, different reliability, and a compliance team that wants the numbers proven, not asserted. The trap is treating compliance as something that happens after the pipeline runs (a wiki page, a manual reconciliation) instead of as part of the pipeline. By the time the wiki gets updated, the discrepancy has been live for a month.

The simple answer is one ingestion path that pulls every partner, lands transactions, and updates balances on the same cadence. Customers see balances slowly. A partner falls behind and the whole pipeline stalls behind them. Compliance asks 'why is partner X under by some amount this week' and nobody can answer because nothing was comparing pipeline totals to partner-reported totals as the data flowed. SEC asks for validation history and the team has the latest run but not the seven-year trail.

> **Trick to Solving**
>
> Per-partner ingest with reconciliation as a gate, two freshness tiers off one source, every validation result archived for the regulator.
> 
> 1. Per-partner ingest paths so one partner's failure parks behind itself; healthy partners keep flowing.
> 2. Two cadences: a streaming path for balance updates, a batch path for transaction history.
> 3. Reconciliation against partner-reported totals is a gate inside the pipeline. Each batch compares ingested totals to partner totals and flags if outside tolerance.
> 4. Every validation result, pass or fail, writes to a queryable archive with the seven-year retention window. SEC asks the archive, not the runbook.

---

### Walk the requirements

#### Step 1: Balances within minutes, history hours

Balance updates flow through a streaming path that lands them in the app within minutes; transaction history flows through a batch path that loads to the warehouse on a slower cadence. Both feeds come off the same partner integrations, but consumed at different speeds. Customers see balances quickly without paying streaming compute on every historical row; the warehouse has time to do the heavier transformations. Without a warehouse tier the unified view has nowhere to live; without two cadences either balances are slow or every transaction pays streaming prices.

#### Step 2: Reconcile per batch against partner-reported totals

Compliance has to prove the numbers. Every per-partner batch runs a reconciliation step that compares ingested totals to the partner's reported totals; if the gap exceeds tolerance, the batch is flagged and the publish halts pending review. The reconciliation is part of the pipeline, not a post-hoc spreadsheet. Without the gate, discrepancies stay hidden until somebody complains, which is the compliance failure the requirement is calling out.

#### Step 3: Per-partner ingest so one slow partner doesn't break the others

Each partner has its own ingest path with its own sensor and its own progress. When one partner's feed is late, the orchestrator alerts on that partner; healthy partners keep flowing into the warehouse and the streaming layer. Affected customers see their last known balance (the streaming-fed store retains it) with a freshness indicator rather than an empty page. A shared queue across partners is the version where one slow partner blocks every customer's balance.

#### Step 4: Validation results archived for seven years, queryable

Every validation outcome, pass or fail, writes a row to an immutable archive with the partner id, batch id, expected and actual totals, tolerance applied, and the result. Retention is the seven-year SEC window. When SEC asks for the validation history of a partner over any window, the answer is a query on the archive. Without the archive, 'prove the numbers were checked' becomes 'we promise they were,' which is the audit response that fails.

---

### The shape that fits

> **What this design gives up**
>
> Per-partner ingest is more orchestration than one shared loader. Reconciliation as a gate halts the publish on partner discrepancies, which means more on-call attention than a quiet mismatch. The validation archive grows linearly with batches over seven years. Operational complexity is the cost; the win is balances that land in time, compliance answers that come from a query, and SEC validation history that exists.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Per-partner ingest paths so one partner's failure stays isolated.
> - A streaming path serves balances within minutes; a batch path serves the transaction warehouse on a slower cadence.
> - A reconciliation gate per partner compares ingested totals to partner-reported totals before publish.
> - Every validation result writes to an archive retained for the SEC retention window.

> **The mistake that ships**
>
> What gets built first uses one shared loader pulling all partners, no reconciliation in the pipeline, and treats validation as a manual quarterly exercise. A partner's feed is late and every customer's balance is stale. A partner's totals start drifting from what the pipeline reports and the gap goes undetected for weeks. SEC asks for the validation history and the team has reports but not records. The eventual rebuild is per-partner ingest, in-pipeline reconciliation, and a validation archive. The SEC finding lands before the rebuild does.

---

## Common follow-up questions

- A partner reports their totals daily but ingests in real-time; how does this design reconcile streaming data against a daily total? _(Tests whether the candidate sees the reconciliation as windowed: at the partner's reporting boundary (daily cutover), the gate sums the streamed-in transactions and compares to the partner's daily total. The streaming-fed balance store doesn't change; the warehouse is what gets gated.)_
- A reconciliation flag fires for a partner. What does the customer see, what does compliance see, and what does on-call do? _(Tests whether the candidate has a triage flow: customers continue to see streaming balances (which are independent of the warehouse reconciliation), compliance sees the flagged batch in the archive, on-call investigates the partner-side discrepancy. The pipeline doesn't auto-correct; it surfaces.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/prove_the_number_is_right)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.